>>Another approach that someone mentioned for solving this problem is to create a 
>>fragment index for long documents.

Alternatively, could you use term sequence positions to guess where to start 
extracting text from the doc?
If you have identified the best section of the doc based purely on identifying 
clusters of term positions, you can then derive a minimum character offset into 
the doc by summing the lengths of all the preceding term texts. This offset could 
be used to avoid tokenizing all the preamble: you would simply need to tokenize 
from the chosen offset until you had identified the run of terms that matched your 
best cluster sequence.
I'm not sure whether the TermVector support provides the necessary APIs to take 
this approach, though?
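To illustrate the idea (not Lucene's actual API -- the helper names, the toy term-vector dict, and the regex tokenizer here are all hypothetical), the sketch below computes the lower-bound offset from a term vector that stores terms and positions but no character offsets, then tokenizes only the tail of the document from that offset:

```python
import re

def term_lengths_by_position(terms_with_positions, num_positions):
    """Rebuild a per-position term-length list from a term vector
    shaped like {term: [positions]} (terms + positions, no offsets)."""
    lengths = [0] * num_positions
    for term, positions in terms_with_positions.items():
        for pos in positions:
            lengths[pos] = len(term)
    return lengths

def min_offset(lengths, cluster_start):
    # Lower bound on the character offset of the best cluster: the total
    # length of all preceding terms. It undercounts because it ignores
    # whitespace, punctuation, and any tokens the analyzer dropped.
    return sum(lengths[:cluster_start])

def tokens_from(text, offset):
    # Tokenize only from the estimated offset onward, recording absolute
    # character positions, until the caller finds the matching run.
    return [(offset + m.start(), m.group().lower())
            for m in re.finditer(r"\w+", text[offset:])]
```

Scanning forward from `min_offset` rather than from zero is what skips the preamble; the first token or two after the offset may be a partial word, but the run matching the best cluster sequence still appears at its true character positions.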


