>> Another approach that someone mentioned for solving this problem is to
>> create a fragment index for long documents.
Alternatively, could you use term sequence positions to guess where to start extracting text from the doc? If you have identified the best section of the doc purely from clusters of term positions, you can then derive a minimum offset into the doc by summing the lengths of all the preceding term texts. That offset could be used to avoid tokenizing all the preamble: you would simply tokenize from the chosen offset until you had identified the run of terms that matched your best cluster sequence. I'm not sure whether the TermVector support provides the necessary APIs to take this approach, though.
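To make the idea concrete, here is a rough sketch of the offset estimate. This is purely illustrative: OffsetEstimator, its inputs, and the per-gap separator width are all assumptions, not a real TermVector API, and the result is only a lower bound since untokenized characters (punctuation, stop words dropped by the analyzer) are not counted.

```java
import java.util.Arrays;
import java.util.List;

public class OffsetEstimator {

    /**
     * Estimate a minimum character offset to begin tokenizing from.
     *
     * @param termsByPosition  the term text at each token position (hypothetical
     *                         data you would pull from term vectors)
     * @param bestClusterStart first token position of the best-scoring cluster
     * @param avgSeparator     assumed average separator width between tokens
     */
    static int estimateCharOffset(List<String> termsByPosition,
                                  int bestClusterStart,
                                  int avgSeparator) {
        int offset = 0;
        // Sum the lengths of all term texts preceding the cluster,
        // plus an assumed separator width for each gap.
        for (int pos = 0; pos < bestClusterStart; pos++) {
            offset += termsByPosition.get(pos).length() + avgSeparator;
        }
        return offset;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("the", "quick", "brown", "fox", "jumps");
        // Best cluster starts at position 3 ("fox"):
        // (3+1) + (5+1) + (5+1) = 16
        System.out.println(estimateCharOffset(terms, 3, 1));
    }
}
```

You would then tokenize the document text starting near that offset instead of from character zero, stopping once the term run matching the chosen cluster has been found.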