Bruce,
Could a short-term (and possibly compromise) solution to your performance problem be to offer only the first 3k of these large 200k docs to the highlighter, in order to minimize the amount of tokenization required? Arguably the most relevant part of a document is typically in the first 1k anyway.
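Something along these lines might do it (a rough, untested sketch against the current Lucene Highlighter API; the "contents" field name and the three-fragment summary are just placeholders):

  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.search.highlight.Highlighter;

  public class TruncatedHighlight {
      // Only the first ~3k chars of the stored text are ever tokenized.
      static final int MAX_CHARS = 3 * 1024;

      static String summarize(Highlighter highlighter, Analyzer analyzer,
                              String text) throws Exception {
          String head = text.length() > MAX_CHARS
                  ? text.substring(0, MAX_CHARS) : text;
          // getBestFragments tokenizes only the truncated prefix, so the
          // cost per document is bounded regardless of its full size.
          String[] frags =
                  highlighter.getBestFragments(analyzer, "contents", head, 3);
          return String.join("...", frags);
      }
  }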
Or perhaps the highlighter could be changed to stop tokenizing a document after 1000 tokens, once enough fragments have been found to produce a summary. That way, if there are hits in the first part of the document (as there usually are for high-scoring hits), the time to compute the summary is bounded by something well under the document size.
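A simpler variant of that idea, which caps the token count unconditionally rather than checking whether enough fragments have already been found, could be done with a small filter wrapped around the analyzer's output. A hand-rolled sketch against the current TokenStream API (recent Lucene also ships a LimitTokenCountFilter that does much the same thing):

  import java.io.IOException;
  import org.apache.lucene.analysis.TokenFilter;
  import org.apache.lucene.analysis.TokenStream;

  final class CappedTokenFilter extends TokenFilter {
      private final int maxTokens;
      private int seen = 0;

      CappedTokenFilter(TokenStream in, int maxTokens) {
          super(in);
          this.maxTokens = maxTokens;
      }

      @Override
      public boolean incrementToken() throws IOException {
          // Stop pulling tokens from the wrapped stream once the cap is
          // reached, bounding tokenization cost per document.
          if (seen >= maxTokens || !input.incrementToken()) {
              return false;
          }
          seen++;
          return true;
      }

      @Override
      public void reset() throws IOException {
          super.reset();
          seen = 0;
      }
  }

The capped stream could then be handed straight to Highlighter.getBestFragments(TokenStream, String, int), e.g. wrapping analyzer.tokenStream("contents", text) in a CappedTokenFilter with a 1000-token limit.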
Doug