Kevin A. Burton wrote:
I'm playing with this package:

http://home.clara.net/markharwood/lucene/highlight.htm

Trying to do hit highlighting. This implementation uses another Analyzer to find the positions for the result terms.
This seems very inefficient

Does it just seem inefficient, or is it actually too inefficient in practice? Folks have benchmarked this, and for documents shorter than 10k characters or so, re-tokenizing is fast enough. It can be slow, though, if most of your documents are longer than that.


Several solutions have been proposed. The simplest is to not scan past the first 10k or so for snippets unless nothing relevant is found in the first 10k. I don't think Mark's highlighter yet does this, but I might be mistaken.
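To make the idea concrete, here is a rough sketch in Python rather than Java, just to keep it short. It is not code from Mark's highlighter or from Lucene. The names `SCAN_LIMIT`, `tokenize` and `find_hits` are made up for illustration, the regex tokenizer only stands in for a real Analyzer, and the cutoff is the "10k or so" figure from above.

```python
import re

SCAN_LIMIT = 10_000  # assumed cutoff, taken from the "10k or so" figure


def tokenize(text):
    """Lazily yield (term, start_offset, end_offset); a stand-in for an Analyzer."""
    for m in re.finditer(r"\w+", text):
        yield m.group().lower(), m.start(), m.end()


def find_hits(text, query_terms, limit=SCAN_LIMIT):
    """Return character spans of query terms, scanning past `limit` only if needed."""
    terms = {t.lower() for t in query_terms}
    hits = []
    for term, start, end in tokenize(text):
        # Once we are past the limit and already have something to show, stop.
        if start >= limit and hits:
            break
        if term in terms:
            hits.append((start, end))
    return hits
```

Because the tokenizer is a generator, nothing past the cutoff gets tokenized when the first 10k already contain a hit. If they don't, this version keeps going and stops after the first hit it finds further on; whether to stop there or collect more is a design choice, not something the highlighter dictates.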

since lucene already knows the frequency and position of given terms in the index.

Lucene indexes record that a term is the nth term, not that it occurs at the nth character in the text. The latter is needed for highlighting, but storing this would make indexes much larger and slower to update.
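A toy illustration of the difference, again in Python and again not Lucene code (`positions_and_offsets` is a hypothetical helper, and the regex tokenizer is only an assumption standing in for an Analyzer):

```python
import re


def positions_and_offsets(text):
    """Return (term, position, start_offset, end_offset) for each token."""
    return [
        (m.group().lower(), pos, m.start(), m.end())
        for pos, m in enumerate(re.finditer(r"\w+", text))
    ]


# Note the double space after "Fast": it shifts offsets but not positions.
tokens = positions_and_offsets("Fast  hit highlighting")
```

Here "hit" is the term at position 1, but it sits at characters 6 to 9. The position alone tells you nothing about where to put the highlight in the original text, which is why the highlighter has to re-tokenize.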


Doug
