On Jan 14, 2008, at 4:49 PM, Mark Miller wrote:

While the overall framework of LUCENE-663 appears similar to the current contrib Highlighter, the code is actually quite different, and I do not think it handles as many corner cases in its current state. LUCENE-663 supports PhraseQuerys by implementing 'special' search logic that inspects positional information to make sure the Tokens from a PhraseQuery are in order. I am not sure how exact this logic is compared to Lucene's PhraseQuery search logic, but a cursory look makes me think it's not complete. It almost looks to me like it only does in-order matching with simple slop (not edit distance)...I am too lazy to check further though, and I may have missed something. Also, LUCENE-663 does not support Span queries.
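To make the "in-order with simple slop" distinction concrete, here is a toy sketch of that kind of position check for a two-term phrase. This is my own illustration, not LUCENE-663's actual code; the class and method names are made up:

```java
// Toy sketch (not LUCENE-663's actual code): check whether two query terms
// occur in order within a given slop, using per-term token position lists
// like the positional information a PhraseQuery consults.
public class InOrderMatcher {
    // positionsA/positionsB: sorted token positions of term A and term B.
    // Returns true if some occurrence of B follows an occurrence of A
    // with at most 'slop' intervening positions.
    public static boolean matchesInOrder(int[] positionsA, int[] positionsB, int slop) {
        for (int a : positionsA) {
            for (int b : positionsB) {
                if (b > a && (b - a - 1) <= slop) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

Note this only ever matches in order; real edit-distance slop (as in Lucene's sloppy PhraseQuery) would also allow out-of-order matches within the slop budget, which is exactly what this simple check misses.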

This patch differs in that it fits the current Highlighter framework without modifying it, and it uses Lucene's own internal search logic to identify Spans for highlighting. PhraseQueries are handled by a SpanQuery approximation.

As far as PhraseQuery/SpanQuery highlighting goes, I don't think any of the other Highlighter packages offer much. I think things could be done a little faster, but that would require abandoning the current framework, and with all of the corner cases it now supports, I'd hate to see that support lost.

The other Highlighter code that is worth consideration is LUCENE-644. It does abandon the current Highlighter framework and takes an approach that is much more efficient for larger documents: instead of attacking the problem by spinning through all of the document tokens and comparing them to query tokens, 644 just looks at the tokens from the query and grabs the original text using the offsets from those tokens. This is darn fast, but it doesn't go well with positional highlighting, and I wonder how well it supports all of the corner cases that arise with overlapping tokens and whatnot.
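Roughly, the offset-driven idea looks like this. This is a simplified, self-contained sketch with illustrative names, not 644's actual classes: given the (startOffset, endOffset) pairs recorded for the matching query terms, splice highlight tags straight into the original text rather than re-tokenizing the document:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of offset-based highlighting (illustrative, not
// LUCENE-644's real code): splice tags into the original text using the
// term offsets, with no second pass over the document's tokens.
public class OffsetHighlighter {
    // offsets: each entry is {startOffset, endOffset} for one term hit.
    public static String highlight(String text, List<int[]> offsets) {
        // Sort by start offset, then walk backwards so earlier insertions
        // do not shift the offsets still waiting to be applied.
        offsets.sort((a, b) -> Integer.compare(a[0], b[0]));
        StringBuilder sb = new StringBuilder(text);
        for (int i = offsets.size() - 1; i >= 0; i--) {
            int[] o = offsets.get(i);
            sb.insert(o[1], "</B>");
            sb.insert(o[0], "<B>");
        }
        return sb.toString();
    }
}
```

The weakness mentioned above shows up here: nothing in this loop knows about token positions, so phrase/Span-style "only highlight terms that match together" logic doesn't fall out naturally, and overlapping offsets would need extra care.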

Hmm, I'm beginning to think that the performance issue may be overcome to some extent with the new TermVectorMapper stuff. The basic idea is that you construct a highlighter that does the appropriate highlighting as the term vector is being loaded from disk, through the map function. This would save having to go back through all the tokens a second time, but it probably has other issues. It's just a thought in my head at this point. At a minimum, I think the TVM could speed up the TokenSources part that creates the TokenStream based on the TermVector.
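Something like this is what I have in mind. This is a rough sketch with a hypothetical map signature loosely modeled on TermVectorMapper, not the real Lucene API: the mapper collects offsets for matching query terms as each term arrives from disk, so there is no second pass over a rebuilt TokenStream:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Rough sketch of the single-pass idea (hypothetical interface, loosely
// modeled on TermVectorMapper; not the real Lucene API): match query terms
// and record their offsets while the term vector is being loaded.
public class StreamingHighlightMapper {
    private final Set<String> queryTerms;
    private final List<int[]> matchOffsets = new ArrayList<>();

    public StreamingHighlightMapper(Set<String> queryTerms) {
        this.queryTerms = new HashSet<>(queryTerms);
    }

    // Called once per term as the term vector is read from disk.
    // startOffsets/endOffsets are parallel arrays, one entry per occurrence.
    public void map(String term, int[] startOffsets, int[] endOffsets) {
        if (queryTerms.contains(term)) {
            for (int i = 0; i < startOffsets.length; i++) {
                matchOffsets.add(new int[]{startOffsets[i], endOffsets[i]});
            }
        }
    }

    // Offsets collected during loading, ready to drive highlighting.
    public List<int[]> getMatchOffsets() {
        return matchOffsets;
    }
}
```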

At any rate, I am going to think some more on it.

-Grant
