On Jan 14, 2008, at 4:49 PM, Mark Miller wrote:
While the overall framework of LUCENE-663 appears similar to the
current contrib Highlighter, the code is actually quite different
and I do not think it handles as many corner cases in its current
state. LUCENE-663 supports PhraseQueries by implementing 'special'
search logic that inspects positional information to make sure the
Tokens from a PhraseQuery occur in order. I am not sure how closely
this logic matches Lucene's own PhraseQuery search logic, but a
cursory look makes me think it's not complete. It almost looks to me
as if it only handles in-order matching with simple slop (not edit
distance)... I am too lazy to check further, though, and I may have
missed something. Also, LUCENE-663 does not support SpanQueries.
This patch differs in that it fits the current Highlighter framework
without modifying it, and it uses Lucene's own internal search logic
to identify Spans for highlighting. PhraseQueries are handled by a
SpanQuery approximation.
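Roughly, that approximation amounts to mapping a phrase's terms and slop onto an ordered span-near query. A minimal sketch of the mapping, using invented stand-in types rather than Lucene's actual PhraseQuery and SpanNearQuery classes; note that span-near slop is not the edit-distance slop PhraseQuery uses, which is why this can only ever be an approximation:

```java
import java.util.List;

// Hypothetical stand-ins for the Lucene query classes, just to show the
// shape of the mapping; the real patch would build an
// org.apache.lucene.search.spans.SpanNearQuery from the PhraseQuery's terms.
record Phrase(List<String> terms, int slop) {}
record SpanNear(List<String> clauses, int slop, boolean inOrder) {}

class PhraseToSpan {
    // Approximate a phrase query with an in-order span-near query:
    // same terms, same slop value, ordered matching. The slop semantics
    // differ (position window vs. edit distance), so this is approximate.
    static SpanNear approximate(Phrase p) {
        return new SpanNear(p.terms(), p.slop(), true);
    }
}
```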
As far as PhraseQuery/SpanQuery highlighting goes, I don't think any
of the other Highlighter packages offer much. I think that things could
be done a little faster, but that would require abandoning the
current framework, and with all of the corner cases it now supports,
I'd hate to see that.
The other Highlighter code that is worth consideration is
LUCENE-644. It does abandon the current Highlighter framework and
takes an approach that is much more efficient for larger documents:
instead of spinning through all of the document tokens and comparing
them to query tokens, LUCENE-644 just looks at the tokens from the
query and grabs the original text using the offsets from those
tokens. This is darn fast, but it doesn't work well with positional
highlighting, and I wonder how well it supports all
of the corner cases that arise with overlapping tokens and whatnot.
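To make the offset-driven approach concrete, here is a toy sketch (all names invented, no actual Lucene or LUCENE-644 APIs): given character offsets already recorded for the query's matched terms, the original text can be decorated directly, with no second pass over the document's token stream:

```java
import java.util.List;

// Illustration only: wrap pre-recorded character ranges of the original
// text in <B> tags. In the real patch the offsets would come from stored
// term vectors for the query's terms.
class OffsetHighlighter {
    record Offset(int start, int end) {}

    // Offsets are assumed sorted by start and non-overlapping; handling
    // overlapping tokens is exactly the corner case questioned above.
    static String highlight(String text, List<Offset> hits) {
        StringBuilder out = new StringBuilder();
        int last = 0;
        for (Offset o : hits) {
            out.append(text, last, o.start())      // untouched prefix
               .append("<B>")
               .append(text, o.start(), o.end())   // the matched term
               .append("</B>");
            last = o.end();
        }
        return out.append(text.substring(last)).toString();
    }
}
```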
Hmm, I'm beginning to think that the performance issue may be overcome
to some extent with the new TermVectorMapper stuff. The basic idea is
that you construct a highlighter that does the appropriate
highlighting as the term vector is being loaded from disk, through the
map function. This would save having to go back through all the tokens a
second time, but probably has other issues. It's just a thought in my
head at this point. At a minimum, I think the TVM could speed up the
TokenSources part that creates the TokenStream based on the TermVector.
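For illustration only, a mapper along these lines might look like the following sketch; the callback shape loosely imitates TermVectorMapper's map-per-term contract, but every name here is an invented stand-in, not actual Lucene code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Sketch of the idea: collect highlight offsets while the term vector is
// being read from disk, so no second pass over the tokens is needed.
class HighlightingMapper {
    record Offset(int start, int end) {}

    private final Set<String> queryTerms;
    final List<Offset> hits = new ArrayList<>();

    HighlightingMapper(Set<String> queryTerms) {
        this.queryTerms = queryTerms;
    }

    // Invoked once per term as the term vector is loaded; if the term is
    // part of the query, remember where in the text it appeared.
    void map(String term, int freq, List<Offset> offsets) {
        if (queryTerms.contains(term)) {
            hits.addAll(offsets);
        }
    }
}
```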
At any rate, I am going to think some more on it.
-Grant