Re: [jira] Commented: (LUCENE-794) Extend contrib Highlighter to properly support phrase queries and span queries

Mark Miller Mon, 14 Jan 2008 13:50:19 -0800

While the overall framework of LUCENE-663 appears similar to the current contrib Highlighter, the code is actually quite different and I do not think it handles as many corner cases in its current state. LUCENE-663 supports PhraseQuerys by implementing 'special' search logic that inspects positional information to make sure the Tokens from a PhraseQuery are in order. I am not sure how exact this logic is compared to Lucene's PhraseQuery search logic, but a cursory look makes me think it's not complete. It almost looks to me that it only does in-order matching with simple slop (not edit distance)... I am too lazy to check further though, and I may have missed something. Also, LUCENE-663 does not support Span queries.
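To make the "in-order with simple slop" versus "edit distance" distinction concrete, here is a toy sketch for a two-term phrase "a b". It is not code from LUCENE-663 or from this patch; the class and method names are made up for illustration. The second method is only a simplified model of slop as a count of position moves, based on the PhraseQuery.setSlop documentation, which notes that switching the order of two words takes two moves.

```java
// Toy comparison for a two-term phrase "a b". Hypothetical names; not taken from any patch.
final class SlopSketch {

    // In-order check: b must come after a, with at most `slop` tokens in between.
    static boolean inOrderMatch(int posA, int posB, int slop) {
        return posB > posA && (posB - posA - 1) <= slop;
    }

    // Move-counting check (simplified model): shift each document position by the
    // term's offset in the query (a is at 0, b is at 1) and compare the results.
    // Order does not matter here, only how far the terms have to move.
    static boolean moveCountMatch(int posA, int posB, int slop) {
        int shiftedA = posA;      // a sits at query offset 0
        int shiftedB = posB - 1;  // b sits at query offset 1
        return Math.abs(shiftedA - shiftedB) <= slop;
    }

    public static void main(String[] args) {
        // Document "b a": b at position 0, a at position 1, query "a b" with slop 2.
        System.out.println(inOrderMatch(1, 0, 2));   // in-order logic rejects the reversal
        System.out.println(moveCountMatch(1, 0, 2)); // move-counting logic accepts it
    }
}
```

A highlighter that only implements the first check would fail to highlight reversed terms that the search itself matched, which is the kind of corner case in question.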

This patch differs in that it fits the current Highlighter framework without modifying it, and it uses Lucene's own internal search logic to identify Spans for highlighting. PhraseQueries are handled by a SpanQuery approximation.

As far as PhraseQuery/SpanQuery highlighting, I don't think any of the other Highlighter packages offer much. I think that things could be done a little faster, but that would require abandoning the current framework, and with all of the corner cases it now supports, I'd hate to see that.

The other Highlighter code that is worth consideration is LUCENE-644. It does abandon the current Highlighter framework and goes with an attack that is much more efficient for larger documents: instead of attacking the problem by spinning through all of the document tokens and comparing to query tokens, 644 just looks at the tokens from the query and grabs the original text using the offsets from those tokens. This is darn fast, but doesn't go well with positional highlighting, and I wonder how well it supports all of the corner cases that arise with overlapping tokens and whatnot.
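As a rough illustration of the offset-driven idea (not the LUCENE-644 code itself), the sketch below assumes the character offsets of the matching query terms are already known, for example from stored term vectors, and simply splices markup into the original text. The class name, the {start, end} array layout and the <B> tags are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: highlight directly from character offsets of matched terms,
// without re-tokenizing the whole document.
final class OffsetHighlightSketch {

    // Each hit is {startOffset, endOffset} into `text`, end exclusive.
    static String highlight(String text, List<int[]> hits) {
        List<int[]> sorted = new ArrayList<>(hits);
        sorted.sort((a, b) -> Integer.compare(a[0], b[0]));

        StringBuilder out = new StringBuilder();
        int cursor = 0;
        for (int[] hit : sorted) {
            if (hit[0] < cursor) {
                continue; // overlapping hit: skipped here, a real implementation must decide how to merge
            }
            out.append(text, cursor, hit[0]);                          // untouched text before the hit
            out.append("<B>").append(text, hit[0], hit[1]).append("</B>");
            cursor = hit[1];
        }
        out.append(text, cursor, text.length());                       // trailing text after the last hit
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "the quick brown fox";
        List<int[]> hits = Arrays.asList(new int[] {16, 19}, new int[] {4, 9});
        System.out.println(highlight(text, hits));
    }
}
```

The work is proportional to the number of hits rather than the number of document tokens, which is where the speed comes from. The `continue` on overlapping hits marks exactly the kind of corner case that needs real handling, and nothing here knows about positions, so phrase-aware highlighting does not come for free.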


- Mark

Grant Ingersoll (JIRA) wrote:

[ https://issues.apache.org/jira/browse/LUCENE-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558784#action_12558784 ]

Grant Ingersoll commented on LUCENE-794:
----------------------------------------

How should this relate to LUCENE-663?  Seems like that one also covers other 
kinds of queries?  I'm no expert in highlighting, but it seems like there are at 
least 3 different issues in JIRA for enabling things like phrase queries, etc.  
Should we try to consolidate these?

Extend contrib Highlighter to properly support phrase queries and span queries
------------------------------------------------------------------------------

                Key: LUCENE-794
                URL: https://issues.apache.org/jira/browse/LUCENE-794
            Project: Lucene - Java
         Issue Type: Improvement
         Components: Other
           Reporter: Mark Miller
           Priority: Minor
        Attachments: spanhighlighter.patch, spanhighlighter10.patch, 
spanhighlighter11.patch, spanhighlighter12.patch, spanhighlighter2.patch, 
spanhighlighter3.patch, spanhighlighter5.patch, spanhighlighter6.patch, 
spanhighlighter7.patch, spanhighlighter8.patch, spanhighlighter9.patch, 
spanhighlighter_patch_4.zip


This patch adds a new Scorer class (SpanQueryScorer) to the Highlighter package 
that scores just like QueryScorer, but scores a 0 for Terms that did not cause 
the Query hit. This gives 'actual' hit highlighting for the range of SpanQuerys 
and PhraseQuery. There is also a new Fragmenter that attempts to fragment 
without breaking up Spans.
See http://issues.apache.org/jira/browse/LUCENE-403 for some background.
There is a dependency on MemoryIndex.
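
The scoring rule described above can be sketched in plain Java, without any Lucene classes. This is not the SpanQueryScorer from the patch; the names are hypothetical, the spans are assumed to have been computed already (the patch gets them from Lucene's own span matching), and each span is a {start, end} pair of token positions with end exclusive.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of "score 0 for Terms that did not cause the hit".
final class SpanGatedScoreSketch {

    // A token scores 1 only if it is a query term AND its position falls inside
    // one of the matching spans; otherwise it scores 0 and is not highlighted.
    static float score(String term, int position, Set<String> queryTerms, List<int[]> spans) {
        if (!queryTerms.contains(term)) {
            return 0f;
        }
        for (int[] span : spans) {
            if (position >= span[0] && position < span[1]) {
                return 1f;
            }
        }
        return 0f;
    }

    public static void main(String[] args) {
        // Document tokens: fox(0) quick(1) brown(2) fox(3) jumps(4); phrase query "brown fox".
        Set<String> queryTerms = new HashSet<>(Arrays.asList("brown", "fox"));
        List<int[]> spans = Arrays.asList(new int[] {2, 4}); // the phrase matched at positions 2..3

        String[] tokens = {"fox", "quick", "brown", "fox", "jumps"};
        for (int pos = 0; pos < tokens.length; pos++) {
            System.out.println(tokens[pos] + "@" + pos + " -> " + score(tokens[pos], pos, queryTerms, spans));
        }
    }
}
```

A purely term-based scorer would also give the first "fox" a non-zero score; gating on the span is what keeps it from being highlighted.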


