[ 
https://issues.apache.org/jira/browse/LUCENE-4734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Lauck updated LUCENE-4734:
-------------------------------

    Attachment: lucene-fvh-slop.patch

Tricky problem! I created a patch and updated my test cases (I deleted the old 
test case patch).

I'd appreciate any feedback. My solution seems robust and passes all 
highlighter test cases, but I took a slightly different approach to finding the 
longest matching phrase.

Also, a bonus idea:
The current addIfNoOverlap method assumes that overlapping highlights are never 
wanted and discards them. A better approach might be to let the user provide a 
delegate that can add or modify overlapping WeightedPhraseInfo (WPI) entries. 
Some possible implementations:

* first only - current behavior
* merge - creates a single WPI that covers all overlaps
* split - gives priority to the first/existing WPI and adds whatever is left 
of the overlapping WPI
* longest - gives priority to the longest WPI and drops any overlaps
* include all - this covers my use case: return only the offsets of all 
highlights in a document and perform the highlighting and overlap handling 
myself at a later stage.
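
To make the delegate idea concrete, here is a rough sketch of what such a hook 
could look like. None of these names exist in Lucene: Span is a hypothetical 
stand-in for WeightedPhraseInfo (reduced to a half-open offset range), and 
OverlapPolicy and OverlapPolicies are names made up for illustration. The 
"split" strategy is left out to keep the sketch short, and the merge strategy 
assumes the already accepted spans do not overlap each other.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for WeightedPhraseInfo: a half-open offset range [start, end).
final class Span {
    final int start;
    final int end;

    Span(int start, int end) {
        this.start = start;
        this.end = end;
    }

    boolean overlaps(Span other) {
        return start < other.end && other.start < end;
    }

    int length() {
        return end - start;
    }
}

// Hypothetical delegate: decides what happens to a candidate that may overlap accepted spans.
interface OverlapPolicy {
    void add(List<Span> accepted, Span candidate);
}

final class OverlapPolicies {
    private OverlapPolicies() {}

    // "first only": drop the candidate if it overlaps anything already accepted.
    static final OverlapPolicy FIRST_ONLY = (accepted, candidate) -> {
        for (Span s : accepted) {
            if (s.overlaps(candidate)) {
                return;
            }
        }
        accepted.add(candidate);
    };

    // "merge": replace the candidate and every span it overlaps with one covering span.
    static final OverlapPolicy MERGE = (accepted, candidate) -> {
        int start = candidate.start;
        int end = candidate.end;
        List<Span> kept = new ArrayList<>();
        for (Span s : accepted) {
            if (s.overlaps(candidate)) {
                start = Math.min(start, s.start);
                end = Math.max(end, s.end);
            } else {
                kept.add(s);
            }
        }
        kept.add(new Span(start, end));
        accepted.clear();
        accepted.addAll(kept);
    };

    // "longest": keep the candidate only if it is strictly longer than every span it overlaps.
    static final OverlapPolicy LONGEST = (accepted, candidate) -> {
        for (Span s : accepted) {
            if (s.overlaps(candidate) && s.length() >= candidate.length()) {
                return;
            }
        }
        accepted.removeIf(s -> s.overlaps(candidate));
        accepted.add(candidate);
    };

    // "include all": keep everything and leave overlap handling to the caller.
    static final OverlapPolicy INCLUDE_ALL = (accepted, candidate) -> accepted.add(candidate);
}
```

With the example text from the issue (A B C D E F G), "B E" covers offsets 
[2, 9) and "C F" covers [4, 11). Under FIRST_ONLY only [2, 9) survives, MERGE 
yields a single [2, 11), and INCLUDE_ALL keeps both.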
                
> FastVectorHighlighter Overlapping Proximity Queries Do Not Highlight
> --------------------------------------------------------------------
>
>                 Key: LUCENE-4734
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4734
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/highlighter
>    Affects Versions: 4.0, 4.1
>            Reporter: Ryan Lauck
>              Labels: fastvectorhighlighter, highlighter
>         Attachments: lucene-fvh-slop.patch
>
>
> If a proximity phrase query overlaps with any other query term it will not be 
> highlighted.
> Example Text:  A B C D E F G
> Example Queries: 
> "B E"~10 D
> (only D will be highlighted)
> "B E"~10 "C F"~10
> (neither phrase will be highlighted)
> This can be traced to the FieldPhraseList constructor's inner while loop. 
> From the first example query, the first TermInfo popped off the stack will be 
> "B". The second TermInfo will be "D" which will not be found in the submap 
> for "B E"~10 and will trigger a failed match.
> I wanted to report this issue before digging into a solution but my first 
> thought is:
> Add an additional int property to QueryPhraseMap to store the maximum 
> possible phrase width for each term based on any proximity searches it is 
> part of (defaulting to zero, in the above examples it would be 10). 
> If a term is popped off the stack that is not a part of a proximity phrase 
> being matched ( currMap.getTermMap(ti.getText()) == null ), it is added to a 
> temporary list until either the longest possible phrase is successfully 
> matched or a term is found outside the maximum possible phrase width.
> After this search is complete, any non-matching terms that were added to the 
> temporary list are pushed back onto the stack to be evaluated again and the 
> temp list is cleared.
> Hopefully this makes sense, and if nobody sees any obvious flaws I may try to 
> create a patch.
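
The buffering idea from the original description can be sketched roughly as 
follows. This is a simplified illustration, not the code in the attached patch 
and not the real FieldPhraseList logic: Term is a hypothetical stand-in for 
TermInfo, SlopMatchSketch and matchPhrase are made-up names, only a single 
phrase is considered, and maxWidth is treated as a plain position distance from 
the first term, which is looser than Lucene's actual slop semantics.

```java
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical, simplified stand-in for TermInfo: a term and its position.
final class Term {
    final String text;
    final int position;

    Term(String text, int position) {
        this.text = text;
        this.position = position;
    }
}

final class SlopMatchSketch {
    private SlopMatchSketch() {}

    // Tries to match `phrase` starting at the head of `stack` (head = lowest position).
    // Returns the matched terms, or an empty list if the phrase could not be completed.
    // Terms that were popped but are not part of a successful match are pushed back
    // in their original order so they can be evaluated again.
    static List<Term> matchPhrase(Deque<Term> stack, List<String> phrase, int maxWidth) {
        List<Term> matched = new ArrayList<>();
        Term first = stack.peek();
        if (first == null || !first.text.equals(phrase.get(0))) {
            return matched; // head does not start the phrase; nothing is consumed
        }
        stack.pop();
        matched.add(first);

        // Every term popped after the first one, in pop order (the "temporary list").
        List<Term> consumed = new ArrayList<>();
        while (matched.size() < phrase.size() && !stack.isEmpty()) {
            Term next = stack.peek();
            if (next.position - first.position > maxWidth) {
                break; // outside the maximum possible phrase width; leave it on the stack
            }
            stack.pop();
            consumed.add(next);
            if (next.text.equals(phrase.get(matched.size()))) {
                matched.add(next);
            }
        }

        boolean success = matched.size() == phrase.size();
        // Push back in reverse pop order so the stack regains its original ordering.
        for (int i = consumed.size() - 1; i >= 0; i--) {
            Term t = consumed.get(i);
            if (!success || !matched.contains(t)) {
                stack.push(t);
            }
        }
        if (!success) {
            matched.clear();
        }
        return matched;
    }
}
```

For the first example query, the stack holds B, D and E in that order. Matching 
"B E" with a width of 10 returns B and E and leaves D on the stack, so D can 
still be highlighted on the next pass.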

