[ 
https://issues.apache.org/jira/browse/LUCENE-8121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Smiley updated LUCENE-8121:
---------------------------------
    Attachment: LUCENE-2287_UH_SpanCollector.patch

* Added Passage.toString, useful for debugging and in tests
* Rewrote a large chunk of my last patch in PhraseHelper.  I want to prevent 
the same term in different SpanQueries from yielding two OffsetsEnums for the 
same term with different freqs.  I could get into the nitty-gritty, but anyone 
who is curious can just read the (commented) patch.  I removed the two methods 
I had taken from Luwak since this refactoring didn't mesh with the API contract.
* I resolved the nocommits related to offset storage, principally by having 
the value side of the map be the SpanCollectedOffsetsEnum, which I modified a 
bit so that it is no longer immutable: the collector adds to it, and after 
that it isn't modified.  I use postingsEnum.freq() to size the int arrays, so 
no resizing is needed.  I'm really happy with that versus some other things I 
tried.  In the future it shouldn't be hard to add payload support.
* The patch has a bunch of changes to TestUnifiedHighlighter & 
TestUnifiedHighlighterMTQ which are improvements to test randomization and not 
strictly for this patch.
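
For anyone who wants the gist of the offset-storage bullet without opening the 
patch, here is a minimal sketch in plain Java.  It is not the patch code: the 
class and method names (SpanOffsetsSketch, CollectedOffsets, collectLeaf) are 
hypothetical, and it assumes the collector is told the term's freq when it 
first sees the term.  It only illustrates the two ideas above: one map entry 
per term, shared across span clauses, and int arrays sized up front by freq.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; none of these names are taken from the patch.
class SpanOffsetsSketch {

    /** Mutable while the collector fills it, treated as read-only afterwards. */
    static final class CollectedOffsets {
        final int[] starts;
        final int[] ends;
        int size = 0; // distinct matched occurrences; doubles as the accurate freq

        CollectedOffsets(int termFreq) {
            // A term cannot match at more positions than it occurs in the document,
            // so the term's freq is a safe upper bound and the arrays never grow.
            starts = new int[termFreq];
            ends = new int[termFreq];
        }

        void add(int startOffset, int endOffset) {
            // The same occurrence may be reported by more than one span clause;
            // record it only once (linear scan is an assumption made for brevity).
            for (int i = 0; i < size; i++) {
                if (starts[i] == startOffset) {
                    return;
                }
            }
            starts[size] = startOffset;
            ends[size] = endOffset;
            size++;
        }

        int freq() {
            return size;
        }
    }

    /** One entry per term, shared by every span clause that mentions the term. */
    final Map<String, CollectedOffsets> offsetsByTerm = new HashMap<>();

    /** Stand-in for the callback a span collector would receive for each matching leaf. */
    void collectLeaf(String term, int termFreq, int startOffset, int endOffset) {
        offsetsByTerm
            .computeIfAbsent(term, t -> new CollectedOffsets(termFreq))
            .add(startOffset, endOffset);
    }

    public static void main(String[] args) {
        SpanOffsetsSketch collector = new SpanOffsetsSketch();
        // "quick" occurs 3 times in the document, but only 2 occurrences match.
        collector.collectLeaf("quick", 3, 4, 9);   // reported by one span clause
        collector.collectLeaf("quick", 3, 4, 9);   // same occurrence, another clause
        collector.collectLeaf("quick", 3, 40, 45);
        CollectedOffsets quick = collector.offsetsByTerm.get("quick");
        System.out.println("matched freq = " + quick.freq() + ", starts = "
            + Arrays.toString(Arrays.copyOf(quick.starts, quick.freq())));
    }
}
```

The sketch does not sort the collected offsets; it assumes occurrences arrive 
in document order, which a real implementation would have to guarantee.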

Note that this change will cause passage scores that involve position-sensitive 
queries to be a little different.  The old methodology wrapped the PostingsEnum 
for each position-sensitive term in a Spans and used the freq of the underlying 
term (even if we'd match this term fewer than freq times due to position 
sensitivity).  Now the freq for position-sensitive terms is accurate -- usually 
smaller, which will amount to higher scores for passages.
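
To illustrate why a smaller freq raises scores, here is a rough sketch that 
assumes a BM25-style, IDF-like term weight of the kind a passage scorer might 
use.  The class name, the constants K1 and PIVOT, and the sample numbers are 
illustrative assumptions, not values confirmed by this patch.

```java
// Hypothetical sketch of an IDF-like weight that shrinks as the term's freq grows.
class PassageWeightSketch {
    static final float K1 = 1.2f;    // assumed BM25 saturation constant
    static final float PIVOT = 87f;  // assumed average passage length in characters

    /** Weight of a term given the content length and how often the term is counted. */
    static float weight(int contentLength, int totalTermFreq) {
        // Approximate the number of "documents" (passages) from the content length.
        float numDocs = 1 + contentLength / PIVOT;
        return (K1 + 1) * (float) Math.log(1 + (numDocs + 0.5D) / (totalTermFreq + 0.5D));
    }

    public static void main(String[] args) {
        int contentLength = 2000;
        // Raw term freq (old behaviour) vs. freq counting only matched positions (new).
        float withRawFreq = weight(contentLength, 8);
        float withMatchedFreq = weight(contentLength, 3);
        System.out.println("weight with raw freq 8:     " + withRawFreq);
        System.out.println("weight with matched freq 3: " + withMatchedFreq);
    }
}
```

Under this assumption the weight is strictly decreasing in the freq, so 
counting only the positions that really match can only raise a term's 
contribution to a passage score.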

I think it's ready and I'll commit in a day or two.

> UnifiedHighlighter can highlight terms within SpanNear clauses at unmatched 
> positions
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-8121
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8121
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/highlighter
>            Reporter: David Smiley
>            Assignee: David Smiley
>            Priority: Minor
>             Fix For: 7.3
>
>         Attachments: LUCENE-2287_UH_SpanCollector.patch, 
> LUCENE-2287_UH_SpanCollector.patch
>
>
> The UnifiedHighlighter (and original Highlighter) highlight phrases by 
> converting to a SpanQuery and using the Spans' start and end positions, 
> assuming that every occurrence of the underlying terms between those positions 
> is to be highlighted.  But this is inaccurate; see LUCENE-5455 for a good 
> example, and also LUCENE-2287.  The solution is to use the SpanCollector API 
> which was introduced after the phrase matching aspects of those highlighters 
> were developed. 
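
The inaccuracy described in the issue can be seen in a small hand-worked case. 
The sketch below is plain Java, not Lucene code.  The document "a b a x c", the 
nested query (a immediately followed by b, then c within a few positions), the 
span range [0, 5) and the collected positions are all assumptions worked out by 
hand for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch contrasting range-based highlighting with collected leaf positions.
class SpanRangeVsCollectedSketch {
    // Token positions 0..4 of the assumed document "a b a x c".
    static final String[] TOKENS = {"a", "b", "a", "x", "c"};

    /** Range-based approach: every query-term occurrence in [spanStart, spanEnd). */
    static List<Integer> highlightByRange(List<String> queryTerms, int spanStart, int spanEnd) {
        List<Integer> positions = new ArrayList<>();
        for (int pos = spanStart; pos < spanEnd; pos++) {
            if (queryTerms.contains(TOKENS[pos])) {
                positions.add(pos);
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        List<String> queryTerms = Arrays.asList("a", "b", "c");
        // Assumed outer span for the match "a b ... c": positions [0, 5).
        List<Integer> byRange = highlightByRange(queryTerms, 0, 5);
        // Positions a collector would be handed for the leaves that matched:
        // a@0, b@1, c@4 (hard-coded here; the second "a" is not followed by "b").
        List<Integer> collected = Arrays.asList(0, 1, 4);
        System.out.println("range-based: " + byRange);
        System.out.println("collected:   " + collected);
    }
}
```

The range-based list includes the second "a" at position 2 even though it 
takes no part in the match, which is the kind of stray highlight this issue 
is about.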



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
