[ https://issues.apache.org/jira/browse/LUCENE-8121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
David Smiley updated LUCENE-8121:
---------------------------------
Attachment: LUCENE-2287_UH_SpanCollector.patch
* Added Passage.toString, useful for debugging and in tests
* Rewrote a large chunk of my last patch in PhraseHelper. I want to prevent
the same term in different SpanQueries from yielding two OffsetsEnums for the
same term with different freqs. I could get into the nitty-gritty, but anyone
who is curious can just read the (commented) patch. I removed the two methods I
had taken from Luwak since this refactoring didn't mesh with the API contract.
* I resolved the nocommits related to offset storage, principally by having
the value side of the map be the SpanCollectedOffsetsEnum, which I modified a
bit so that it is no longer immutable: the collector adds to it, and after
collection it isn't modified again. I use postingsEnum.freq() to size the int
arrays, so no resizing is needed. I'm really happy with that versus some other
things I tried. In the future it shouldn't be hard to add payload support.
* The patch has a bunch of changes to TestUnifiedHighlighter and
TestUnifiedHighlighterMTQ, which are improvements to test randomization and not
strictly for this patch.
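
For anyone who would rather see the offset-storage idea than read the patch, here
is a rough, Lucene-free sketch of it. The class and method names (CollectedOffsets,
OffsetsByTerm, collect) are hypothetical stand-ins, not the names in the patch;
termFreq plays the role of postingsEnum.freq(). The duplicate check is my own
assumption and only works if occurrences of a term arrive in order.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the per-term offsets holder described above.
final class CollectedOffsets {
    private final int[] starts;
    private final int[] ends;
    private int size; // offsets actually collected; never exceeds the capacity

    // capacity is the term's freq in the document, so add() never resizes
    CollectedOffsets(int capacity) {
        starts = new int[capacity];
        ends = new int[capacity];
    }

    void add(int start, int end) {
        // Assumption: occurrences arrive in order, so a repeat of the same
        // occurrence (reported by two span clauses) equals the last entry.
        if (size > 0 && starts[size - 1] == start && ends[size - 1] == end) {
            return;
        }
        starts[size] = start;
        ends[size] = end;
        size++;
    }

    int freq() { return size; } // matched occurrences, not the raw term freq
    int startOffset(int i) { return starts[i]; }
    int endOffset(int i) { return ends[i]; }
}

// One entry per term, no matter how many span clauses mention that term.
final class OffsetsByTerm {
    private final Map<String, CollectedOffsets> byTerm = new HashMap<>();

    void collect(String term, int termFreq, int start, int end) {
        byTerm.computeIfAbsent(term, t -> new CollectedOffsets(termFreq))
              .add(start, end);
    }

    CollectedOffsets get(String term) { return byTerm.get(term); }
}
```

Because the map is keyed by term, two clauses that share a term feed the same
entry, and freq() reports only what was actually collected.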
Note that this change will cause passage scores that involve position-sensitive
queries to be a little different. The old methodology wrapped the PostingsEnum
for each position-sensitive term in a Spans and used the freq of the underlying
term (even if we'd match this term fewer than freq times due to position
sensitivity). Now the freq for position-sensitive terms is accurate -- usually
smaller, which will amount to higher scores for passages.
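
To illustrate why a smaller freq raises the score, here is a small sketch that
assumes a BM25-style, IDF-like term weight in which a term's contribution
shrinks as its freq grows. The formula shape and the constants (K1, PIVOT) are
assumptions for illustration only and are not copied from the patch.

```java
final class PassageWeightSketch {
    static final float K1 = 1.2f;   // assumed BM25 constant
    static final float PIVOT = 87f; // assumed typical passage length in chars

    // IDF-like weight: the content length approximates a "number of passages",
    // and the term's freq stands in for how many of them contain the term.
    static float weight(int contentLength, int termFreq) {
        float numPassages = 1 + contentLength / PIVOT;
        return (K1 + 1) * (float) Math.log(1 + (numPassages + 0.5) / (termFreq + 0.5));
    }

    public static void main(String[] args) {
        // Old behaviour: raw freq of the term. New behaviour: matched occurrences only.
        System.out.println("freq=5: " + weight(1000, 5));
        System.out.println("freq=2: " + weight(1000, 2));
    }
}
```

Under these assumptions the weight for freq=2 is larger than for freq=5, which
is the direction of the score change described above.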
I think it's ready and I'll commit in a day or two.
> UnifiedHighlighter can highlight terms within SpanNear clauses at unmatched
> positions
> -------------------------------------------------------------------------------------
>
> Key: LUCENE-8121
> URL: https://issues.apache.org/jira/browse/LUCENE-8121
> Project: Lucene - Core
> Issue Type: Bug
> Components: modules/highlighter
> Reporter: David Smiley
> Assignee: David Smiley
> Priority: Minor
> Fix For: 7.3
>
> Attachments: LUCENE-2287_UH_SpanCollector.patch,
> LUCENE-2287_UH_SpanCollector.patch
>
>
> The UnifiedHighlighter (and original Highlighter) highlight phrases by
> converting to a SpanQuery and using the Spans' start and end positions to
> assume that every occurrence of the underlying terms between those positions
> is to be highlighted. But this is inaccurate; see LUCENE-5455 for a good
> example, and also LUCENE-2287. The solution is to use the SpanCollector API,
> which was introduced after the phrase matching aspects of those highlighters
> were developed.
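
The inaccuracy in the summary above can be shown with a toy model that does not
use Lucene at all. Everything here (SpanHighlightSketch, PositionCollector, the
token array) is hypothetical; collectLeaf only loosely mirrors the idea of a
collector being told which leaf positions a match used. The assumed query is a
nested, ordered near query: ("a b" with slop 0) followed by "c" with slop 2,
run over the tokens a b x b c.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical collector: records each leaf position the matcher reports.
final class PositionCollector {
    final List<Integer> positions = new ArrayList<>();

    void collectLeaf(int position, String term) {
        positions.add(position);
    }
}

final class SpanHighlightSketch {
    // Range-based approach: highlight every query-term occurrence whose
    // position falls inside the span [spanStart, spanEnd).
    static List<Integer> byRange(String[] tokens, List<String> queryTerms,
                                 int spanStart, int spanEnd) {
        List<Integer> hits = new ArrayList<>();
        for (int pos = spanStart; pos < spanEnd; pos++) {
            if (queryTerms.contains(tokens[pos])) {
                hits.add(pos);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String[] tokens = {"a", "b", "x", "b", "c"};
        List<String> terms = Arrays.asList("a", "b", "c");

        // The whole match spans positions [0, 5).
        System.out.println("by range:     " + byRange(tokens, terms, 0, 5));

        // The match itself used only a@0, b@1 and c@4.
        PositionCollector collector = new PositionCollector();
        collector.collectLeaf(0, "a");
        collector.collectLeaf(1, "b");
        collector.collectLeaf(4, "c");
        System.out.println("by collector: " + collector.positions);
    }
}
```

The range-based list includes the second "b" (position 3) even though it is not
part of the match; the collected positions do not.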
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)