[ 
https://issues.apache.org/jira/browse/LUCENE-9634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17298529#comment-17298529
 ] 

Zach Chen commented on LUCENE-9634:
-----------------------------------

Hi [~dweiss], I took a look at this issue and am also not sure what the proper 
way of fixing it is. I'm considering a few possible solutions below, but I 
wonder whether there are other, better solutions as well, so I would like to 
get your opinion before I proceed further (I can also open a PR for discussion 
if that's preferred).

For context, the root cause is a difference in how positions and offsets are 
read. Positions read in *OffsetsFromPositions#get* with 
*MatchesIterator#startPosition* and *MatchesIterator#endPosition* account for 
the *before* / *after* values properly through *ExtendedIntervalIterator#start* 
and *ExtendedIntervalIterator#end* respectively. Offsets read in 
*OffsetsFromMatchIterator#get* with *MatchesIterator#startOffset* and 
*MatchesIterator#endOffset* are not adjusted by the *before* / *after* values 
at all, hence the incorrect offset highlight and the test failure for 
*TestMatchRegionRetriever#testDegenerateIntervalsWithOffsets*. 
The other OffsetsRetrievalStrategy implementations, such as 
*OffsetsFromTokens* and *OffsetsFromValues*, don't store or use the *before* / 
*after* values either, so I suspect they may have the same issue (but I haven't 
tested them to confirm yet).
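
The mismatch described above can be illustrated with a toy model. This is only a sketch and not Lucene code: the class, the method names and the hard-coded token table are all hypothetical, and the clamping to the available tokens is an assumption made for the example. It contrasts a position-based path, which widens the position range by *before* / *after* and then converts to offsets, with an offset-based path that returns the raw offsets of the unextended match.

```java
import java.util.Arrays;

/** Toy model of the two code paths; every name here is hypothetical, not a Lucene API. */
final class DegenerateOffsetsSketch {
  // Tokens of "quick brown fox jumps" at positions 0..3, with their character offsets.
  static final int[] START_OFFSETS = {0, 6, 12, 16};
  static final int[] END_OFFSETS = {5, 11, 15, 21};

  /** Position-based path: widen the position range first, then convert positions to offsets. */
  static int[] offsetsViaPositions(int startPos, int endPos, int before, int after) {
    // Clamping to the existing tokens is an assumption of this sketch.
    int from = Math.max(0, startPos - before);
    int to = Math.min(START_OFFSETS.length - 1, endPos + after);
    return new int[] {START_OFFSETS[from], END_OFFSETS[to]};
  }

  /** Offset-based path: the offsets of the unextended match are returned untouched. */
  static int[] offsetsViaMatch(int startPos, int endPos) {
    return new int[] {START_OFFSETS[startPos], END_OFFSETS[endPos]};
  }

  public static void main(String[] args) {
    // Match on "fox" (position 2), extended by one position on each side.
    System.out.println(Arrays.toString(offsetsViaPositions(2, 2, 1, 1))); // [6, 21]
    System.out.println(Arrays.toString(offsetsViaMatch(2, 2)));           // [12, 15]
  }
}
```

In this model the position-based path highlights "brown fox jumps", while the offset-based path highlights only "fox", which is the kind of discrepancy the failing test reports.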

For the solution to this, I'm considering the following two options:
 # Deprecate *OffsetsFromMatchIterator* in favor of *OffsetsFromPositions*. 
These two appear to have similar implementations, and since supporting 
adjustment by *before* / *after* values in *OffsetsFromMatchIterator* 
necessarily requires processing token position information as well, the 
processing work involved might be the same as in *OffsetsFromPositions* when 
*before* / *after* are used. However, in "typical" scenarios where *before* 
/ *after* adjustment is not needed, *OffsetsFromPositions* does more work 
than *OffsetsFromMatchIterator* because of the conversion from positions to 
offsets at the end.
 # Implement *OffsetsFromMatchIterator* similarly to *OffsetsFromTokens* and 
*OffsetsFromValues*, by explicitly analyzing and looping over the token stream 
again. This requires the *before* / *after* values to be made available 
in *OffsetsFromMatchIterator* somehow, which may require some signature changes.
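
As a rough illustration of the second option, the sketch below walks a token list once and picks up the offsets that cover the widened position range. It is an assumption-laden sketch rather than a proposed patch: the `Token` class stands in for whatever attributes the analyzer would report, and `widenedOffsets` and its parameters are hypothetical names, not existing Lucene signatures.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of option 2; none of these types exist in Lucene. */
final class TokenWalkSketch {
  /** Minimal stand-in for the per-token attributes an analyzer would report. */
  static final class Token {
    final int positionIncrement;
    final int startOffset;
    final int endOffset;

    Token(int positionIncrement, int startOffset, int endOffset) {
      this.positionIncrement = positionIncrement;
      this.startOffset = startOffset;
      this.endOffset = endOffset;
    }
  }

  /**
   * Walks the tokens once and returns {startOffset, endOffset} covering the position
   * range [startPos - before, endPos + after]. Returns null if no token is in range.
   */
  static int[] widenedOffsets(
      List<Token> tokens, int startPos, int endPos, int before, int after) {
    int from = Math.max(0, startPos - before);
    long to = (long) endPos + after; // long avoids overflow for very large 'after' values
    int position = -1;
    int start = -1;
    int end = -1;
    for (Token token : tokens) {
      position += token.positionIncrement;
      if (position > to) {
        break; // past the widened range, nothing more to collect
      }
      if (position >= from) {
        if (start < 0) {
          start = token.startOffset; // first token inside the range
        }
        end = Math.max(end, token.endOffset);
      }
    }
    return start < 0 ? null : new int[] {start, end};
  }

  public static void main(String[] args) {
    // "quick brown fox jumps", one position per token.
    List<Token> tokens = Arrays.asList(
        new Token(1, 0, 5), new Token(1, 6, 11), new Token(1, 12, 15), new Token(1, 16, 21));
    // Match on "fox" (position 2), widened by one position on each side.
    System.out.println(Arrays.toString(widenedOffsets(tokens, 2, 2, 1, 1))); // [6, 21]
  }
}
```

The sketch makes the cost of this option visible: the match positions and the *before* / *after* values all have to reach the code that walks the tokens, which is where the signature changes would come from.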

Another option is to create a new class similar to 
*ExtendedIntervalIterator* that handles the position adjustment internally, 
within *MatchesIterator#startOffset* and *MatchesIterator#endOffset*, by 
processing the token stream. But this option also appears to require changing 
quite a few signatures, so it may not be ideal.

What do you think about the solutions above?

> Highlighting of degenerate spans on fields *with offsets* doesn't work 
> properly
> -------------------------------------------------------------------------------
>
>                 Key: LUCENE-9634
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9634
>             Project: Lucene - Core
>          Issue Type: Sub-task
>            Reporter: Dawid Weiss
>            Assignee: Dawid Weiss
>            Priority: Minor
>
> Match highlighter works fine with degenerate interval positions when 
> {{OffsetsFromPositions}} strategy is used to compute offsets but will show 
> incorrect offset ranges if offsets are read directly from the 
> {{MatchIterator}} ({{OffsetsFromMatchIterator}}).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
