[
https://issues.apache.org/jira/browse/LUCENE-9634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17298529#comment-17298529
]
Zach Chen commented on LUCENE-9634:
-----------------------------------
Hi [~dweiss], I took a look at this issue and am not sure what the proper way
to fix it is. I'm considering a few possible solutions below, but I am
wondering whether there's a better one. Hence I would like to get your opinion
before I proceed further (I can also open a PR for discussion if that's
preferred).
For context, the root cause of the issue is as follows. Positions read in
*OffsetsFromPositions#get* with *MatchesIterator#startPosition* and
*MatchesIterator#endPosition* account for *before* / *after* values properly
through *ExtendedIntervalIterator#start* and *ExtendedIntervalIterator#end*
respectively. In contrast, offsets read in *OffsetsFromMatchIterator#get* with
*MatchesIterator#startOffset* and *MatchesIterator#endOffset* aren't adjusted
by the *before* / *after* values at all, hence the incorrect offset highlight
and the test failure for
*TestMatchRegionRetriever#testDegenerateIntervalsWithOffsets*.
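To make the mismatch concrete, here is a minimal sketch in plain Java. It does not use any Lucene classes: the class name, the token table and both helper methods are hypothetical stand-ins, and clamping the extended start position at 0 is an assumption about how the extended interval behaves.

```java
import java.util.Arrays;

final class OffsetMismatchSketch {
    // Hypothetical token table for the field value "foo bar baz qux":
    // index = token position, value = {startOffset, endOffset}.
    static final int[][] TOKEN_OFFSETS = {
        {0, 3}, {4, 7}, {8, 11}, {12, 15}
    };

    // Position-based path: widen the match by before/after first,
    // then map the widened positions to character offsets.
    static int[] offsetsFromPositions(int startPos, int endPos, int before, int after) {
        int from = Math.max(0, startPos - before);                    // assumed clamp at 0
        int to = Math.min(TOKEN_OFFSETS.length - 1, endPos + after);  // clamp to the last token
        return new int[] {TOKEN_OFFSETS[from][0], TOKEN_OFFSETS[to][1]};
    }

    // Offset-based path: report the offsets of the unextended match,
    // so before/after never influence the result.
    static int[] offsetsFromMatch(int startPos, int endPos) {
        return new int[] {TOKEN_OFFSETS[startPos][0], TOKEN_OFFSETS[endPos][1]};
    }

    public static void main(String[] args) {
        // Match on "bar" (position 1), extended by one position on each side.
        System.out.println(Arrays.toString(offsetsFromPositions(1, 1, 1, 1))); // [0, 11]
        System.out.println(Arrays.toString(offsetsFromMatch(1, 1)));           // [4, 7]
    }
}
```

The two paths agree only when *before* and *after* are both 0, which is consistent with the failure showing up only for extended intervals.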
Looking at the other OffsetsRetrievalStrategy implementations such as
*OffsetsFromTokens* and *OffsetsFromValues*, since they don't store / use
*before* / *after* values either, I suspect they may have the same issue (but I
haven't tested them to confirm yet).
For the solution to this, I'm considering the following two options:
# Deprecate *OffsetsFromMatchIterator* in favor of *OffsetsFromPositions*. These
two appear to have similar implementations, and since supporting position
adjustment with *before* / *after* values in *OffsetsFromMatchIterator*
necessarily requires processing token position information as well, the
processing work involved might be the same as in *OffsetsFromPositions* when
*before* / *after* are used. However, under "typical" scenarios where *before*
/ *after* adjustment is not needed, *OffsetsFromPositions* does do more work
than *OffsetsFromMatchIterator* due to the conversion from position to offset
at the end.
# Implement *OffsetsFromMatchIterator* similarly to *OffsetsFromTokens* and
*OffsetsFromValues*, by explicitly analyzing and looping over the token stream
again. This requires the *before* / *after* values to be available in
*OffsetsFromMatchIterator*, which may require some signature changes.
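As a rough illustration of the second option, the token walk could look something like the sketch below. It is plain Java with no Lucene dependency: the `Token` class only mimics the position increment and offsets a token stream would report, and `adjustedOffsets` is a hypothetical helper, not an existing method.

```java
import java.util.Arrays;
import java.util.List;

final class TokenWalkSketch {
    // Hypothetical stand-in for one analyzed token.
    static final class Token {
        final int positionIncrement;
        final int startOffset;
        final int endOffset;

        Token(int positionIncrement, int startOffset, int endOffset) {
            this.positionIncrement = positionIncrement;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    // Returns {startOffset, endOffset} covering every token whose position lies in
    // [startPos - before, endPos + after], or null if no token falls in that window.
    static int[] adjustedOffsets(List<Token> tokens, int startPos, int endPos,
                                 int before, int after) {
        int from = Math.max(0, startPos - before);
        long to = (long) endPos + after;   // long avoids overflow for very large 'after'
        int position = -1;                 // the first increment of 1 yields position 0
        int start = Integer.MAX_VALUE;
        int end = -1;
        for (Token t : tokens) {
            position += t.positionIncrement;
            if (position > to) {
                break;                     // past the extended window, stop early
            }
            if (position >= from) {
                start = Math.min(start, t.startOffset);
                end = Math.max(end, t.endOffset);
            }
        }
        return end < 0 ? null : new int[] {start, end};
    }

    public static void main(String[] args) {
        // Tokens of "foo bar baz qux", one position apart.
        List<Token> tokens = Arrays.asList(
            new Token(1, 0, 3), new Token(1, 4, 7),
            new Token(1, 8, 11), new Token(1, 12, 15));
        // Match on "bar" (position 1), extended by one position on each side.
        System.out.println(Arrays.toString(adjustedOffsets(tokens, 1, 1, 1, 1))); // [0, 11]
    }
}
```

The extra pass over the tokens is the cost of this option, and *before* / *after* still have to reach the method somehow, which is where the signature change comes in.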
Another option is to create a new class similar to
*ExtendedIntervalIterator* that handles position adjustment within
*MatchesIterator#startOffset* and *MatchesIterator#endOffset* internally with
token stream processing. But this option also appears to require changing quite
a few signatures, so it may not be ideal.
What do you think about the solutions above?
> Highlighting of degenerate spans on fields *with offsets* doesn't work
> properly
> -------------------------------------------------------------------------------
>
> Key: LUCENE-9634
> URL: https://issues.apache.org/jira/browse/LUCENE-9634
> Project: Lucene - Core
> Issue Type: Sub-task
> Reporter: Dawid Weiss
> Assignee: Dawid Weiss
> Priority: Minor
>
> Match highlighter works fine with degenerate interval positions when
> {{OffsetsFromPositions}} strategy is used to compute offsets but will show
> incorrect offset ranges if offsets are read directly from the
> {{MatchIterator}} ({{OffsetsFromMatchIterator}}).