[
https://issues.apache.org/jira/browse/LUCENE-8776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177740#comment-17177740
]
Dawid Weiss commented on LUCENE-8776:
-------------------------------------
I don't think the tokens can be rearranged so that both positions and offsets
are non-decreasing, given that the positions are fixed (so that they match on
spans)?
{code}
organic light emitting diode glows
|    |    |    |    |    |    |    |
0    5    10   15   20   25   30   35

pos  term(s)               offset (inclusive)
0    organic               0-6
1    light                 8-12
1    light-emitting-diode  8-27
2    emitting              14-21
3    diode                 23-27
3    light-emitting-diode  8-27
4    glows                 29-33
{code}
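The same argument can be checked mechanically. What follows is only a minimal sketch in plain Java, not Lucene's API or its actual check: the class OffsetOrderSketch, the Token type and both method names are made up for illustration. It assumes the only freedom is the order of tokens that share a position, so sorting by (position, start offset) is the most favourable ordering available.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical plain-Java model of the token table above; not Lucene code.
class OffsetOrderSketch {

    /** One token: position, term text, start and end offset (end inclusive, as in the table). */
    static final class Token {
        final int position;
        final String term;
        final int startOffset;
        final int endOffset;

        Token(int position, String term, int startOffset, int endOffset) {
            this.position = position;
            this.term = term;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    /** The seven tokens from the table, in the order listed there. */
    static List<Token> exampleTokens() {
        List<Token> tokens = new ArrayList<>();
        tokens.add(new Token(0, "organic", 0, 6));
        tokens.add(new Token(1, "light", 8, 12));
        tokens.add(new Token(1, "light-emitting-diode", 8, 27));
        tokens.add(new Token(2, "emitting", 14, 21));
        tokens.add(new Token(3, "diode", 23, 27));
        tokens.add(new Token(3, "light-emitting-diode", 8, 27));
        tokens.add(new Token(4, "glows", 29, 33));
        return tokens;
    }

    /** Index of the first token whose start offset is smaller than its predecessor's, or -1. */
    static int firstBackwardsOffset(List<Token> tokens) {
        for (int i = 1; i < tokens.size(); i++) {
            if (tokens.get(i).startOffset < tokens.get(i - 1).startOffset) {
                return i;
            }
        }
        return -1;
    }

    /**
     * True if some ordering with non-decreasing positions also has non-decreasing
     * start offsets. Sorting by (position, startOffset) is the best candidate, so
     * if that ordering goes backwards, every other one does too.
     */
    static boolean canOrderWithoutGoingBackwards(List<Token> tokens) {
        List<Token> sorted = new ArrayList<>(tokens);
        sorted.sort(Comparator.comparingInt((Token t) -> t.position)
                .thenComparingInt(t -> t.startOffset));
        return firstBackwardsOffset(sorted) == -1;
    }

    public static void main(String[] args) {
        List<Token> tokens = exampleTokens();
        System.out.println(firstBackwardsOffset(tokens));          // prints 5
        System.out.println(canOrderWithoutGoingBackwards(tokens)); // prints false
    }
}
```

In the listed order the second light-emitting-diode (index 5) starts at 8 after diode started at 23. Emitting it before diode does not help, because it then starts at 8 after emitting started at 14.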
> Start offset going backwards has a legitimate purpose
> -----------------------------------------------------
>
> Key: LUCENE-8776
> URL: https://issues.apache.org/jira/browse/LUCENE-8776
> Project: Lucene - Core
> Issue Type: Bug
> Components: core/search
> Affects Versions: 7.6
> Reporter: Ram Venkat
> Priority: Major
>
> Here is the use case where startOffset can go backwards:
> Say there is a line "Organic light-emitting-diode glows", and I want to run
> span queries and highlight them properly.
> At index time, light-emitting-diode is split into three words, which
> allows me to search for 'light', 'emitting' and 'diode' individually. The
> three words occupy adjacent positions in the index, because queries for
> 'light' adjacent to 'emitting' and for 'light' at a distance of two words
> from 'diode' need to match this word. So, the order of words after splitting
> is: Organic, light, emitting, diode, glows.
> But, I also want to search for 'organic' being adjacent to
> 'light-emitting-diode' or 'light-emitting-diode' being adjacent to 'glows'.
> The way I solved this was to also generate 'light-emitting-diode' at two
> positions: (a) in the same position as 'light' and (b) in the same position
> as 'diode', like below:
> ||organic||light||emitting||diode||glows||
> | |light-emitting-diode| |light-emitting-diode| |
> |0|1|2|3|4|
> The positions of the two 'light-emitting-diode' are 1 and 3, but the offsets
> are obviously the same. This works beautifully in Lucene 5.x in both
> searching and highlighting with span queries.
> But when I try this in Lucene 7.6, it hits the condition "Offsets must not go
> backwards" at DefaultIndexingChain:818. This IllegalArgumentException is
> being thrown without any comments on why this check is needed. As I explained
> above, startOffset going backwards is perfectly valid, to deal with word
> splitting and span operations on these specialized use cases. On the other
> hand, it is not clear what value is added by this check and which highlighter
> code is affected by offsets going backwards. This same check is done at
> BaseTokenStreamTestCase:245.
> I see others talk about how this check found bugs in WordDelimiter etc. but
> it also prevents legitimate use cases. Can this check be removed?