[jira] [Commented] (LUCENE-8776) Start offset going backwards has a legitimate purpose

Michael Gibney (Jira) Fri, 14 Aug 2020 11:07:16 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-8776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177977#comment-17177977
 ]


Michael Gibney commented on LUCENE-8776:
----------------------------------------

{quote}I see this solution as working around the fact that {{positionLength}} 
is not indexed in Lucene
{quote}
I think that indeed describes the motivation behind [~venkat11]'s example. For 
some such cases, restricting graph tokenStreams to query-time would help; but I 
don't think there's any escaping the fact that there are cases where graph 
tokenStreams would more appropriately be generated at index time (e.g., using 
extra context/information that can _only_ be available at index time). There's 
always LUCENE-4312 :-) ...

I'm also still struggling to understand [[email protected]]'s examples, 
which re-focused attention on this issue; like [~mikemccand], I don't see why 
any of these examples would fail based on the offset constraints in 
{{DefaultIndexingChain}}. AFAICT, there is no "{{offsetEnd < 
lastToken.offsetEnd}}" restriction; each startOffset in the examples above is 
{{>=}} the preceding startOffset, and each endOffset is {{>=}} its 
corresponding startOffset. I feel like I'm missing something. 
[[email protected]], can you share your actual analysis chain config?
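
As a point of reference, the two constraints named above can be modeled in a few lines. This is only a sketch of the rule as described in this comment, not the actual {{DefaultIndexingChain}} code; the function name, the sample offsets, and the assumed token order for the issue description's example are all illustrative.

```python
from typing import List, Optional, Tuple

Offsets = Tuple[int, int]  # (startOffset, endOffset)


def first_offset_violation(offsets: List[Offsets]) -> Optional[int]:
    """Return the index of the first token that breaks either constraint, or None.

    Constraints modeled (as described in the comment, not copied from Lucene):
      1. startOffset must be >= the preceding token's startOffset
      2. endOffset must be >= the same token's startOffset
    """
    last_start = 0
    for i, (start, end) in enumerate(offsets):
        if start < last_start or end < start:
            return i
        last_start = start
    return None


# endOffset shrinking relative to the previous token is allowed,
# as long as startOffset never goes backwards.
shrinking_end = [(0, 10), (0, 4), (5, 10)]

# Assumed emission order for "Organic light-emitting-diode glows" from the
# issue description: the second 'light-emitting-diode' (8, 28) follows
# 'diode' (23, 28), so its startOffset goes backwards.
issue_example = [(0, 7), (8, 13), (8, 28), (14, 22), (23, 28), (8, 28), (29, 34)]

print(first_offset_violation(shrinking_end))
print(first_offset_violation(issue_example))
```

Under these assumptions, only the second sequence is rejected, and the rejection is caused by a backward startOffset rather than by any endOffset comparison with the preceding token.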

No matter what startOffset/endOffsets your tokens have, even if they get 
_really_ crazy, it should be possible (at least in principle) to order tokens 
such that the constraint of "no backward startOffsets" holds, no? I think, as 
[~dsmiley] hints at, where this could (perhaps?) become impossible is if such 
ordering would conflict with ordering based on positions.
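
A small sketch of that tension, using the example from the issue description: sorting by startOffset always yields non-decreasing startOffsets, but the resulting order may no longer be non-decreasing in position. The {{Token}} tuple and the offsets below are assumptions made for illustration, not Lucene types.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    term: str
    position: int
    start: int
    end: int


def is_non_decreasing(values: List[int]) -> bool:
    """True if each value is >= the one before it."""
    return all(a <= b for a, b in zip(values, values[1:]))


# "Organic light-emitting-diode glows", with the compound term emitted twice
# (positions 1 and 3), as in the issue description.
tokens = [
    Token("organic", 0, 0, 7),
    Token("light", 1, 8, 13),
    Token("light-emitting-diode", 1, 8, 28),
    Token("emitting", 2, 14, 22),
    Token("diode", 3, 23, 28),
    Token("light-emitting-diode", 3, 8, 28),
    Token("glows", 4, 29, 34),
]

# Reordering by offsets trivially satisfies "no backward startOffsets" ...
by_offset = sorted(tokens, key=lambda t: (t.start, t.end))
print(is_non_decreasing([t.start for t in by_offset]))

# ... but the positions then go backwards (3 is followed by 2).
print([t.position for t in by_offset])
print(is_non_decreasing([t.position for t in by_offset]))
```

So for this input no single order satisfies both the offset constraint and position order, which is the kind of conflict hinted at above.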

To my mind (echoing [~simonw]'s "have clear bounds of what an offset can be and 
in what direction it can grow"): the main benefit of formalizing/enforcing the 
order of tokens/offsets in a tokenStream is so that consumers (including 
indirect consumers, e.g. via postings at query-time) know what to expect. Even 
an index that's not strictly-speaking "corrupt" can be made more 
useful/efficient if types of order that _can_ be enforced are enforced.

Two main questions occur to me:
# Are there use cases that truly cannot be supported (even in principle, never 
mind with the current state of analysis components) with strict ordering 
constraints based on token offsets, positions, and maybe positionLengths?
# Is there enough existing functionality that people have built around the 
historic _lack_ of constraints that the horse has left the barn, and the 
ability to toggle this behavior off should be provided, absent some practical 
compulsion otherwise (e.g., actual index incompatibility/corruption, as opposed 
to simply sanity-checking input)?

Perhaps a bit off-topic, but ideally I could see:
# indexed positionLength
# strict ordering of tokens by increasing position (this is the case by 
definition, I think?), and for a given position, order by increasing 
positionLength
# offsets compatible with positions such that the above position-based ordering 
would also result in ordering of tokens by increasing startOffset (perhaps even 
adding the constraint that for a given startOffset, endOffset would never 
decrease?)
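
To make the three points above concrete, here is a hypothetical model in which positionLength is available per token. It assumes the compound term from the issue description could then be a single token at position 1 with positionLength 3, instead of two copies; the {{GraphToken}} tuple, helper names, and offsets are illustrative assumptions, not an existing Lucene API.

```python
from typing import List, NamedTuple


class GraphToken(NamedTuple):
    term: str
    position: int
    position_length: int
    start: int
    end: int


def position_order(tokens: List[GraphToken]) -> List[GraphToken]:
    """Order by increasing position, then by increasing positionLength."""
    return sorted(tokens, key=lambda t: (t.position, t.position_length))


def offsets_follow_order(tokens: List[GraphToken]) -> bool:
    """Check that startOffset never decreases and that, for equal
    startOffsets, endOffset never decreases."""
    for prev, cur in zip(tokens, tokens[1:]):
        if cur.start < prev.start:
            return False
        if cur.start == prev.start and cur.end < prev.end:
            return False
    return True


# "Organic light-emitting-diode glows" with one compound token spanning
# positions 1..3 (deliberately listed out of order).
graph = [
    GraphToken("organic", 0, 1, 0, 7),
    GraphToken("light-emitting-diode", 1, 3, 8, 28),
    GraphToken("light", 1, 1, 8, 13),
    GraphToken("emitting", 2, 1, 14, 22),
    GraphToken("diode", 3, 1, 23, 28),
    GraphToken("glows", 4, 1, 29, 34),
]

ordered = position_order(graph)
print([t.term for t in ordered])
print(offsets_follow_order(ordered))
```

In this model the position-based order and the offset-based order agree, so neither constraint has to be relaxed for this particular example.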

> Start offset going backwards has a legitimate purpose
> -----------------------------------------------------
>
>                 Key: LUCENE-8776
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8776
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/search
>    Affects Versions: 7.6
>            Reporter: Ram Venkat
>            Priority: Major
>         Attachments: LUCENE-8776-proof-of-concept.patch
>
>
> Here is the use case where startOffset can go backwards:
> Say there is a line "Organic light-emitting-diode glows", and I want to run 
> span queries and highlight them properly. 
> During index time, light-emitting-diode is split into three words, which 
> allows me to search for 'light', 'emitting' and 'diode' individually. The 
> three words occupy adjacent positions in the index, because queries for 
> 'light' adjacent to 'emitting', and for 'light' within two words of 'diode', 
> need to match this word. So, the order of words after splitting is: Organic, 
> light, emitting, diode, glows. 
> But, I also want to search for 'organic' being adjacent to 
> 'light-emitting-diode' or 'light-emitting-diode' being adjacent to 'glows'. 
> The way I solved this was to also generate 'light-emitting-diode' at two 
> positions: (a) In the same position as 'light' and (b) in the same position 
> as 'diode', like below:
> ||organic||light||emitting||diode||glows||
> | |light-emitting-diode| |light-emitting-diode| |
> |0|1|2|3|4|
> The positions of the two 'light-emitting-diode' are 1 and 3, but the offsets 
> are obviously the same. This works beautifully in Lucene 5.x in both 
> searching and highlighting with span queries. 
> But when I try this in Lucene 7.6, it hits the condition "Offsets must not go 
> backwards" at DefaultIndexingChain:818. This IllegalArgumentException is 
> being thrown without any comments on why this check is needed. As I explained 
> above, startOffset going backwards is perfectly valid, to deal with word 
> splitting and span operations on these specialized use cases. On the other 
> hand, it is not clear what value is added by this check and which highlighter 
> code is affected by offsets going backwards. This same check is done at 
> BaseTokenStreamTestCase:245. 
> I see others talk about how this check found bugs in WordDelimiter etc. but 
> it also prevents legitimate use cases. Can this check be removed?  



