[jira] [Commented] (LUCENE-8776) Start offset going backwards has a legitimate purpose

Michael Gibney (Jira) Mon, 17 Aug 2020 11:56:21 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-8776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17179175#comment-17179175
 ]


Michael Gibney commented on LUCENE-8776:
----------------------------------------

Roman, I think we're on the same page re: positionLength. I read the original 
case on this issue (re: double-emission of tokens) as a sympathetic type of "XY 
problem", and I'm suggesting that _indexing_ positionLength (LUCENE-4312 – as 
opposed to simply _using_ unindexed positionLength) would be a better 
fundamental way to address the "Y" case (working positional queries) than 
accommodating the "X" case (double-emission of tokens, which may be "less" 
broken, but afaict is still broken in its own way).

I also need to apologize: I was indeed overlooking something. The asterisked 
terms in the examples you shared above still don't seem problematic to me (and 
I still see no problem with the "THE HUBBLE constant: a summary of the hubble 
space telescope program" example), but I see that the latter two examples 
("MIT and anti de sitter space-time" and "Massachusetts Institute of 
Technology and antidesitter space-time") each have one (and only one?) problem: 
a backward startOffset on the _last_ token. A couple of random thoughts on that:
 # Because the {{positionIncrement}} of each of these tokens is "0", it would 
be possible in principle to swap it with the preceding token to satisfy the 
constraints enforced by DefaultIndexingChain. This isn't an argument that the 
issue is irrelevant; rather, it's a wish for another example that _can't_ in 
principle be "solved" in such a way.
 # In the "gut feeling" department: I'm a little wary of this being on the 
_last_ token (tokenStream components can exhibit unusual behavior at the 
beginning/end). If I were troubleshooting, I'd probably first add an extra term 
at the end of each input field value and see how this affects things, just as a 
sanity check before digging deeper.
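
To make the swap in point 1 concrete, here is a minimal sketch in Python 
(chosen only for brevity; it does not use Lucene). The names Token, 
first_backward_offset and swap_with_previous are hypothetical, the token 
stream and its offsets are made up for illustration, and the check models only 
the "start offsets must not go backwards" rule, not everything that 
DefaultIndexingChain validates.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Token:
    term: str
    pos_inc: int  # position increment relative to the previous token
    start: int    # start offset into the original text
    end: int      # end offset into the original text


def first_backward_offset(tokens: List[Token]) -> int:
    """Index of the first token whose start offset is lower than the
    previous token's start offset, or -1 if there is none."""
    last_start = 0
    for i, tok in enumerate(tokens):
        if tok.start < last_start:
            return i
        last_start = tok.start
    return -1


def swap_with_previous(tokens: List[Token], i: int) -> List[Token]:
    """Swap token i with token i-1. Only valid when token i has a position
    increment of 0, so that both tokens keep their positions."""
    if i < 1 or tokens[i].pos_inc != 0:
        raise ValueError("can only swap a stacked token (pos_inc == 0)")
    prev, cur = tokens[i - 1], tokens[i]
    out = list(tokens)
    # The stacked token moves forward and inherits the increment;
    # the token it displaces becomes the stacked one.
    out[i - 1] = Token(cur.term, prev.pos_inc, cur.start, cur.end)
    out[i] = Token(prev.term, 0, prev.start, prev.end)
    return out


# Made-up stream: the last token is stacked on "time" but starts earlier.
stream = [
    Token("space", 1, 0, 5),
    Token("time", 1, 6, 10),
    Token("spacetime", 0, 0, 10),
]
bad = first_backward_offset(stream)      # index of the offending token
fixed = swap_with_previous(stream, bad)  # same positions, reordered tokens
print(bad, first_backward_offset(fixed))
```

In this toy stream a single swap is enough because only the immediately 
preceding token starts later than the stacked one. A stream where an earlier 
token at a lower position also starts later would not be repaired this way.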

FWIW, I was asking about the analysis chain for context; what keeps me from 
trying to reproduce locally is not so much the _complexity_ of the analysis 
chain as the fact that it uses several custom components (some of them based 
on deprecated implementations) ...

> Start offset going backwards has a legitimate purpose
> -----------------------------------------------------
>
>                 Key: LUCENE-8776
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8776
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/search
>    Affects Versions: 7.6
>            Reporter: Ram Venkat
>            Priority: Major
>         Attachments: LUCENE-8776-proof-of-concept.patch
>
>
> Here is the use case where startOffset can go backwards:
> Say there is a line "Organic light-emitting-diode glows", and I want to run 
> span queries and highlight them properly. 
> During index time, light-emitting-diode is split into three words, which 
> allows me to search for 'light', 'emitting' and 'diode' individually. The 
> three words occupy adjacent positions in the index, as 'light' adjacent to 
> 'emitting' and 'light' at a distance of two words from 'diode' need to match 
> this word. So, the order of words after splitting is: Organic, light, 
> emitting, diode, glows. 
> But, I also want to search for 'organic' being adjacent to 
> 'light-emitting-diode' or 'light-emitting-diode' being adjacent to 'glows'. 
> The way I solved this was to also generate 'light-emitting-diode' at two 
> positions: (a) In the same position as 'light' and (b) in the same position 
> as 'diode', like below:
> ||organic||light||emitting||diode||glows||
> | |light-emitting-diode| |light-emitting-diode| |
> |0|1|2|3|4|
> The positions of the two 'light-emitting-diode' are 1 and 3, but the offsets 
> are obviously the same. This works beautifully in Lucene 5.x in both 
> searching and highlighting with span queries. 
> But when I try this in Lucene 7.6, it hits the condition "Offsets must not go 
> backwards" at DefaultIndexingChain:818. This IllegalArgumentException is 
> being thrown without any comments on why this check is needed. As I explained 
> above, startOffset going backwards is perfectly valid, to deal with word 
> splitting and span operations on these specialized use cases. On the other 
> hand, it is not clear what value is added by this check and which highlighter 
> code is affected by offsets going backwards. This same check is done at 
> BaseTokenStreamTestCase:245. 
> I see others talk about how this check found bugs in WordDelimiter etc. but 
> it also prevents legitimate use cases. Can this check be removed?  



