[ https://issues.apache.org/jira/browse/LUCENE-8776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177991#comment-17177991 ]
Dawid Weiss commented on LUCENE-8776:
-------------------------------------

bq. The light-emitting-diode token is repeated a 2nd time (at position 3) so that the span/phrase query "light-emitting-diode glows" works correctly? This is the way I understood the original example at least?

bq. But then what about the span/phrase query "organic light-emitting-diode glows", which ought to match but I think even for your workaround (double-indexing light-emitting-diode) will not then work?

I think you're correct.

bq. Yet, Lucene already offers an accurate way to solve all of this, at query time, by properly consuming the token graph output after tokenizing a query (including positionLength of the tokens) and creating a correct query such that all of the above examples would work correctly, without producing two or more light-emitting-diode tokens

I never played with such complex graphs (redface). How would this work at indexing time and at query time? Can you write up a test case for the above, Mike?

> Start offset going backwards has a legitimate purpose
> -----------------------------------------------------
>
>                 Key: LUCENE-8776
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8776
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/search
>    Affects Versions: 7.6
>            Reporter: Ram Venkat
>            Priority: Major
>         Attachments: LUCENE-8776-proof-of-concept.patch
>
>
> Here is the use case where startOffset can go backwards:
> Say there is a line "Organic light-emitting-diode glows", and I want to run
> span queries and highlight them properly.
> During index time, light-emitting-diode is split into three words, which
> allows me to search for 'light', 'emitting' and 'diode' individually. The
> three words occupy adjacent positions in the index, as 'light' adjacent to
> 'emitting' and 'light' at a distance of two words from 'diode' need to match
> this word. So, the order of words after splitting is: Organic, light,
> emitting, diode, glows.
> But, I also want to search for 'organic' being adjacent to
> 'light-emitting-diode' or 'light-emitting-diode' being adjacent to 'glows'.
> The way I solved this was to also generate 'light-emitting-diode' at two
> positions: (a) In the same position as 'light' and (b) in the same position
> as 'glows', like below:
> ||organic||light||emitting||diode||glows||
> | |light-emitting-diode| |light-emitting-diode| |
> |0|1|2|3|4|
> The positions of the two 'light-emitting-diode' are 1 and 3, but the offsets
> are obviously the same. This works beautifully in Lucene 5.x in both
> searching and highlighting with span queries.
> But when I try this in Lucene 7.6, it hits the condition "Offsets must not go
> backwards" at DefaultIndexingChain:818. This IllegalArgumentException is
> being thrown without any comments on why this check is needed. As I explained
> above, startOffset going backwards is perfectly valid, to deal with word
> splitting and span operations on these specialized use cases. On the other
> hand, it is not clear what value is added by this check and which highlighter
> code is affected by offsets going backwards. This same check is done at
> BaseTokenStreamTestCase:245.
> I see others talk about how this check found bugs in WordDelimiter etc. but
> it also prevents legitimate use cases. Can this check be removed?
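
To make the token layout in the description concrete, here is a rough, self-contained sketch of why the second 'light-emitting-diode' token trips the check. It does not use Lucene at all: the Token class, issueExample(), positions() and firstBackwardsOffset() are hypothetical names invented for this illustration, and the check is a simplified model of the "offsets must not go backwards" condition, not the actual DefaultIndexingChain code. Positions follow the table in the description (1 and 3), and the character offsets are computed from the sample line "Organic light-emitting-diode glows".

```java
import java.util.ArrayList;
import java.util.List;

class OffsetCheckSketch {

    /** Hypothetical stand-in for the per-token attributes (term, position increment, offsets). */
    static final class Token {
        final String term;
        final int posInc;       // 0 means "same position as the previous token"
        final int startOffset;
        final int endOffset;

        Token(String term, int posInc, int startOffset, int endOffset) {
            this.term = term;
            this.posInc = posInc;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    /** Token stream for "Organic light-emitting-diode glows" as laid out in the issue's table. */
    static List<Token> issueExample() {
        List<Token> tokens = new ArrayList<>();
        tokens.add(new Token("organic", 1, 0, 7));
        tokens.add(new Token("light", 1, 8, 13));
        tokens.add(new Token("light-emitting-diode", 0, 8, 28)); // stacked on 'light'
        tokens.add(new Token("emitting", 1, 14, 22));
        tokens.add(new Token("diode", 1, 23, 28));
        tokens.add(new Token("light-emitting-diode", 0, 8, 28)); // stacked on 'diode'
        tokens.add(new Token("glows", 1, 29, 34));
        return tokens;
    }

    /** Absolute positions, accumulated from position increments (first token lands at 0). */
    static List<Integer> positions(List<Token> tokens) {
        List<Integer> result = new ArrayList<>();
        int position = -1;
        for (Token t : tokens) {
            position += t.posInc;
            result.add(position);
        }
        return result;
    }

    /**
     * Simplified model of the check (an assumption, not Lucene's code): returns the
     * index of the first token whose startOffset is smaller than the previous
     * token's startOffset, or -1 if offsets never go backwards.
     */
    static int firstBackwardsOffset(List<Token> tokens) {
        int lastStartOffset = 0;
        for (int i = 0; i < tokens.size(); i++) {
            Token t = tokens.get(i);
            if (t.startOffset < lastStartOffset) {
                return i;
            }
            lastStartOffset = t.startOffset;
        }
        return -1;
    }

    public static void main(String[] args) {
        List<Token> tokens = issueExample();
        System.out.println("positions: " + positions(tokens));
        int bad = firstBackwardsOffset(tokens);
        System.out.println("first backwards offset at token index " + bad
                + " (" + tokens.get(bad).term + ")");
    }
}
```

In this model the first stacked token is fine, because its startOffset (8) equals that of 'light'. Only the second copy is rejected: it follows 'diode' (startOffset 23) but starts at 8 again.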