Ram Venkat created LUCENE-8776:
----------------------------------

             Summary: Start offset going backwards has a legitimate purpose
                 Key: LUCENE-8776
                 URL: https://issues.apache.org/jira/browse/LUCENE-8776
             Project: Lucene - Core
          Issue Type: Bug
          Components: core/search
    Affects Versions: 7.6
            Reporter: Ram Venkat


Here is the use case where startOffset can go backwards:

Say there is a line "Organic light-emitting-diode glows", and I want to run 
span queries and highlight them properly. 

At index time, light-emitting-diode is split into three words, which allows 
me to search for 'light', 'emitting' and 'diode' individually. The three words 
occupy adjacent positions in the index, because a query for 'light' adjacent to 
'emitting', or for 'light' within two positions of 'diode', needs to match this 
word. So the order of words after splitting is: Organic, light, emitting, diode, 
glows. 

But, I also want to search for 'organic' being adjacent to 
'light-emitting-diode' or 'light-emitting-diode' being adjacent to 'glows'. 

The way I solved this was to also generate 'light-emitting-diode' at two 
positions: (a) in the same position as 'light' and (b) in the same position as 
'diode', like below:
||organic||light||emitting||diode||glows||
| |light-emitting-diode| |light-emitting-diode| |
|0|1|2|3|4|
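
To make the layout concrete, here is a minimal sketch in plain Java. It models 
each token as a term, a position increment and a pair of character offsets. The 
class and method names are hypothetical and are not Lucene's API, the offsets 
are counted from the start of the example line, and the emission order (each 
compound token right after the word whose position it shares) is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of the token layout in the table. These are illustrative
// classes, not Lucene's TokenStream or attribute API.
class TokenLayoutSketch {

    static final class Token {
        final String term;
        final int posInc;       // position increment relative to the previous token
        final int startOffset;  // character offset where the token text begins
        final int endOffset;    // character offset just past the token text

        Token(String term, int posInc, int startOffset, int endOffset) {
            this.term = term;
            this.posInc = posInc;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    // Tokens for "Organic light-emitting-diode glows" in assumed emission order.
    // A position increment of 0 stacks a token on the previous token's position.
    static List<Token> tokens() {
        return Arrays.asList(
            new Token("organic", 1, 0, 7),
            new Token("light", 1, 8, 13),
            new Token("light-emitting-diode", 0, 8, 28),
            new Token("emitting", 1, 14, 22),
            new Token("diode", 1, 23, 28),
            new Token("light-emitting-diode", 0, 8, 28),
            new Token("glows", 1, 29, 34));
    }

    // Turns position increments into absolute positions, starting at 0.
    static List<Integer> positions(List<Token> tokens) {
        List<Integer> out = new ArrayList<>();
        int pos = -1;
        for (Token t : tokens) {
            pos += t.posInc;
            out.add(pos);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Token> ts = tokens();
        List<Integer> ps = positions(ts);
        for (int i = 0; i < ts.size(); i++) {
            Token t = ts.get(i);
            System.out.println(ps.get(i) + " " + t.term
                + " [" + t.startOffset + "," + t.endOffset + ")");
        }
    }
}
```

In this model both copies of 'light-emitting-diode' cover characters 8 to 28, 
while the second copy follows 'diode', whose start offset is 23.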

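The positions of the two 'light-emitting-diode' tokens are 1 and 3, but both 
carry the same start and end offsets, since they cover the same text. As a 
result, the second copy has a smaller start offset than the token emitted 
before it. This works beautifully in Lucene 5.x in both searching and 
highlighting with span queries. 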

But when I try this in Lucene 7.6, indexing fails on the check "Offsets must 
not go backwards" at DefaultIndexingChain:818. The IllegalArgumentException is 
thrown with no comment in the code explaining why the check is needed. As I 
explained above, a start offset that goes backwards is perfectly valid when 
dealing with word splitting and span operations in specialized use cases like 
this one. On the other hand, it is not clear what value the check adds or which 
highlighter code is affected by offsets going backwards. The same check is done 
at BaseTokenStreamTestCase:245. 
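
The effect of such a check on this token sequence can be sketched in plain 
Java. This is a simplified stand-in that only compares each start offset with 
the previous one; it is not the code in DefaultIndexingChain, and the class 
name, method names and message text are hypothetical. The offsets assume the 
tokens organic, light, light-emitting-diode, emitting, diode, 
light-emitting-diode, glows are emitted in that order, with offsets counted 
from the start of the example line.

```java
// Simplified stand-in for an "offsets must not go backwards" check.
// NOT Lucene's actual implementation; names here are illustrative only.
class OffsetCheckSketch {

    // Returns the index of the first token whose start offset is lower than
    // the previous token's start offset, or -1 if start offsets never decrease.
    static int firstBackwardsOffset(int[] startOffsets) {
        int last = 0;
        for (int i = 0; i < startOffsets.length; i++) {
            if (startOffsets[i] < last) {
                return i;
            }
            last = startOffsets[i];
        }
        return -1;
    }

    // Rejects the stream in the same spirit as the check described above.
    static void validate(int[] startOffsets) {
        int i = firstBackwardsOffset(startOffsets);
        if (i >= 0) {
            throw new IllegalArgumentException(
                "offsets must not go backwards: token " + i
                    + " has startOffset=" + startOffsets[i]);
        }
    }

    public static void main(String[] args) {
        // Start offsets with the compound token at positions 1 and 3.
        int[] bothCopies = {0, 8, 8, 14, 23, 8, 29};
        // Start offsets with the compound token at position 1 only.
        int[] firstCopyOnly = {0, 8, 8, 14, 23, 29};

        System.out.println(firstBackwardsOffset(firstCopyOnly));
        System.out.println(firstBackwardsOffset(bothCopies));
        validate(bothCopies); // throws IllegalArgumentException
    }
}
```

Under these assumptions, only the second copy of 'light-emitting-diode' trips 
the check: the copy stacked on 'light' shares its start offset, while the copy 
stacked on 'diode' jumps back from 23 to 8.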

I see that others have discussed how this check found bugs in WordDelimiter 
etc., but it also prevents legitimate use cases. Can this check be removed?



