[ 
https://issues.apache.org/jira/browse/LUCENE-8776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17178036#comment-17178036
 ] 

David Smiley commented on LUCENE-8776:
--------------------------------------

(from my message to the dev list last week):  Where I work, I've seen an 
interesting approach to mixed-language text analysis in which a sophisticated 
Tokenizer effectively re-tokenizes an input multiple ways by producing a token 
stream that is a concatenation of different interpretations of the input.  On a 
Lucene upgrade, we had to "coarsen" the offsets to the point of having 
highlights that point to a whole sentence instead of the words in that sentence 
:-(

I'll make up an example in a generic way.  Our actual tests are unintelligible 
to me as they use a bunch of CJK characters that are foreign to me :-)

Imagine input chars "abcdef" and two tokenizers, each emitting simple linear 
chains (no graphs).  Tokenizer1 emits: "abc", "def".  Tokenizer2 emits: "ab", 
"cd", "ef".  I believe it's impossible to combine both tokenizers in a way 
that preserves _both_ position and offset constraints.  One of them has to 
break.  If Lucene let offsets go backwards (as it used to), we could simply 
concatenate the streams -- tokenizer1's tokens, then tokenizer2's tokens.  
It's important that someone writing in the "language" of tokenizer1 can do 
position-sensitive queries for, say, "abcdef", and likewise that someone 
writing in the language of tokenizer2 can do queries for "abcd" or "cdef" and 
have them work.
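
To make the made-up example concrete, here is a minimal Python sketch of the 
concatenation.  It does not use any Lucene API: `Token`, `tokenize_fixed`, 
`concatenate`, `offsets_go_backwards` and `phrase_matches` are hypothetical 
names invented for illustration, and the two "tokenizers" are simulated by 
fixed-width chunking (3 chars and 2 chars).

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    term: str
    position: int  # absolute token position in the combined stream
    start: int     # start offset into the original text
    end: int       # end offset into the original text


def tokenize_fixed(text: str, width: int, first_position: int = 0) -> List[Token]:
    """Hypothetical tokenizer: consecutive chunks of `width` chars, one per position."""
    tokens = []
    for i, start in enumerate(range(0, len(text), width)):
        chunk = text[start:start + width]
        tokens.append(Token(chunk, first_position + i, start, start + len(chunk)))
    return tokens


def concatenate(text: str, widths: List[int]) -> List[Token]:
    """Emit each tokenizer's whole chain in turn, continuing the position count."""
    stream: List[Token] = []
    for width in widths:
        stream.extend(tokenize_fixed(text, width, first_position=len(stream)))
    return stream


def offsets_go_backwards(tokens: List[Token]) -> bool:
    """True if some token starts before the token emitted just before it."""
    return any(b.start < a.start for a, b in zip(tokens, tokens[1:]))


def phrase_matches(tokens: List[Token], terms: List[str]) -> bool:
    """True if `terms` occur at consecutive positions (a toy phrase query)."""
    by_position = {}
    for t in tokens:
        by_position.setdefault(t.position, set()).add(t.term)
    return any(
        all(terms[k] in by_position.get(p + k, ()) for k in range(len(terms)))
        for p in by_position
    )


stream = concatenate("abcdef", [3, 2])  # tokenizer1's chain, then tokenizer2's
for token in stream:
    print(token)
print("offsets go backwards:", offsets_go_backwards(stream))
```

In this sketch the positions stay strictly increasing, so the phrases 
"abc def", "ab cd" and "cd ef" all match, but the start offset drops from 3 
("def") to 0 ("ab") where the second chain begins.  Note that plain 
concatenation also makes "def ab" look like a phrase; a position gap between 
the chains would avoid that, which is outside what the example above covers.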

> Start offset going backwards has a legitimate purpose
> -----------------------------------------------------
>
>                 Key: LUCENE-8776
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8776
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/search
>    Affects Versions: 7.6
>            Reporter: Ram Venkat
>            Priority: Major
>         Attachments: LUCENE-8776-proof-of-concept.patch
>
>
> Here is the use case where startOffset can go backwards:
> Say there is a line "Organic light-emitting-diode glows", and I want to run 
> span queries and highlight them properly. 
> During index time, light-emitting-diode is split into three words, which 
> allows me to search for 'light', 'emitting' and 'diode' individually. The 
> three words occupy adjacent positions in the index, as 'light' adjacent to 
> 'emitting' and 'light' at a distance of two words from 'diode' need to match 
> this word. So, the order of words after splitting is: Organic, light, 
> emitting, diode, glows. 
> But, I also want to search for 'organic' being adjacent to 
> 'light-emitting-diode' or 'light-emitting-diode' being adjacent to 'glows'. 
> The way I solved this was to also generate 'light-emitting-diode' at two 
> positions: (a) In the same position as 'light' and (b) in the same position 
> as 'diode', like below:
> ||organic||light||emitting||diode||glows||
> | |light-emitting-diode| |light-emitting-diode| |
> |0|1|2|3|4|
> The positions of the two 'light-emitting-diode' are 1 and 3, but the offsets 
> are obviously the same. This works beautifully in Lucene 5.x in both 
> searching and highlighting with span queries. 
> But when I try this in Lucene 7.6, it hits the condition "Offsets must not go 
> backwards" at DefaultIndexingChain:818. This IllegalArgumentException is 
> being thrown without any comments on why this check is needed. As I explained 
> above, startOffset going backwards is perfectly valid, to deal with word 
> splitting and span operations on these specialized use cases. On the other 
> hand, it is not clear what value is added by this check and which highlighter 
> code is affected by offsets going backwards. This same check is done at 
> BaseTokenStreamTestCase:245. 
> I see others talk about how this check found bugs in WordDelimiter etc. but 
> it also prevents legitimate use cases. Can this check be removed?  
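
The token layout in the quoted description can be checked with a small Python 
sketch.  It is not Lucene code: `emitted`, `span`, `build_stream` and 
`first_backward_offset` are hypothetical names, and the check is only modeled 
on the "offsets must not go backwards" rule mentioned in the description 
(assumed here to compare each start offset with the previous token's).

```python
from typing import List, Optional, Tuple

text = "Organic light-emitting-diode glows"

# (term, position increment) in emission order; an increment of 0 stacks the
# token on the previous token's position, as in the table in the description.
emitted = [
    ("Organic", 1),
    ("light", 1),
    ("light-emitting-diode", 0),
    ("emitting", 1),
    ("diode", 1),
    ("light-emitting-diode", 0),
    ("glows", 1),
]


def span(term: str) -> Tuple[int, int]:
    """Start and end offset of the first occurrence of `term` in `text`."""
    start = text.index(term)
    return start, start + len(term)


def build_stream(tokens) -> List[Tuple[str, int, int, int]]:
    """Turn (term, increment) pairs into (term, position, start, end) tuples."""
    stream = []
    position = -1  # so that the first increment of 1 yields position 0
    for term, increment in tokens:
        position += increment
        start, end = span(term)
        stream.append((term, position, start, end))
    return stream


def first_backward_offset(stream) -> Optional[int]:
    """Index of the first token whose start offset is below the previous one."""
    last_start = 0
    for i, (_, _, start, _) in enumerate(stream):
        if start < last_start:
            return i
        last_start = start
    return None


stream = build_stream(emitted)
for entry in stream:
    print(entry)
print("first backward start offset at index:", first_backward_offset(stream))
```

Under these assumptions, the first 'light-emitting-diode' (position 1) passes 
the check because it shares its start offset with 'light', while the second 
one (position 3) fails because its start offset is lower than that of 'diode', 
which was emitted just before it.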


