[jira] [Commented] (LUCENE-8776) Start offset going backwards has a legitimate purpose

Roman (Jira) Mon, 17 Aug 2020 10:33:41 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-8776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17179143#comment-17179143
 ]


Roman commented on LUCENE-8776:
-------------------------------

[~mgibney] the original issue in this ticket is about the ability to run span 
queries and highlight them properly. As for positions, the posLen attribute 
solves the search part (well, it would if it were indexed, which it is not), 
but it cannot solve the highlight part; at least I don't see how it could. I 
have shown that in this example:

 
{code:java}
term=massachusetts posInc=1 posLen=1 type=word offsetStart=0 offsetEnd=12
term=syn::massachusetts institute of technology posInc=0 posLen=4 type=SYNONYM offsetStart=0 offsetEnd=36
term=syn::mit posInc=0 posLen=4 type=SYNONYM offsetStart=0 offsetEnd=36
term=acr::mit posInc=0 posLen=4 type=ACRONYM offsetStart=0 offsetEnd=36
term=institute posInc=1 posLen=1 type=word offsetStart=13 offsetEnd=22
term=technology posInc=1 posLen=1 type=word offsetStart=26 offsetEnd=36
term=antidesitter posInc=1 posLen=1 type=word offsetStart=41 offsetEnd=53
term=syn::ads posInc=0 posLen=2 type=SYNONYM offsetStart=41 offsetEnd=59
term=acr::ads posInc=0 posLen=2 type=ACRONYM offsetStart=41 offsetEnd=59
term=syn::anti de sitter space posInc=0 posLen=2 type=SYNONYM offsetStart=41 offsetEnd=59
term=syn::antidesitter spacetime posInc=0 posLen=2 type=SYNONYM offsetStart=41 offsetEnd=59
term=syn::antidesitter space posInc=0 posLen=2 type=SYNONYM offsetStart=41 offsetEnd=59
term=space posInc=1 posLen=1 type=word offsetStart=54 offsetEnd=59
term=syn::universe posInc=0 posLen=1 type=SYNONYM offsetStart=54 offsetEnd=59
term=time posInc=1 posLen=1 type=word offsetStart=60 offsetEnd=64
term=spacetime posInc=0 posLen=1 type=word offsetStart=54 offsetEnd=64
{code}
 

{{Notice the posLen=4 of MIT; the position length attribute is **wrong** because 
of stopwords. With "of" removed and no position gap left behind, it would cover 
the tokens `massachusetts institute technology antidesitter`, while the offsets 
are still correct.}}
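
To make the posLen remark concrete, here is a minimal sketch in plain Java (no Lucene dependency). It accumulates posInc into absolute positions for a subset of the tokens in the dump above (the word tokens plus one MIT synonym) and lists which word tokens fall inside a span of four positions starting at the first one. `PosLenSpan`, `Tok`, `covered` and `sampleStream` are names made up for this illustration, and counting positions from 0 is an arbitrary choice.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PosLenSpan {
    /** Hypothetical token holder for illustration; not a Lucene class. */
    public static final class Tok {
        final String term;
        final int posInc;
        final int posLen;

        public Tok(String term, int posInc, int posLen) {
            this.term = term;
            this.posInc = posInc;
            this.posLen = posLen;
        }
    }

    /** Terms of single-position tokens whose position lies in [start, start + posLen). */
    public static List<String> covered(List<Tok> stream, int start, int posLen) {
        List<String> out = new ArrayList<>();
        int pos = -1; // the first token (posInc=1) lands on position 0
        for (Tok t : stream) {
            pos += t.posInc;
            if (t.posLen == 1 && pos >= start && pos < start + posLen) {
                out.add(t.term);
            }
        }
        return out;
    }

    /** Subset of the token dump: word tokens plus the syn::mit synonym. */
    public static List<Tok> sampleStream() {
        return Arrays.asList(
            new Tok("massachusetts", 1, 1),
            new Tok("syn::mit", 0, 4),
            new Tok("institute", 1, 1),
            new Tok("technology", 1, 1),
            new Tok("antidesitter", 1, 1),
            new Tok("space", 1, 1),
            new Tok("time", 1, 1));
    }

    public static void main(String[] args) {
        // syn::mit starts at position 0 with posLen=4
        System.out.println(covered(sampleStream(), 0, 4));
    }
}
```

With posLen=3 the span would stop at `technology`, which is what the original text "Massachusetts Institute of Technology" calls for once "of" is gone.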

 

It is true that our chain is complex, but I don't think that matters at all, 
because the tokenizer is producing the stream of tokens listed above, and it is 
"broken" only in the sense that backward offsets are disallowed (none of our 
filters are broken in the sense of being 'buggy'; they are broken in the sense 
of producing what is now deemed "illegal"). And yet the "illegal" in this case, 
I and some other people would posit, is the "correct" for expert use cases. 
Hopefully this discussion will help revisit the problem.

 

If I were to add the flatten filter, I'd be able to index the tokens because 
the filter trims offsets (makes them only grow, never go back). But if you 
think about it, that is an impossible proposition with multi-token synonyms if 
at the same time you want to index the individual tokens that make them up.

 

1: Massachusetts | Massachusetts Institute of Technology

2: institute

3: technology

 

'MIT' spans three indexed token positions, but it also spans **four original 
token positions** ("Massachusetts Institute of Technology").

And if there were two parallel filters inside one tokenizer chain, each of them 
picking different tokens and paying attention to different stopwords, as 
[~dsmiley] described, I don't see how their output can be interlaced to 
progress in a linear fashion (so as not to trip the offset asserts).

 

BTW: the output example was produced by Lucene 6.x; when ported to Lucene 7.x 
it produces exactly the same stream, but it trips the check at 
[https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/index/DefaultIndexingChain.java#L917]
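
For readers who don't have the source open, the check being tripped can be modelled roughly as follows. This is a simplified sketch in plain Java, not the actual `DefaultIndexingChain` code: it assumes the rule is that each token's start offset must be no smaller than the previous token's start offset, and its end offset no smaller than its own start offset. `OffsetOrderCheck`, `Tok`, `firstViolation` and `sampleStream` are names invented for this illustration. Fed the stream from the example, it flags only the final `spacetime` token, whose start offset 54 follows `time` at 60.

```java
import java.util.Arrays;
import java.util.List;

public class OffsetOrderCheck {
    /** Hypothetical token holder for illustration; not a Lucene class. */
    public static final class Tok {
        final String term;
        final int start;
        final int end;

        public Tok(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    /** Index of the first token breaking the assumed rule, or -1 if none does. */
    public static int firstViolation(List<Tok> stream) {
        int lastStart = 0;
        for (int i = 0; i < stream.size(); i++) {
            Tok t = stream.get(i);
            if (t.start < lastStart || t.end < t.start) {
                return i; // start offset went backwards, or end precedes start
            }
            lastStart = t.start;
        }
        return -1;
    }

    /** Terms and offsets copied from the token dump in this comment. */
    public static List<Tok> sampleStream() {
        return Arrays.asList(
            new Tok("massachusetts", 0, 12),
            new Tok("syn::massachusetts institute of technology", 0, 36),
            new Tok("syn::mit", 0, 36),
            new Tok("acr::mit", 0, 36),
            new Tok("institute", 13, 22),
            new Tok("technology", 26, 36),
            new Tok("antidesitter", 41, 53),
            new Tok("syn::ads", 41, 59),
            new Tok("acr::ads", 41, 59),
            new Tok("syn::anti de sitter space", 41, 59),
            new Tok("syn::antidesitter spacetime", 41, 59),
            new Tok("syn::antidesitter space", 41, 59),
            new Tok("space", 54, 59),
            new Tok("syn::universe", 54, 59),
            new Tok("time", 60, 64),
            new Tok("spacetime", 54, 64));
    }

    public static void main(String[] args) {
        List<Tok> stream = sampleStream();
        int i = firstViolation(stream);
        System.out.println(i < 0
            ? "no violation"
            : "violation at token " + i + ": " + stream.get(i).term);
    }
}
```

Under this model every multi-token synonym in the stream passes, because each is emitted at the start offset of its first word; only the token that is emitted after a later word has already gone by is rejected.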

 

 

 

> Start offset going backwards has a legitimate purpose
> -----------------------------------------------------
>
>                 Key: LUCENE-8776
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8776
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/search
>    Affects Versions: 7.6
>            Reporter: Ram Venkat
>            Priority: Major
>         Attachments: LUCENE-8776-proof-of-concept.patch
>
>
> Here is the use case where startOffset can go backwards:
> Say there is a line "Organic light-emitting-diode glows", and I want to run 
> span queries and highlight them properly. 
> During index time, light-emitting-diode is split into three words, which 
> allows me to search for 'light', 'emitting' and 'diode' individually. The 
> three words occupy adjacent positions in the index, as 'light' adjacent to 
> 'emitting' and 'light' at a distance of two words from 'diode' need to match 
> this word. So, the order of words after splitting is: Organic, light, 
> emitting, diode, glows. 
> But, I also want to search for 'organic' being adjacent to 
> 'light-emitting-diode' or 'light-emitting-diode' being adjacent to 'glows'. 
> The way I solved this was to also generate 'light-emitting-diode' at two 
> positions: (a) In the same position as 'light' and (b) in the same position 
> as 'glows', like below:
> ||organic||light||emitting||diode||glows||
> | |light-emitting-diode| |light-emitting-diode| |
> |0|1|2|3|4|
> The positions of the two 'light-emitting-diode' are 1 and 3, but the offsets 
> are obviously the same. This works beautifully in Lucene 5.x in both 
> searching and highlighting with span queries. 
> But when I try this in Lucene 7.6, it hits the condition "Offsets must not go 
> backwards" at DefaultIndexingChain:818. This IllegalArgumentException is 
> being thrown without any comments on why this check is needed. As I explained 
> above, startOffset going backwards is perfectly valid, to deal with word 
> splitting and span operations on these specialized use cases. On the other 
> hand, it is not clear what value is added by this check and which highlighter 
> code is affected by offsets going backwards. This same check is done at 
> BaseTokenStreamTestCase:245. 
> I see others talk about how this check found bugs in WordDelimiter etc. but 
> it also prevents legitimate use cases. Can this check be removed?  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
