[jira] [Commented] (LUCENE-8776) Start offset going backwards has a legitimate purpose

Roman (Jira) Fri, 14 Aug 2020 09:13:48 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-8776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177852#comment-17177852
 ]


Roman commented on LUCENE-8776:
-------------------------------

{{Sorry for cross-posting (to the forum and here). I will try to study 
[~dweiss]'s example, but here is a writeup that may be useful. Please jump to 
the last example, where the PositionLength attribute would fail us.}}

{{`assertU(adoc("id", "603", "bibcode", "xxxxxxxxxx603",}}
{{        "title", "THE HUBBLE constant: a summary of the hubble space 
telescope program"));`}}


{{`term=hubble posInc=2 posLen=1 type=word offsetStart=4 offsetEnd=10}}
{{term=acr::hubble posInc=0 posLen=1 type=ACRONYM offsetStart=4 offsetEnd=10}}
{{term=constant posInc=1 posLen=1 type=word offsetStart=11 offsetEnd=20}}
{{term=summary posInc=1 posLen=1 type=word offsetStart=23 offsetEnd=30}}
{{term=hubble posInc=1 posLen=1 type=word offsetStart=38 offsetEnd=44}}
{{term=syn::hubble space telescope posInc=0 posLen=3 type=SYNONYM 
offsetStart=38 offsetEnd=60}}
{{term=syn::hst posInc=0 posLen=3 type=SYNONYM offsetStart=38 offsetEnd=60}}
{{term=acr::hst posInc=0 posLen=3 type=ACRONYM offsetStart=38 offsetEnd=60}}
{{* term=space posInc=1 posLen=1 type=word offsetStart=45 offsetEnd=50}}
{{term=telescope posInc=1 posLen=1 type=word offsetStart=51 offsetEnd=60}}
{{term=program posInc=1 posLen=1 type=word offsetStart=61 offsetEnd=68`}}

{{* - fails because offsetEnd < lastToken.offsetEnd. If reordered (the 
multi-token synonym emitted as the last token) it would fail as well, because 
of the check that compares lastToken.beginOffset with 
currentToken.beginOffset: the synonym's start offset (38) would be lower than 
that of the token emitted before it. Basically, any reordering would result in 
a failure (unless offsets are trimmed).}}
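
To make the two orderings concrete, here is a minimal sketch in plain Java. It does not use Lucene and is not Lucene's actual check; `OffsetOrderSketch`, `Tok`, and the method names are hypothetical, and the two conditions are simply the ones named in the note above, applied to the tail of the token dump.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only: a simplified stand-in for offset-ordering checks, not Lucene's code. */
final class OffsetOrderSketch {

    /** Hypothetical minimal token: term text plus start/end character offsets. */
    static final class Tok {
        final String term;
        final int start;
        final int end;

        Tok(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    /** Index of the first token whose start offset is below the previous token's, or -1. */
    static int firstBackwardStart(List<Tok> tokens) {
        for (int i = 1; i < tokens.size(); i++) {
            if (tokens.get(i).start < tokens.get(i - 1).start) {
                return i;
            }
        }
        return -1;
    }

    /** Index of the first token whose end offset is below the previous token's, or -1. */
    static int firstBackwardEnd(List<Tok> tokens) {
        for (int i = 1; i < tokens.size(); i++) {
            if (tokens.get(i).end < tokens.get(i - 1).end) {
                return i;
            }
        }
        return -1;
    }

    /** Tail of the first example, in the order the tokens were emitted. */
    static List<Tok> emittedOrder() {
        List<Tok> t = new ArrayList<>();
        t.add(new Tok("hubble", 38, 44));
        t.add(new Tok("syn::hubble space telescope", 38, 60));
        t.add(new Tok("syn::hst", 38, 60));
        t.add(new Tok("acr::hst", 38, 60));
        t.add(new Tok("space", 45, 50));
        t.add(new Tok("telescope", 51, 60));
        t.add(new Tok("program", 61, 68));
        return t;
    }

    /** Same tokens, with the multi-token synonyms moved after "telescope". */
    static List<Tok> synonymsLast() {
        List<Tok> t = new ArrayList<>();
        t.add(new Tok("hubble", 38, 44));
        t.add(new Tok("space", 45, 50));
        t.add(new Tok("telescope", 51, 60));
        t.add(new Tok("syn::hubble space telescope", 38, 60));
        t.add(new Tok("syn::hst", 38, 60));
        t.add(new Tok("acr::hst", 38, 60));
        t.add(new Tok("program", 61, 68));
        return t;
    }

    public static void main(String[] args) {
        // Emitted order: end offset drops from 60 to 50 at index 4 ("space").
        System.out.println(firstBackwardEnd(emittedOrder()));
        // Reordered: start offset drops from 51 to 38 at index 3 (the synonym).
        System.out.println(firstBackwardStart(synonymsLast()));
    }
}
```

In this sketch the emitted order trips only the end-offset condition, and the reordered stream trips only the start-offset condition, which matches the observation that either ordering fails.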



{{The following example has an additional twist because of `space-time`: the 
tokenizer first splits the word and generates two new tokens, and those 
alternative tokens are then used to find synonyms (space == universe).}}

{{`assertU(adoc("id", "605", "bibcode", "xxxxxxxxxx604",}}
{{        "title", "MIT and anti de sitter space-time"));`}}


{{`term=xxxxxxxxxx604 posInc=1 posLen=1 type=word offsetStart=0 offsetEnd=13}}
{{term=mit posInc=1 posLen=1 type=word offsetStart=0 offsetEnd=3}}
{{term=acr::mit posInc=0 posLen=1 type=ACRONYM offsetStart=0 offsetEnd=3}}
{{term=syn::massachusetts institute of technology posInc=0 posLen=1 
type=SYNONYM offsetStart=0 offsetEnd=3}}
{{term=syn::mit posInc=0 posLen=1 type=SYNONYM offsetStart=0 offsetEnd=3}}
{{term=acr::mit posInc=0 posLen=1 type=ACRONYM offsetStart=0 offsetEnd=3}}
{{term=anti posInc=1 posLen=1 type=word offsetStart=8 offsetEnd=12}}
{{term=syn::ads posInc=0 posLen=4 type=SYNONYM offsetStart=8 offsetEnd=28}}
{{term=acr::ads posInc=0 posLen=4 type=ACRONYM offsetStart=8 offsetEnd=28}}
{{term=syn::anti de sitter space posInc=0 posLen=4 type=SYNONYM offsetStart=8 
offsetEnd=28}}
{{term=syn::antidesitter spacetime posInc=0 posLen=4 type=SYNONYM offsetStart=8 
offsetEnd=28}}
{{term=syn::antidesitter space posInc=0 posLen=4 type=SYNONYM offsetStart=8 
offsetEnd=28}}
{{* term=de posInc=1 posLen=1 type=word offsetStart=13 offsetEnd=15}}
{{term=sitter posInc=1 posLen=1 type=word offsetStart=16 offsetEnd=22}}
{{term=space posInc=1 posLen=1 type=word offsetStart=23 offsetEnd=28}}
{{term=syn::universe posInc=0 posLen=1 type=SYNONYM offsetStart=23 
offsetEnd=28}}
{{term=time posInc=1 posLen=1 type=word offsetStart=29 offsetEnd=33}}
{{term=spacetime posInc=0 posLen=1 type=word offsetStart=23 offsetEnd=33`}}

{{So far, all of these cases could be handled with the new position length 
attribute. But let us look at a case where that would fail too.}}

{{`assertU(adoc("id", "606", "bibcode", "xxxxxxxxxx604",}}
{{        "title", "Massachusetts Institute of Technology and antidesitter 
space-time"));`}}


{{`term=massachusetts posInc=1 posLen=1 type=word offsetStart=0 offsetEnd=12}}
{{term=syn::massachusetts institute of technology posInc=0 posLen=4 
type=SYNONYM offsetStart=0 offsetEnd=36}}
{{term=syn::mit posInc=0 posLen=4 type=SYNONYM offsetStart=0 offsetEnd=36}}
{{term=acr::mit posInc=0 posLen=4 type=ACRONYM offsetStart=0 offsetEnd=36}}
{{term=institute posInc=1 posLen=1 type=word offsetStart=13 offsetEnd=22}}
{{term=technology posInc=1 posLen=1 type=word offsetStart=26 offsetEnd=36}}
{{term=antidesitter posInc=1 posLen=1 type=word offsetStart=41 offsetEnd=53}}
{{term=syn::ads posInc=0 posLen=2 type=SYNONYM offsetStart=41 offsetEnd=59}}
{{term=acr::ads posInc=0 posLen=2 type=ACRONYM offsetStart=41 offsetEnd=59}}
{{term=syn::anti de sitter space posInc=0 posLen=2 type=SYNONYM offsetStart=41 
offsetEnd=59}}
{{term=syn::antidesitter spacetime posInc=0 posLen=2 type=SYNONYM 
offsetStart=41 offsetEnd=59}}
{{term=syn::antidesitter space posInc=0 posLen=2 type=SYNONYM offsetStart=41 
offsetEnd=59}}
{{term=space posInc=1 posLen=1 type=word offsetStart=54 offsetEnd=59}}
{{term=syn::universe posInc=0 posLen=1 type=SYNONYM offsetStart=54 
offsetEnd=59}}
{{term=time posInc=1 posLen=1 type=word offsetStart=60 offsetEnd=64}}
{{term=spacetime posInc=0 posLen=1 type=word offsetStart=54 offsetEnd=64`}}

{{Notice the posLen=4 of the MIT synonyms. Because `of` is not emitted as a 
token, a position length of 4 would cover the tokens `massachusetts institute 
technology antidesitter`, while the offsets (0 to 36) are still correct.}}
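
The mismatch can be reproduced with a small sketch in plain Java (no Lucene dependency). `PosLenCoverageSketch` and `covered` are hypothetical names. The sketch assumes the usual convention that positions start at -1 and each token advances by its posInc, and it considers only the `type=word` tokens with posInc=1 from the dump above.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only: which word tokens a given position length spans. */
final class PosLenCoverageSketch {

    // The type=word tokens with posInc=1 from the last example, in emitted order.
    static final String[] WORDS = {
        "massachusetts", "institute", "technology", "antidesitter", "space", "time"
    };
    static final int[] POS_INCS = {1, 1, 1, 1, 1, 1};

    /**
     * Returns the terms whose position falls in [p, p + posLen), where p is the
     * position of the token at index anchor.
     */
    static String covered(String[] terms, int[] posIncs, int anchor, int posLen) {
        int[] positions = new int[terms.length];
        int pos = -1; // assumed convention: the first posInc=1 yields position 0
        for (int i = 0; i < terms.length; i++) {
            pos += posIncs[i];
            positions[i] = pos;
        }
        int from = positions[anchor];
        int to = from + posLen; // exclusive upper bound
        List<String> out = new ArrayList<>();
        for (int i = 0; i < terms.length; i++) {
            if (positions[i] >= from && positions[i] < to) {
                out.add(terms[i]);
            }
        }
        return String.join(" ", out);
    }

    public static void main(String[] args) {
        // syn::mit starts at the position of "massachusetts" with posLen=4.
        System.out.println(covered(WORDS, POS_INCS, 0, 4));
        // syn::ads starts at the position of "antidesitter" with posLen=2.
        System.out.println(covered(WORDS, POS_INCS, 3, 2));
    }
}
```

Under these assumptions the first call yields `massachusetts institute technology antidesitter`, one token past the phrase the synonym stands for, while the second yields `antidesitter space`.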

> Start offset going backwards has a legitimate purpose
> -----------------------------------------------------
>
>                 Key: LUCENE-8776
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8776
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/search
>    Affects Versions: 7.6
>            Reporter: Ram Venkat
>            Priority: Major
>         Attachments: LUCENE-8776-proof-of-concept.patch
>
>
> Here is the use case where startOffset can go backwards:
> Say there is a line "Organic light-emitting-diode glows", and I want to run 
> span queries and highlight them properly. 
> At index time, light-emitting-diode is split into three words, which 
> allows me to search for 'light', 'emitting' and 'diode' individually. The 
> three words occupy adjacent positions in the index, because 'light' adjacent 
> to 'emitting' and 'light' at a distance of two words from 'diode' need to 
> match this word. So, the order of words after splitting is: Organic, light, 
> emitting, diode, glows. 
> But, I also want to search for 'organic' being adjacent to 
> 'light-emitting-diode' or 'light-emitting-diode' being adjacent to 'glows'. 
> The way I solved this was to also generate 'light-emitting-diode' at two 
> positions: (a) in the same position as 'light' and (b) in the same position 
> as 'diode', like below:
> ||organic||light||emitting||diode||glows||
> | |light-emitting-diode| |light-emitting-diode| |
> |0|1|2|3|4|
> The positions of the two 'light-emitting-diode' are 1 and 3, but the offsets 
> are obviously the same. This works beautifully in Lucene 5.x in both 
> searching and highlighting with span queries. 
> But when I try this in Lucene 7.6, it hits the condition "Offsets must not go 
> backwards" at DefaultIndexingChain:818. This IllegalArgumentException is 
> being thrown without any comments on why this check is needed. As I explained 
> above, startOffset going backwards is perfectly valid, to deal with word 
> splitting and span operations on these specialized use cases. On the other 
> hand, it is not clear what value is added by this check and which highlighter 
> code is affected by offsets going backwards. This same check is done at 
> BaseTokenStreamTestCase:245. 
> I see others talk about how this check found bugs in WordDelimiter etc. but 
> it also prevents legitimate use cases. Can this check be removed?  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

