[ https://issues.apache.org/jira/browse/LUCENE-8776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17179143#comment-17179143 ]
Roman commented on LUCENE-8776:
-------------------------------

[~mgibney] the original issue in this ticket is about the ability to run span queries and highlight them properly. As for positions, the posLen attribute solves the search part (well, it would if it were indexed, which it is not); but it cannot solve the highlight part, at least I don't see how it could. I have shown that in this example:

{code:java}
term=massachusetts posInc=1 posLen=1 type=word offsetStart=0 offsetEnd=12
term=syn::massachusetts institute of technology posInc=0 posLen=4 type=SYNONYM offsetStart=0 offsetEnd=36
term=syn::mit posInc=0 posLen=4 type=SYNONYM offsetStart=0 offsetEnd=36
term=acr::mit posInc=0 posLen=4 type=ACRONYM offsetStart=0 offsetEnd=36
term=institute posInc=1 posLen=1 type=word offsetStart=13 offsetEnd=22
term=technology posInc=1 posLen=1 type=word offsetStart=26 offsetEnd=36
term=antidesitter posInc=1 posLen=1 type=word offsetStart=41 offsetEnd=53
term=syn::ads posInc=0 posLen=2 type=SYNONYM offsetStart=41 offsetEnd=59
term=acr::ads posInc=0 posLen=2 type=ACRONYM offsetStart=41 offsetEnd=59
term=syn::anti de sitter space posInc=0 posLen=2 type=SYNONYM offsetStart=41 offsetEnd=59
term=syn::antidesitter spacetime posInc=0 posLen=2 type=SYNONYM offsetStart=41 offsetEnd=59
term=syn::antidesitter space posInc=0 posLen=2 type=SYNONYM offsetStart=41 offsetEnd=59
term=space posInc=1 posLen=1 type=word offsetStart=54 offsetEnd=59
term=syn::universe posInc=0 posLen=1 type=SYNONYM offsetStart=54 offsetEnd=59
term=time posInc=1 posLen=1 type=word offsetStart=60 offsetEnd=64
term=spacetime posInc=0 posLen=1 type=word offsetStart=54 offsetEnd=64
{code}

{{Notice the posLen=4 of MIT; the position length attribute is **wrong** because of stopwords. It would cover tokens `massachusetts institute technology antidesitter` while offsets are still correct.}}

That our chain is complex is true, but I don't think it matters at all, because the tokenizer is producing the stream of tokens listed above, and it is "broken" only in the sense that backward offsets are disallowed (none of our filters are broken in the sense of being 'buggy'; they are broken in the sense of producing what is now deemed "illegal"). And yet the "illegal" in this case, I and some other people would posit, is the "correct" for expert use cases. The discussion will hopefully help to revisit the problem.

If I were to add the flatten filter, I'd be able to index the tokens because the filter trims (makes offsets only grow, never go back). But if you think about it: that is an impossible proposition with multi-token synonyms if at the same time you want to index the individual tokens that make them up.

1: Massachusetts | Massachusetts Institute of Technology
2: institute
3: technology

'MIT' spans three indexed-token positions, but it also spans **four original token positions** ("Massachusetts Institute of Technology").

And if there were two parallel filters inside one tokenizer chain, each of them picking different tokens and paying attention to different stopwords, as [~dsmiley] described, I don't see how their output can be interlaced to progress in a linear fashion (so as not to trip the offset asserts).

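To make the two observations above easier to check, here is a minimal sketch in plain Java with no Lucene dependency. Everything in it is hypothetical and for illustration only: the `Tok` class, `sampleStream`, `firstBackwardOffset` and `coveredWords` are made-up names, and the offset test is a simplified stand-in for the indexing-time check, not the actual Lucene code. It replays the dump, reports the first token whose start offset is smaller than its predecessor's, and lists which word tokens a given posLen would cover once positions are computed from posInc.

```java
import java.util.ArrayList;
import java.util.List;

final class TokenStreamSketch {

    /** Hypothetical plain-data token mirroring the attributes printed in the dump. */
    static final class Tok {
        final String term;
        final String type;
        final int posInc, posLen, startOffset, endOffset;

        Tok(String term, int posInc, int posLen, String type, int startOffset, int endOffset) {
            this.term = term;
            this.posInc = posInc;
            this.posLen = posLen;
            this.type = type;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    /** The token stream from the dump, in the same order. */
    static List<Tok> sampleStream() {
        List<Tok> t = new ArrayList<>();
        t.add(new Tok("massachusetts", 1, 1, "word", 0, 12));
        t.add(new Tok("syn::massachusetts institute of technology", 0, 4, "SYNONYM", 0, 36));
        t.add(new Tok("syn::mit", 0, 4, "SYNONYM", 0, 36));
        t.add(new Tok("acr::mit", 0, 4, "ACRONYM", 0, 36));
        t.add(new Tok("institute", 1, 1, "word", 13, 22));
        t.add(new Tok("technology", 1, 1, "word", 26, 36));
        t.add(new Tok("antidesitter", 1, 1, "word", 41, 53));
        t.add(new Tok("syn::ads", 0, 2, "SYNONYM", 41, 59));
        t.add(new Tok("acr::ads", 0, 2, "ACRONYM", 41, 59));
        t.add(new Tok("syn::anti de sitter space", 0, 2, "SYNONYM", 41, 59));
        t.add(new Tok("syn::antidesitter spacetime", 0, 2, "SYNONYM", 41, 59));
        t.add(new Tok("syn::antidesitter space", 0, 2, "SYNONYM", 41, 59));
        t.add(new Tok("space", 1, 1, "word", 54, 59));
        t.add(new Tok("syn::universe", 0, 1, "SYNONYM", 54, 59));
        t.add(new Tok("time", 1, 1, "word", 60, 64));
        t.add(new Tok("spacetime", 0, 1, "word", 54, 64));
        return t;
    }

    /** Index of the first token whose startOffset is below the previous token's, or -1. */
    static int firstBackwardOffset(List<Tok> tokens) {
        int lastStart = 0;
        for (int i = 0; i < tokens.size(); i++) {
            int start = tokens.get(i).startOffset;
            if (start < lastStart) {
                return i;
            }
            lastStart = start;
        }
        return -1;
    }

    /** Terms of the type=word tokens whose position lies in [pos(target), pos(target) + posLen). */
    static List<String> coveredWords(List<Tok> tokens, int targetIndex) {
        // Absolute positions are the running sum of posInc, starting from -1.
        int[] positions = new int[tokens.size()];
        int pos = -1;
        for (int i = 0; i < tokens.size(); i++) {
            pos += tokens.get(i).posInc;
            positions[i] = pos;
        }
        int from = positions[targetIndex];
        int to = from + tokens.get(targetIndex).posLen;
        List<String> covered = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            if ("word".equals(tokens.get(i).type) && positions[i] >= from && positions[i] < to) {
                covered.add(tokens.get(i).term);
            }
        }
        return covered;
    }

    public static void main(String[] args) {
        List<Tok> stream = sampleStream();
        int bad = firstBackwardOffset(stream);
        // Index 15 is the trailing "spacetime" token (start 54 after "time" at 60).
        System.out.println(bad + " " + stream.get(bad).term);
        // Index 2 is syn::mit with posLen=4.
        System.out.println(coveredWords(stream, 2));
    }
}
```

Under these assumptions the only token in the dump that steps backwards is the final `spacetime` (start offset 54 following `time` at 60), and the posLen=4 of `syn::mit` reaches one position too far, into `antidesitter`, because the stopword "of" no longer occupies a position.
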
BTW: The output example is produced by Lucene 6.x; when ported to Lucene 7.x it produces exactly the same stream, but it trips [https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/index/DefaultIndexingChain.java#L917]

> Start offset going backwards has a legitimate purpose
> -----------------------------------------------------
>
>                 Key: LUCENE-8776
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8776
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/search
>    Affects Versions: 7.6
>            Reporter: Ram Venkat
>            Priority: Major
>         Attachments: LUCENE-8776-proof-of-concept.patch
>
>
> Here is the use case where startOffset can go backwards:
> Say there is a line "Organic light-emitting-diode glows", and I want to run span queries and highlight them properly.
> During index time, light-emitting-diode is split into three words, which allows me to search for 'light', 'emitting' and 'diode' individually. The three words occupy adjacent positions in the index, as 'light' adjacent to 'emitting' and 'light' at a distance of two words from 'diode' need to match this word. So, the order of words after splitting are: Organic, light, emitting, diode, glows.
> But, I also want to search for 'organic' being adjacent to 'light-emitting-diode' or 'light-emitting-diode' being adjacent to 'glows'.
> The way I solved this was to also generate 'light-emitting-diode' at two positions: (a) In the same position as 'light' and (b) in the same position as 'glows', like below:
> ||organic||light||emitting||diode||glows||
> | |light-emitting-diode| |light-emitting-diode| |
> |0|1|2|3|4|
> The positions of the two 'light-emitting-diode' are 1 and 3, but the offsets are obviously the same. This works beautifully in Lucene 5.x in both searching and highlighting with span queries.
> But when I try this in Lucene 7.6, it hits the condition "Offsets must not go backwards" at DefaultIndexingChain:818. This IllegalArgumentException is being thrown without any comments on why this check is needed. As I explained above, startOffset going backwards is perfectly valid, to deal with word splitting and span operations on these specialized use cases. On the other hand, it is not clear what value is added by this check and which highlighter code is affected by offsets going backwards. This same check is done at BaseTokenStreamTestCase:245.
> I see others talk about how this check found bugs in WordDelimiter etc. but it also prevents legitimate use cases. Can this check be removed?