There already is one: https://issues.apache.org/jira/browse/LUCENE-8776
~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley

On Mon, Aug 10, 2020 at 1:30 PM Roman Chyla <roman.ch...@gmail.com> wrote:

> I'll have to somehow find a solution for this situation, giving up
> offsets seems like too big a price to pay, I see that overriding
> DefaultIndexingChain is not exactly easy -- the only thing I can think
> of is to just trick the classloader into giving it a different version
> of the chain (praying this can be done without compromising security,
> I have not followed JDK evolutions for some time...) - aside from
> forking Lucene and editing that; which I decidedly don't want to do
> (monkey-patching it, ok, I can live with that... :-))
>
> It *seems* to me that the original reason for negative offset checks
> stemmed from the fact that vint could have been written (and possibly
> vlong too) - https://issues.apache.org/jira/browse/LUCENE-3738
>
> but the underlying issue and some of the patches seem to have been
> addressing those problems; but a much shorter version of the patch was
> committed -- despite the perf results not being indicative (i.e. it
> could have been good with the longer patch) -- but to really
> understand it, one would have to spend more than 10 minutes reading the
> comments
>
> Further to the point, I think negative offsets can be produced only on
> the very first token, unless there is a bug in a filter (there was/is
> a separate check for that in 6x and perhaps it is still there in 7x).
> That would be much less restrictive than the current condition which
> disallows all backward offsets. We never ran into an index corruption
> in Lucene 4-6x, so I really wonder if the "forbid all backwards
> offsets" approach might be too restrictive.
>
> Looks like I should create an issue...
>
> On Thu, Aug 6, 2020 at 11:28 AM Gus Heck <gus.h...@gmail.com> wrote:
> >
> > I've had a nearly identical experience to what Dave describes, I also
> > chafe under this restriction.
> >
> > On Thu, Aug 6, 2020 at 11:07 AM David Smiley <dsmi...@apache.org> wrote:
> >>
> >> I sympathize with your pain, Roman.
> >>
> >> It appears we can't really do index-time multi-word synonyms because of
> >> the offset ordering rule. But it's not just synonyms, it's other forms of
> >> multi-token expansion. Where I work, I've seen an interesting approach to
> >> mixed language text analysis in which a sophisticated Tokenizer effectively
> >> re-tokenizes an input multiple ways by producing a token stream that is a
> >> concatenation of different interpretations of the input. On a Lucene
> >> upgrade, we had to "coarsen" the offsets to the point of having highlights
> >> that point to a whole sentence instead of the words in that sentence :-(.
> >> I need to do something to fix this; I'm trying hard to resist modifying our
> >> Lucene fork for this constraint. Maybe instead of concatenating, it might
> >> be interleaved / overlapped but the interpretations aren't necessarily
> >> aligned to make this possible without risking breaking position-sensitive
> >> queries.
> >>
> >> So... I'm not a fan of this constraint on offsets.
> >>
> >> ~ David Smiley
> >> Apache Lucene/Solr Search Developer
> >> http://www.linkedin.com/in/davidwsmiley
> >>
> >>
> >> On Thu, Aug 6, 2020 at 10:49 AM Roman Chyla <roman.ch...@gmail.com> wrote:
> >>>
> >>> Hi Mike,
> >>>
> >>> Yes, they are not zero offsets - I was instinctively avoiding
> >>> "negative offsets"; but they are indeed backward offsets.
> >>>
> >>> Here is the token stream as produced by the analyzer chain indexing
> >>> "THE HUBBLE constant: a summary of the hubble space telescope program"
> >>>
> >>> term=hubble pos=2 type=word offsetStart=4 offsetEnd=10
> >>> term=acr::hubble pos=0 type=ACRONYM offsetStart=4 offsetEnd=10
> >>> term=constant pos=1 type=word offsetStart=11 offsetEnd=20
> >>> term=summary pos=1 type=word offsetStart=23 offsetEnd=30
> >>> term=hubble pos=1 type=word offsetStart=38 offsetEnd=44
> >>> term=syn::hubble space telescope pos=0 type=SYNONYM offsetStart=38 offsetEnd=60
> >>> term=syn::hst pos=0 type=SYNONYM offsetStart=38 offsetEnd=60
> >>> term=acr::hst pos=0 type=ACRONYM offsetStart=38 offsetEnd=60
> >>> term=space pos=1 type=word offsetStart=45 offsetEnd=50
> >>> term=telescope pos=1 type=word offsetStart=51 offsetEnd=60
> >>> term=program pos=1 type=word offsetStart=61 offsetEnd=68
> >>>
> >>> Sometimes, we'll even have a situation when synonyms overlap: for
> >>> example "anti de sitter space time"
> >>>
> >>> "anti de sitter space time" -> "antidesitter space" (one token
> >>> spanning offsets 0-26; it gets emitted with the first token "anti"
> >>> right now)
> >>> "space time" -> "spacetime" (synonym 16-26)
> >>> "space" -> "universe" (25-26)
> >>>
> >>> Yes, weird, but useful if people want to search for `universe NEAR
> >>> anti` -- but another use case which would be prohibited by the "new"
> >>> rule.
> >>>
> >>> DefaultIndexingChain checks new token offset against the last emitted
> >>> token, so I don't see a way to emit the multi-token synonym with
> >>> offsets spanning multiple tokens if even one of these tokens was
> >>> already emitted.
> >>> And the complement is equally true: if multi-token is
> >>> emitted as last of the group - it trips over
> >>> `startOffset < invertState.lastStartOffset`
> >>>
> >>> https://github.com/apache/lucene-solr/blame/master/lucene/core/src/java/org/apache/lucene/index/DefaultIndexingChain.java#L915
> >>>
> >>>
> >>> -roman
> >>>
> >>> On Thu, Aug 6, 2020 at 6:17 AM Michael McCandless
> >>> <luc...@mikemccandless.com> wrote:
> >>> >
> >>> > Hi Roman,
> >>> >
> >>> > Hmm, this is all very tricky!
> >>> >
> >>> > First off, why do you call this "zero offsets"? Isn't it "backwards
> >>> > offsets" that your analysis chain is trying to produce?
> >>> >
> >>> > Second, in your first example, if you output the tokens in the right
> >>> > order, they would not violate the "offsets do not go backwards" check in
> >>> > IndexWriter? I thought IndexWriter is just checking that the startOffset
> >>> > for a token is not lower than the previous token's startOffset? (And that
> >>> > the token's endOffset is not lower than its startOffset).
> >>> >
> >>> > So I am confused why your first example is tripping up on IW's
> >>> > offset checks. Could you maybe redo the example, listing single token per
> >>> > line with the start/end offsets they are producing?
> >>> >
> >>> > Mike McCandless
> >>> >
> >>> > http://blog.mikemccandless.com
> >>> >
> >>> >
> >>> > On Wed, Aug 5, 2020 at 6:41 PM Roman Chyla <roman.ch...@gmail.com> wrote:
> >>> >>
> >>> >> Hello devs,
> >>> >>
> >>> >> I wanted to create an issue but the helpful message in red letters
> >>> >> reminded me to ask first.
> >>> >>
> >>> >> While porting from lucene 6.x to 7x I'm struggling with a change that
> >>> >> was introduced in LUCENE-7626
> >>> >> (https://issues.apache.org/jira/browse/LUCENE-7626)
> >>> >>
> >>> >> It is believed that zero offset tokens are bad bad - Mike McCandless
> >>> >> made the change which made me automatically doubt myself. I must be
> >>> >> wrong, hell, I was living in sin the past 5 years!
> >>> >>
> >>> >> Sadly, we have been indexing and searching large volumes of data
> >>> >> without any corruption in index whatsoever, but also without this new
> >>> >> change:
> >>> >>
> >>> >> https://github.com/apache/lucene-solr/commit/64b86331c29d074fa7b257d65d3fda3b662bf96a#diff-cbdbb154cb6f3553edff2fcdb914a0c2L774
> >>> >>
> >>> >> With that change, our multi-token synonyms house of cards is falling.
> >>> >>
> >>> >> Mike has this wonderful blogpost explaining troubles with multi-token synonyms:
> >>> >> http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
> >>> >>
> >>> >> Recommended way to index multi-token synonyms appears to be this:
> >>> >> https://stackoverflow.com/questions/19927537/multi-word-synonyms-in-solr
> >>> >>
> >>> >> BUT, but! We don't want to place multi-token synonym into the same
> >>> >> position as the other words. We want to preserve their positions! We
> >>> >> want to preserve information about offsets!
> >>> >>
> >>> >> Here is an example:
> >>> >>
> >>> >> * THE HUBBLE constant: a summary of the HUBBLE SPACE TELESCOPE program
> >>> >>
> >>> >> This is how it gets indexed
> >>> >>
> >>> >> [(0, []),
> >>> >> (1, ['acr::hubble']),
> >>> >> (2, ['constant']),
> >>> >> (3, ['summary']),
> >>> >> (4, []),
> >>> >> (5, ['acr::hubble', 'syn::hst', 'syn::hubble space telescope', 'hubble']),
> >>> >> (6, ['acr::space', 'space']),
> >>> >> (7, ['acr::telescope', 'telescope']),
> >>> >> (8, ['program']),
> >>> >>
> >>> >> Notice the position 5 - multi-token synonym `syn::hubble space
> >>> >> telescope` token is on the first token which started the group
> >>> >> (emitted by Lucene's synonym filter). hst is another synonym; we also
> >>> >> index the 'hubble' word there.
> >>> >>
> >>> >> if you were to search for a phrase "HST program" it will be found
> >>> >> because our search parser will search for ("HST ? ?
> >>> >> program" | "Hubble Space Telescope program")
> >>> >>
> >>> >> It simply found that by looking at synonyms: HST -> Hubble Space Telescope
> >>> >>
> >>> >> And because of those funny 'syn::' prefixes, we don't suffer from the
> >>> >> other problem that Mike described -- "hst space" phrase search will
> >>> >> NOT find this paper (and that is a correct behaviour)
> >>> >>
> >>> >> But all of this is possible only because lucene was indexing tokens
> >>> >> with offsets that can be lower than the last emitted token; for
> >>> >> example 'hubble space telescope' will have offset 21-45; and the next
> >>> >> emitted token "space" will have offset 28-33
> >>> >>
> >>> >> And it just works (lucene 6.x)
> >>> >>
> >>> >> Here is another proof with the appropriate verbiage ("crazy"):
> >>> >>
> >>> >> https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/analysis/TestAdsabsTypeFulltextParsing.java#L618
> >>> >>
> >>> >> Zero offsets have been working wonderfully for us so far. And I
> >>> >> actually cannot imagine how it can work without them - i.e. without
> >>> >> the ability to emit a token stream with offsets that are lower than
> >>> >> the last seen token.
> >>> >>
> >>> >> I haven't tried SynonymFlatten filter, but because of this line in the
> >>> >> DefaultIndexingChain - I'm convinced the flatten symbol is not going
> >>> >> to do what we need (as seen in the example above)
> >>> >>
> >>> >> https://github.com/apache/lucene-solr/blame/master/lucene/core/src/java/org/apache/lucene/index/DefaultIndexingChain.java#L915
> >>> >>
> >>> >> What would you say? Is it a bug, is it not a bug but just some special
> >>> >> use case? If it is a special use case, what do we need to do? Plug in
> >>> >> our own indexing chain?
> >>> >>
> >>> >> Thanks!
> >>> >>
> >>> >> -roman
> >>> >>
> >>> >> ---------------------------------------------------------------------
> >>> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> >>> >> For additional commands, e-mail: dev-h...@lucene.apache.org
> >>> >>
> >
> > --
> > http://www.needhamsoftware.com (work)
> > http://www.the111shift.com (play)
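
For readers following the thread, the rule under discussion can be sketched in a few lines of plain Java. This is only an illustration and not the DefaultIndexingChain code itself: the class `OffsetCheckSketch`, the `Token` holder and the method names are hypothetical, and the starting value of 0 for `lastStartOffset` is an assumption. The only things taken from the thread are the condition `startOffset < lastStartOffset`, Mike's note that a token's endOffset must not be lower than its startOffset, and the "hubble space telescope" offsets from Roman's token stream.

```java
import java.util.ArrayList;
import java.util.List;

// Standalone sketch: it does NOT use Lucene classes, it only mirrors the rule quoted in the thread.
class OffsetCheckSketch {

    /** Hypothetical stand-in for a token's term text plus its start/end offsets. */
    static final class Token {
        final String term;
        final int startOffset;
        final int endOffset;

        Token(String term, int startOffset, int endOffset) {
            this.term = term;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    /**
     * Returns the index of the first token that trips
     * "startOffset < lastStartOffset || endOffset < startOffset", or -1 if none does.
     */
    static int firstViolation(List<Token> tokens) {
        int lastStartOffset = 0; // assumed initial value
        for (int i = 0; i < tokens.size(); i++) {
            Token t = tokens.get(i);
            if (t.startOffset < lastStartOffset || t.endOffset < t.startOffset) {
                return i;
            }
            lastStartOffset = t.startOffset;
        }
        return -1;
    }

    /** Multi-token synonym emitted together with the first token of its group. */
    static List<Token> synonymEmittedFirst() {
        List<Token> tokens = new ArrayList<>();
        tokens.add(new Token("hubble", 38, 44));
        tokens.add(new Token("syn::hubble space telescope", 38, 60));
        tokens.add(new Token("space", 45, 50));
        tokens.add(new Token("telescope", 51, 60));
        return tokens;
    }

    /** Same tokens, but the multi-token synonym is emitted after its last component. */
    static List<Token> synonymEmittedLast() {
        List<Token> tokens = new ArrayList<>();
        tokens.add(new Token("hubble", 38, 44));
        tokens.add(new Token("space", 45, 50));
        tokens.add(new Token("telescope", 51, 60));
        tokens.add(new Token("syn::hubble space telescope", 38, 60));
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(firstViolation(synonymEmittedFirst())); // prints -1
        System.out.println(firstViolation(synonymEmittedLast()));  // prints 3
    }
}
```

Emitted together with the first token of the group, the spanning synonym passes the check; emitted after its component tokens, its start offset (38) is lower than the last seen start offset (51) and it is rejected, which is the situation Roman describes.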