Re: When zero offsets are not bad - a.k.a. multi-token synonyms yet again

Michael McCandless Thu, 06 Aug 2020 03:18:05 -0700

Hi Roman,

Hmm, this is all very tricky!


First off, why do you call this "zero offsets"?  Isn't it "backwards
offsets" that your analysis chain is trying to produce?

Second, in your first example, if you output the tokens in the right order,
they would not violate the "offsets do not go backwards" check in
IndexWriter?  I thought IndexWriter is just checking that the startOffset
for a token is not lower than the previous token's startOffset?  (And that
the token's endOffset is not lower than its startOffset).

So I am confused why your first example is tripping up on IW's offset
checks.  Could you maybe redo the example, listing a single token per line
with the start/end offsets each one is producing?
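
For reference, here is a minimal sketch of the check as described above. It is
plain Python rather than Lucene's Java, it models only the two conditions named
in this message, and it is not IndexWriter's actual code. `Token` and
`check_offsets` are hypothetical names introduced for illustration. The two
sample tokens use the offsets quoted in the original message below (21-45 and
28-33).

```python
from typing import Iterable, NamedTuple


class Token(NamedTuple):
    term: str
    start: int  # startOffset
    end: int    # endOffset


def check_offsets(tokens: Iterable[Token]) -> None:
    """Raise ValueError if a token breaks either of the two rules."""
    last_start = 0
    for tok in tokens:
        # Rule 1: startOffset must not be lower than the previous startOffset.
        if tok.start < last_start:
            raise ValueError(
                f"{tok.term!r}: startOffset {tok.start} goes backwards "
                f"(previous startOffset was {last_start})"
            )
        # Rule 2: endOffset must not be lower than the token's own startOffset.
        if tok.end < tok.start:
            raise ValueError(
                f"{tok.term!r}: endOffset {tok.end} < startOffset {tok.start}"
            )
        last_start = tok.start


# Offsets as quoted in the original message, in the order given there.
in_order = [Token("hubble space telescope", 21, 45), Token("space", 28, 33)]
check_offsets(in_order)  # no exception: 28 is not lower than 21

# The same two tokens emitted the other way round do go backwards.
reversed_order = list(reversed(in_order))
try:
    check_offsets(reversed_order)
except ValueError as err:
    print(err)
```

Under this model the two tokens only fail when the longer synonym token is
emitted after "space", which is why a per-token listing in emission order
would help.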

Mike McCandless

http://blog.mikemccandless.com


On Wed, Aug 5, 2020 at 6:41 PM Roman Chyla <roman.ch...@gmail.com> wrote:

> Hello devs,
>
> I wanted to create an issue but the helpful message in red letters
> reminded me to ask first.
>
> While porting from Lucene 6.x to 7.x, I'm struggling with a change that
> was introduced in LUCENE-7626
> (https://issues.apache.org/jira/browse/LUCENE-7626).
>
> It is believed that zero-offset tokens are bad, bad - Mike McCandless
> made the change, which made me automatically doubt myself. I must be
> wrong; hell, I was living in sin the past 5 years!
>
> Sadly, we have been indexing and searching large volumes of data
> without any corruption in the index whatsoever, but also without this new
> change:
>
>
> https://github.com/apache/lucene-solr/commit/64b86331c29d074fa7b257d65d3fda3b662bf96a#diff-cbdbb154cb6f3553edff2fcdb914a0c2L774
>
> With that change, our multi-token synonyms house of cards is falling.
>
> Mike has this wonderful blogpost explaining troubles with multi-token
> synonyms:
>
> http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
>
> The recommended way to index multi-token synonyms appears to be this:
> https://stackoverflow.com/questions/19927537/multi-word-synonyms-in-solr
>
> BUT, but! We don't want to place a multi-token synonym into the same
> position as the other words. We want to preserve their positions! We
> want to preserve information about offsets!
>
> Here is an example:
>
> * THE HUBBLE constant: a summary of the HUBBLE SPACE TELESCOPE program
>
> This is how it gets indexed:
>
> [(0, []),
> (1, ['acr::hubble']),
> (2, ['constant']),
> (3, ['summary']),
> (4, []),
> (5, ['acr::hubble', 'syn::hst', 'syn::hubble space telescope', 'hubble']),
> (6, ['acr::space', 'space']),
> (7, ['acr::telescope', 'telescope']),
> (8, ['program'])]
>
> Notice position 5: the multi-token synonym token `syn::hubble space
> telescope` sits at the position of the first token that started the group
> (emitted by Lucene's synonym filter). `hst` is another synonym; we also
> index the word 'hubble' there.
>
> If you were to search for the phrase "HST program", it will be found
> because our search parser will search for ("HST ? ? program" | "Hubble
> Space Telescope program").
>
> It simply found that by looking at synonyms: HST -> Hubble Space Telescope
>
> And because of those funny 'syn::' prefixes, we don't suffer from the
> other problem that Mike described -- an "hst space" phrase search will
> NOT find this paper (and that is the correct behaviour).
>
> But all of this is possible only because Lucene was indexing tokens
> with offsets that can be lower than the last emitted token; for
> example, 'hubble space telescope' will have offset 21-45, and the next
> emitted token "space" will have offset 28-33.
>
> And it just works (Lucene 6.x).
>
> Here is another proof with the appropriate verbiage ("crazy"):
>
>
> https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/analysis/TestAdsabsTypeFulltextParsing.java#L618
>
> Zero offsets have been working wonderfully for us so far. And I
> actually cannot imagine how it can work without them - i.e. without
> the ability to emit a token stream with offsets that are lower than
> the last seen token.
>
> I haven't tried the SynonymFlatten filter, but because of this line in
> DefaultIndexingChain, I'm convinced the flatten symbol is not going
> to do what we need (as seen in the example above):
>
>
> https://github.com/apache/lucene-solr/blame/master/lucene/core/src/java/org/apache/lucene/index/DefaultIndexingChain.java#L915
>
> What would you say? Is it a bug, or is it not a bug but just some special
> use case? If it is a special use case, what do we need to do? Plug in
> our own indexing chain?
>
> Thanks!
>
>   -roman
>
>
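
The phrase lookup described in the quoted message can be sketched with a toy
positional index. This is a plain-Python model, not Lucene and not the actual
search parser: `INDEX`, `phrase_matches`, and the treatment of `?` as a
one-position wildcard are assumptions for illustration, and the query
alternatives are assumed to arrive already rewritten to the indexed forms
(`syn::hst`, and lowercased plain tokens).

```python
# Position -> terms, copied from the listing in the quoted message.
INDEX = {
    0: [],
    1: ["acr::hubble"],
    2: ["constant"],
    3: ["summary"],
    4: [],
    5: ["acr::hubble", "syn::hst", "syn::hubble space telescope", "hubble"],
    6: ["acr::space", "space"],
    7: ["acr::telescope", "telescope"],
    8: ["program"],
}


def phrase_matches(index, phrase):
    """Return the start positions at which every phrase term lines up.

    A "?" in the phrase matches any single position (assumed wildcard).
    """
    hits = []
    for start in index:
        if all(
            term == "?" or term in index.get(start + i, [])
            for i, term in enumerate(phrase)
        ):
            hits.append(start)
    return hits


# The two alternatives from ("HST ? ? program" | "Hubble Space Telescope program").
padded = ["syn::hst", "?", "?", "program"]
spelled_out = ["hubble", "space", "telescope", "program"]

print(phrase_matches(INDEX, padded))
print(phrase_matches(INDEX, spelled_out))
```

In this model both alternatives line up only because the synonym token keeps
the position of the first word of the group and the two `?` slots cover the
positions of "space" and "telescope".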
