When zero offsets are not bad - a.k.a. multi-token synonyms yet again

Roman Chyla Wed, 05 Aug 2020 15:42:04 -0700

Hello devs,

I wanted to create an issue but the helpful message in red letters
reminded me to ask first.


While porting from Lucene 6.x to 7.x, I'm struggling with a change that
was introduced in LUCENE-7626
(https://issues.apache.org/jira/browse/LUCENE-7626)

It is believed that zero-offset tokens are bad, bad. Mike McCandless
made the change, which made me automatically doubt myself. I must be
wrong, hell, I was living in sin the past 5 years!

Sadly, we have been indexing and searching large volumes of data
without any corruption in the index whatsoever, but also without this
new change:

https://github.com/apache/lucene-solr/commit/64b86331c29d074fa7b257d65d3fda3b662bf96a#diff-cbdbb154cb6f3553edff2fcdb914a0c2L774

With that change, our multi-token synonyms house of cards is falling.

Mike has this wonderful blogpost explaining troubles with multi-token synonyms:
http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html

Recommended way to index multi-token synonyms appears to be this:
https://stackoverflow.com/questions/19927537/multi-word-synonyms-in-solr

BUT, but! We don't want to place a multi-token synonym into the same
position as the other words. We want to preserve their positions! We
want to preserve information about offsets!

Here is an example:

* THE HUBBLE constant: a summary of the HUBBLE SPACE TELESCOPE program

This is how it gets indexed

[(0, []),
(1, ['acr::hubble']),
(2, ['constant']),
(3, ['summary']),
(4, []),
(5, ['acr::hubble', 'syn::hst', 'syn::hubble space telescope', 'hubble']),
(6, ['acr::space', 'space']),
(7, ['acr::telescope', 'telescope']),
(8, ['program'])]

Notice position 5: the multi-token synonym `syn::hubble space
telescope` sits on the first token that started the group (emitted by
Lucene's synonym filter). `hst` is another synonym; we also index the
word 'hubble' there.
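
A minimal sketch of how stacked tokens like these end up on one position,
assuming the usual position-increment model (an increment of 0 puts a token
on the same position as the previous one). The increments below are not
taken from the real analyzer chain; they are assumed values chosen only so
that the result matches the dump above, and `positions` is a hypothetical
helper name.

```python
from collections import defaultdict

# (term, position_increment) pairs. The increments are assumptions chosen
# to reproduce the dump in this message, not output of the real analyzer.
TOKENS = [
    ("acr::hubble", 2),                  # increment 2 leaves position 0 empty
    ("constant", 1),
    ("summary", 1),
    ("acr::hubble", 2),                  # increment 2 leaves position 4 empty
    ("syn::hst", 0),                     # increment 0: same position as before
    ("syn::hubble space telescope", 0),
    ("hubble", 0),
    ("acr::space", 1),
    ("space", 0),
    ("acr::telescope", 1),
    ("telescope", 0),
    ("program", 1),
]


def positions(tokens):
    """Group terms by absolute position, counting from -1 like an indexer would."""
    index = defaultdict(list)
    pos = -1
    for term, increment in tokens:
        pos += increment
        index[pos].append(term)
    return dict(index)


if __name__ == "__main__":
    for pos, terms in sorted(positions(TOKENS).items()):
        print(pos, terms)
```

The whole synonym group lands on position 5, while 'space' and 'telescope'
keep their own positions 6 and 7.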

If you search for the phrase "HST program", it will be found
because our search parser will search for ("HST ? ? program" | "Hubble
Space Telescope program")

The parser finds that expansion simply by looking at the synonyms:
HST -> Hubble Space Telescope

And because of those funny 'syn::' prefixes, we don't suffer from the
other problem that Mike described -- "hst space" phrase search will
NOT find this paper (and that is a correct behaviour)
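
The expansion and the two phrase searches can be sketched as a toy model
over the positions shown earlier. This is not the actual search parser:
`expand`, `matches` and `search` are hypothetical names, `None` stands for
the `?` gap, and the rule "prefixed synonym plus gaps, or the fully spelled
out phrase" is an assumption based on the description above.

```python
# Positions copied from the dump in this message (empty positions omitted).
INDEX = {
    1: {"acr::hubble"},
    2: {"constant"},
    3: {"summary"},
    5: {"acr::hubble", "syn::hst", "syn::hubble space telescope", "hubble"},
    6: {"acr::space", "space"},
    7: {"acr::telescope", "telescope"},
    8: {"program"},
}

# Hypothetical synonym table: short form -> tokens of the long form.
SYNONYMS = {"hst": ["hubble", "space", "telescope"]}


def expand(phrase):
    """Return the alternative slot lists for a phrase; None is a '?' gap."""
    words = phrase.lower().split()
    for i, word in enumerate(words):
        if word in SYNONYMS:
            full = SYNONYMS[word]
            gaps = [None] * (len(full) - 1)
            return [
                words[:i] + ["syn::" + word] + gaps + words[i + 1:],
                words[:i] + full + words[i + 1:],
            ]
    return [words]


def matches(slots):
    """True if every non-gap slot is found at consecutive positions."""
    return any(
        all(term is None or term in INDEX.get(start + k, ())
            for k, term in enumerate(slots))
        for start in INDEX
    )


def search(phrase):
    return any(matches(slots) for slots in expand(phrase))


if __name__ == "__main__":
    print(expand("HST program"))
    print(search("HST program"), search("hst space"))
```

In this model "HST program" matches through `syn::hst` at position 5 and
'program' at position 8, while "hst space" fails in both alternatives
because position 8 holds 'program', not 'space'.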

But all of this is possible only because Lucene was indexing tokens
with offsets that can be lower than the last emitted token; for
example 'hubble space telescope' will have offset 21-45; and the next
emitted token "space" will have offset 28-33

And it just works (lucene 6.x)
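
For reference, a toy model of the kind of check the indexing chain applies
in 7.x. It is a sketch assuming that a token is rejected when its start
offset is lower than the previous token's start offset, or when its end
offset is lower than its own start offset; it is not the Lucene code
itself. `check_offsets` is a hypothetical name, and the second token order
below is made up for illustration.

```python
def check_offsets(tokens):
    """Raise ValueError on the first token whose offsets break the rule.

    Assumed rule: start offsets must not go backwards, and each token's
    end offset must not be lower than its start offset.
    """
    last_start = 0
    for term, start, end in tokens:
        if start < last_start or end < start:
            raise ValueError(
                f"bad offsets for {term!r}: startOffset={start}, "
                f"endOffset={end}, lastStartOffset={last_start}"
            )
        last_start = start


# Offsets quoted in this message, in the order described.
AS_DESCRIBED = [
    ("syn::hubble space telescope", 21, 45),
    ("space", 28, 33),
]

# Hypothetical order: the multi-token synonym is emitted after 'space'.
SYNONYM_LAST = [
    ("space", 28, 33),
    ("syn::hubble space telescope", 21, 45),
]

if __name__ == "__main__":
    check_offsets(AS_DESCRIBED)          # accepted by this model
    try:
        check_offsets(SYNONYM_LAST)
    except ValueError as err:
        print("rejected:", err)
```

Under this reading, a lower end offset (33 after 45) is accepted; what is
rejected is a start offset that moves backwards, so the order in which the
filter chain emits the tokens matters.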

Here is another proof with the appropriate verbiage ("crazy"):

https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/analysis/TestAdsabsTypeFulltextParsing.java#L618

Zero offsets have been working wonderfully for us so far. And I
actually cannot imagine how it can work without them - i.e. without
the ability to emit a token stream with offsets that are lower than
the last seen token.

I haven't tried SynonymFlatten filter, but because of this line in the
DefaultIndexingChain - I'm convinced the flatten symbol is not going
to do what we need (as seen in the example above)

https://github.com/apache/lucene-solr/blame/master/lucene/core/src/java/org/apache/lucene/index/DefaultIndexingChain.java#L915

What would you say? Is it a bug, or is it not a bug but just some
special use case? If it is a special use case, what do we need to do?
Plug in our own indexing chain?

Thanks!

  -roman
