Re: Correct usage of synonyms with Japanese

Michael McCandless Tue, 18 May 2021 07:07:59 -0700

Hi Geoffrey,

[Disclaimer: Geoffrey and I both work at Amazon on customer-facing product
search]

We absolutely must get SynonymGraphFilter consuming input graphs!  This is
just a (serious) bug in it!  But it's just software, let's fix it :)  That
is clearly the right fix; it is just rather fun and challenging, but it is
doable.  Could you open an issue?  I thought we had one for this but cannot
find it now.

I think you are using the Kuromoji Japanese tokenizer?  It produces
nice-looking graphs right from the get-go (at the tokenizer), with compound
words also properly decompounded so both options are indexed/searched.

History: we created SynonymGraphFilter, along with other important
QueryParser (e.g. http://issues.apache.org/jira/browse/LUCENE-7603) and
Query improvements, to finally get multi-term synonyms working correctly
in Lucene.  With the old SynonymFilter, positional queries involving
multi-term synonyms would have both false positive and false negative hits
... I tried to explain the messy situation here:
http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html

And now, with SynonymGraphFilter used only at search time, and with its
tokens consumed by a QueryParser that knows how to turn graphs into correct
positional queries, those bugs are finally fixed -- multi-term synonyms
work correctly.

When used during indexing, SynonymGraphFilter must eventually be followed
by FlattenGraphFilter, because Lucene's index does not store the posLength
attribute of each token.  I.e., the graph is lost anyways during indexing,
so FlattenGraphFilter tries to flatten the graph in the most
information-preserving way (but still loses information, resulting in false
positive/negative hits for positional queries).
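
To make the posLength point concrete, here is a minimal, self-contained
sketch.  It does not use Lucene at all: the Token class, the paths() helper
and the "dns" / "domain name service" example are hypothetical stand-ins
chosen for illustration, and the code does not model what
FlattenGraphFilter actually does.  It only shows what happens to the set of
token paths when every posLength is naively treated as 1, which is roughly
what an index that cannot store posLength sees.

```java
import java.util.ArrayList;
import java.util.List;

class PosLengthSketch {
    // Hypothetical stand-in for a token: term, position increment, position length.
    static final class Token {
        final String term;
        final int posInc;
        final int posLength;

        Token(String term, int posInc, int posLength) {
            this.term = term;
            this.posInc = posInc;
            this.posLength = posLength;
        }
    }

    // Assumed example: "dns is down" with the synonym dns -> domain name service.
    // "dns" spans three positions, so it sits side by side with the three-token path.
    static List<Token> example() {
        List<Token> tokens = new ArrayList<>();
        tokens.add(new Token("dns", 1, 3));
        tokens.add(new Token("domain", 0, 1));
        tokens.add(new Token("name", 1, 1));
        tokens.add(new Token("service", 1, 1));
        tokens.add(new Token("is", 1, 1));
        tokens.add(new Token("down", 1, 1));
        return tokens;
    }

    // Enumerate every path from the first position to the last one.
    // With keepPosLength == false each token is treated as spanning one position.
    static List<String> paths(List<Token> tokens, boolean keepPosLength) {
        int n = tokens.size();
        int[] start = new int[n];
        int[] end = new int[n];
        int pos = -1;
        int last = 0;
        for (int i = 0; i < n; i++) {
            Token t = tokens.get(i);
            pos += t.posInc;
            start[i] = pos;
            end[i] = pos + (keepPosLength ? t.posLength : 1);
            last = Math.max(last, end[i]);
        }
        List<String> out = new ArrayList<>();
        walk(tokens, start, end, 0, last, "", out);
        return out;
    }

    // Depth-first walk: follow every token that starts where the previous one ended.
    private static void walk(List<Token> tokens, int[] start, int[] end,
                             int at, int last, String prefix, List<String> out) {
        if (at == last) {
            out.add(prefix.trim());
            return;
        }
        for (int i = 0; i < tokens.size(); i++) {
            if (start[i] == at) {
                walk(tokens, start, end, end[i], last, prefix + " " + tokens.get(i).term, out);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println("with posLength:    " + paths(example(), true));
        System.out.println("without posLength: " + paths(example(), false));
    }
}
```

With posLength kept, the two paths are "dns is down" and "domain name
service is down".  With posLength dropped, "dns is down" disappears (a
false negative for that phrase) and "dns name service is down" appears (a
false positive).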

Anyways, until we fix this, feeding a graph to SynonymGraphFilter will
indeed mess up its output in weird ways.

This problem has come up several times recently, e.g.
https://issues.apache.org/jira/browse/LUCENE-9173 and
https://issues.apache.org/jira/browse/LUCENE-9123.  There is also the more
revolutionary https://issues.apache.org/jira/browse/LUCENE-5012 but that is
too ambitious for this "small bug", I think.

SynonymGraphFilter also struggles with holes, since they might break the
token graph into two: https://issues.apache.org/jira/browse/LUCENE-8985

For short term workarounds, some possible ideas:

  * I think Kuromoji has an option to NOT produce the compounds/graph
output?  It has an indexing and searching mode.  That might be one
workaround, if you could then move the compounding into
SynonymGraphFilter?  I'm not sure that is possible, in general, since
Kuromoji is using more powerful information (dictionary) to make its graph
choices.

  * Use FlattenGraphFilter immediately before SynonymGraphFilter, and then
again at the end of your analysis chain.  This loses information, since all
tokens are "squashed" onto one another, and we can no longer tell which
sequence of tokens corresponded to which compound word.  It might also mean
some synonyms fail to apply when they should have.

  * Go back to SynonymFilter at indexing time.  It will also not fully
handle an input graph correctly, and will necessarily miss some synonyms
that should've applied, but it may produce a more "reasonable" bad output,
and then you shouldn't need FlattenGraphFilter at all.  But test this
carefully to understand what it is doing!

But let's fix the issue for real!

Mike McCandless

http://blog.mikemccandless.com

On Tue, May 18, 2021 at 6:17 AM Geoffrey Lawson <[email protected]>
wrote:

> Hello,
>
> I'm working on a project that involves search in Japanese and uses
> synonyms. The Japanese tokenizer creates an analysis graph, but the
> SynonymGraphFilter states it cannot take a graph as input. After a few
> tests I've seen it can create some unusual outputs if given a graph as
> input. The SynonymFilter is marked deprecated, and has documentation
> pointing out it doesn't handle multiple synonym paths correctly.
>
> My question is what is the 'correct' way to handle synonyms with Japanese
> in Lucene? should the graph be flattened before the SynonymGraphFilter,
> then flattened again after? This seems extra lossy. Is the correct answer
> to make SynonymGraphFilter accept graphs as inputs? is there another option
> that I'm missing?
>
> thanks,
> Geoff
>
