Hi Geoffrey, [Disclaimer: Geoffrey and I both work at Amazon on customer-facing product search]
We absolutely must get SynonymGraphFilter consuming input graphs! This is just a (serious) bug in it! But it's just software, let's fix it :) That is clearly the right fix, it is just rather fun and challenging. But it is doable.

Could you open an issue? I thought we had one for this but cannot find it now.

I think you are using Kuromoji Japanese tokenizer? Which produces nice looking graphs right from the get-go (tokenizer), with compound words also properly decompounded so both options are indexed/searched.

History: we created SynonymGraphFilter, along with other important QueryParser (e.g. http://issues.apache.org/jira/browse/LUCENE-7603) and Query improvements, to get multi-term synonyms working correctly, finally in Lucene. With the old SynonymFilter, positional queries involving multi-term synonyms would have both false positive and false negative hits ... I tried to explain the messy situation here: http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html

And, finally, with SynonymGraphFilter, used only at search time, and with tokens consumed by a QueryParser that knows how to turn graphs into correct positional queries, those bugs are finally fixed -- multi-term synonyms work correctly.

When used during indexing, SynonymGraphFilter must eventually be followed by FlattenGraphFilter, because Lucene's index does not store the posLength attribute of each token. I.e., the graph is lost anyways during indexing, so FlattenGraphFilter tries to flatten the graph in the most information-preserving way (but still loses information, resulting in false positive/negative hits for positional queries).

Anyways, until we fix this, feeding a graph to SynonymGraphFilter will indeed mess up its output in weird ways. This problem has come up several times recently, e.g. https://issues.apache.org/jira/browse/LUCENE-9173 and https://issues.apache.org/jira/browse/LUCENE-9123.
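
To make the posLength point above concrete, here is a toy model in plain Python. It is not Lucene code: the `Token` tuple, the two matcher functions, and the "wi fi" -> "wifi" synonym are all made up for illustration, and the matcher only approximates an exact (slop 0) phrase query. It compares matching against the full token graph with matching after posLength has been dropped, which is roughly the information an index keeps.

```python
from collections import namedtuple

# Hypothetical toy token: `pos` is the start position, `pos_len` is how many
# positions the token spans (a stand-in for Lucene's posLength attribute).
Token = namedtuple("Token", ["term", "pos", "pos_len"])

# "fast wi fi network" with an assumed synonym "wi fi" -> "wifi" applied as a
# graph: "wifi" starts where "wi" starts and spans two positions.
GRAPH = [
    Token("fast", 0, 1),
    Token("wifi", 1, 2),
    Token("wi", 1, 1),
    Token("fi", 2, 1),
    Token("network", 3, 1),
]


def matches_graph(tokens, phrase):
    """Exact phrase match that honors pos_len: each term must start at a
    position where the previous term ended."""
    ends = None  # positions where the phrase matched so far may end
    for term in phrase:
        next_ends = {
            t.pos + t.pos_len
            for t in tokens
            if t.term == term and (ends is None or t.pos in ends)
        }
        if not next_ends:
            return False
        ends = next_ends
    return True


def matches_positions_only(tokens, phrase):
    """The same match after pos_len is dropped: every token is treated as
    spanning exactly one position."""
    flat = [Token(t.term, t.pos, 1) for t in tokens]
    return matches_graph(flat, phrase)


if __name__ == "__main__":
    # False negative: the graph matches, the positions-only view does not.
    print(matches_graph(GRAPH, ["wifi", "network"]))           # True
    print(matches_positions_only(GRAPH, ["wifi", "network"]))  # False
    # False positive: the graph rejects it, the positions-only view accepts it.
    print(matches_graph(GRAPH, ["wifi", "fi"]))                # False
    print(matches_positions_only(GRAPH, ["wifi", "fi"]))       # True
```

In this model the phrase "wifi network" is lost and the nonsense phrase "wifi fi" is gained as soon as the span of "wifi" is forgotten, which is the kind of false negative/positive described above. What FlattenGraphFilter actually emits for a given graph should be checked against real Lucene output.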

There is also the more revolutionary https://issues.apache.org/jira/browse/LUCENE-5012 but that is too ambitious for this "small bug", I think.

SynonymGraphFilter also struggles with holes, since they might break the token graph into two: https://issues.apache.org/jira/browse/LUCENE-8985

For short term workarounds, some possible ideas:

  * I think Kuromoji has an option to NOT produce the compounds/graph output? It has an indexing and searching mode. That might be one workaround, if you could then move the compounding into SynonymGraphFilter? I'm not sure that is possible, in general, since Kuromoji is using more powerful information (dictionary) to make its graph choices.

  * Use FlattenGraphFilter immediately before SynonymGraphFilter, and then again at the end of your analysis chain. This loses information, since all tokens are "squashed" onto one another, and we could no longer tell which sequence of tokens corresponded to which compound word, and it might mean some synonyms fail to apply when they should have.

  * Go back to SynonymFilter at indexing time. It will also not fully handle an input graph correctly, and will necessarily miss some synonyms that should've applied, but it may produce a more "reasonable" bad output, and then you shouldn't need FlattenGraphFilter at all. But test this carefully to understand what it is doing!

But let's fix the issue for real!

Mike McCandless

http://blog.mikemccandless.com

On Tue, May 18, 2021 at 6:17 AM Geoffrey Lawson <geoffrey.laws...@gmail.com> wrote:

> Hello,
>
> I'm working on a project that involves search in Japanese and uses
> synonyms. The Japanese tokenizer creates an analysis graph, but the
> SynonymGraphFilter states it cannot take a graph as input. After a few
> tests I've seen it can create some unusual outputs if given a graph as
> input. The SynonymFilter is marked deprecated, and has documentation
> pointing out it doesn't handle multiple synonym paths correctly.
>
> My question is what is the 'correct' way to handle synonyms with Japanese
> in Lucene? should the graph be flattened before the SynonymGraphFilter,
> then flattened again after? This seems extra lossy. Is the correct answer
> to make SynonymGraphFilter accept graphs as inputs? is there another option
> that I'm missing?
>
> thanks,
> Geoff
>