Thanks for the background, Mike! I am using the Kuromoji tokenizer. Using discardCompoundToken is a good point; I had not considered that.
For fixing the issue I've created a Jira ticket for it here:
https://issues.apache.org/jira/browse/LUCENE-9966.

geoff

On Tue, May 18, 2021 at 11:07 PM Michael McCandless <luc...@mikemccandless.com> wrote:

> Hi Geoffrey,
>
> [Disclaimer: Geoffrey and I both work at Amazon on customer-facing product
> search]
>
> We absolutely must get SynonymGraphFilter consuming input graphs! This is
> just a (serious) bug in it! But it's just software, let's fix it :) That
> is clearly the right fix, it is just rather fun and challenging. But it is
> doable. Could you open an issue? I thought we had one for this but cannot
> find it now.
>
> I think you are using Kuromoji Japanese tokenizer? Which produces nice
> looking graphs right from the get-go (tokenizer), with compound words also
> properly decompounded so both options are indexed/searched.
>
> History: we created SynonymGraphFilter, along with other important
> QueryParser (e.g. http://issues.apache.org/jira/browse/LUCENE-7603) and
> Query improvements, to get multi-term synonyms working correctly, finally
> in Lucene. With the old SynonymFilter, positional queries involving
> multi-term synonyms would have both false positive and false negative hits
> ... I tried to explain the messy situation here:
> http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
>
> And, finally, with SynonymGraphFilter, used only at search time, and with
> tokens consumed by a QueryParser that knows how to turn graphs into correct
> positional queries, those bugs are finally fixed -- multi-term synonyms
> work correctly.
>
> When used during indexing, SynonymGraphFilter must eventually be followed
> by FlattenGraphFilter, because Lucene's index does not store the posLength
> attribute of each token.
> I.e., the graph is lost anyways during indexing,
> so FlattenGraphFilter tries to flatten the graph in the most
> information-preserving way (but still loses information, resulting in false
> positive/negative hits for positional queries).
>
> Anyways, until we fix this, feeding a graph to SynonymGraphFilter will
> indeed mess up its output in weird ways.
>
> This problem has come up several times recently, e.g.
> https://issues.apache.org/jira/browse/LUCENE-9173 and
> https://issues.apache.org/jira/browse/LUCENE-9123. There is also the
> more revolutionary https://issues.apache.org/jira/browse/LUCENE-5012 but
> that is too ambitious for this "small bug", I think.
>
> SynonymGraphFilter also struggles with holes, since they might break the
> token graph into two: https://issues.apache.org/jira/browse/LUCENE-8985
>
> For short term workarounds, some possible ideas:
>
> * I think Kuromoji has an option to NOT produce the compounds/graph
> output? It has an indexing and searching mode. That might be one
> workaround, if maybe you could maybe then move the compounding into
> SynonymGraphFilter? I'm not sure that is possible, in general, since
> Kuromoji is using more powerful information (dictionary) to make its graph
> choices.
>
> * Use FlattenGraphFilter immediately before SynonymGraphFilter, and then
> again at the end of your analysis chain. This loses information, since all
> tokens are "squashed" onto one another, and we could no longer tell which
> sequence of tokens corresponded to which compound word, and it might mean
> some synonyms fail to apply when they should have.
>
> * Go back to SynonymFilter at indexing time. It will also not fully
> handle an input graph correctly, and will necessarily miss some synonyms
> that should've applied, but it may produce a more "reasonable" bad output,
> and then you shouldn't need FlattenGraphFilter at all. But test this
> carefully to understand what it is doing!
>
> But let's fix the issue for real!
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Tue, May 18, 2021 at 6:17 AM Geoffrey Lawson <geoffrey.laws...@gmail.com>
> wrote:
>
>> Hello,
>>
>> I'm working on a project that involves search in Japanese and uses
>> synonyms. The Japanese tokenizer creates an analysis graph, but the
>> SynonymGraphFilter states it cannot take a graph as input. After a few
>> tests I've seen it can create some unusual outputs if given a graph as
>> input. The SynonymFilter is marked deprecated, and has documentation
>> pointing out it doesn't handle multiple synonym paths correctly.
>>
>> My question is what is the 'correct' way to handle synonyms with Japanese
>> in Lucene? should the graph be flattened before the SynonymGraphFilter,
>> then flattened again after? This seems extra lossy. Is the correct answer
>> to make SynonymGraphFilter accept graphs as inputs? is there another option
>> that I'm missing?
>>
>> thanks,
>> Geoff
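
A note on the posLength point in Mike's message. The sketch below is a toy Java model, not Lucene code, of why losing posLength can produce both false positive and false negative positional matches. Everything in it is an assumption made up for illustration: the names TokenGraphSketch, Tok, paths, dropPosLength and example, and the "dns" / "domain name service" token graph. It simply forces every token to span one position, which is cruder than what FlattenGraphFilter actually does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of a token graph. Hypothetical names; this is NOT Lucene's API. */
final class TokenGraphSketch {

    /** A token arc that leaves position `start` and arrives at `start + posLength`. */
    static final class Tok {
        final String term;
        final int start;
        final int posLength;

        Tok(String term, int start, int posLength) {
            this.term = term;
            this.start = start;
            this.posLength = posLength;
        }
    }

    /** Assumed example: "dns" spans the same three positions as "domain name service". */
    static List<Tok> example() {
        return Arrays.asList(
            new Tok("dns", 0, 3),
            new Tok("domain", 0, 1),
            new Tok("name", 1, 1),
            new Tok("service", 2, 1),
            new Tok("lookup", 3, 1));
    }

    /** Enumerates every term sequence that walks the graph from position 0 to `end`. */
    static List<String> paths(List<Tok> toks, int end) {
        List<String> out = new ArrayList<>();
        walk(toks, 0, end, "", out);
        return out;
    }

    private static void walk(List<Tok> toks, int pos, int end, String prefix, List<String> out) {
        if (pos == end) {
            out.add(prefix.trim());
            return;
        }
        if (pos > end) {
            return; // overshot the end position; not a valid path
        }
        for (Tok t : toks) {
            if (t.start == pos) {
                walk(toks, pos + t.posLength, end, prefix + " " + t.term, out);
            }
        }
    }

    /** Models an index that keeps positions but not posLength: every token spans one position. */
    static List<Tok> dropPosLength(List<Tok> toks) {
        List<Tok> out = new ArrayList<>();
        for (Tok t : toks) {
            out.add(new Tok(t.term, t.start, 1));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Tok> graph = example();
        System.out.println(paths(graph, 4));                // paths with posLength intact
        System.out.println(paths(dropPosLength(graph), 4)); // paths once posLength is gone
    }
}
```

With posLength intact, the two paths are "dns lookup" and "domain name service lookup". Once posLength is dropped, "dns lookup" is no longer a path (a false negative for that phrase) and "dns name service lookup" becomes one (a false positive).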