Thanks for the background, Mike!

I am using the Kuromoji tokenizer. Using discardCompoundToken is a good
point; I had not considered that.

To track the fix, I've created a Jira ticket:
https://issues.apache.org/jira/browse/LUCENE-9966

geoff

On Tue, May 18, 2021 at 11:07 PM Michael McCandless <
luc...@mikemccandless.com> wrote:

> Hi Geoffrey,
>
> [Disclaimer: Geoffrey and I both work at Amazon on customer-facing product
> search]
>
> We absolutely must get SynonymGraphFilter consuming input graphs!  Its
> failure to do so is just a (serious) bug!  But it's just software, let's
> fix it :)  That is clearly the right fix; it is just rather fun and
> challenging, but it is doable.  Could you open an issue?  I thought we had
> one for this but cannot find it now.
>
> I think you are using the Kuromoji Japanese tokenizer?  It produces
> nice-looking graphs right from the get-go (tokenizer), with compound words
> also properly decompounded so both options are indexed/searched.
>
> History: we created SynonymGraphFilter, along with other important
> QueryParser (e.g. http://issues.apache.org/jira/browse/LUCENE-7603) and
> Query improvements, to get multi-term synonyms working correctly, finally
> in Lucene.  With the old SynonymFilter, positional queries involving
> multi-term synonyms would have both false positive and false negative hits
> ... I tried to explain the messy situation here:
> http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
>
> And, finally, with SynonymGraphFilter, used only at search time, and with
> tokens consumed by a QueryParser that knows how to turn graphs into correct
> positional queries, those bugs are finally fixed -- multi-term synonyms
> work correctly.
>
> When used during indexing, SynonymGraphFilter must eventually be followed
> by FlattenGraphFilter, because Lucene's index does not store the posLength
> attribute of each token.  I.e., the graph is lost anyways during indexing,
> so FlattenGraphFilter tries to flatten the graph in the most
> information-preserving way (but still loses information, resulting in false
> positive/negative hits for positional queries).
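To make the posLength point above concrete, here is a minimal, self-contained Java sketch. It does not use Lucene at all: the `Token` class, the helper names, and the romanized terms (stand-ins for a Japanese compound and its three parts) are assumptions made up purely for illustration. It enumerates the paths through a small token graph, first with posLength intact and then with every posLength forced to 1, which roughly models what remains once that attribute is not stored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of a token graph. These classes are illustrative only, not Lucene's.
final class TokenGraphSketch {

    /** One token: its text, position increment, and position length. */
    static final class Token {
        final String term;
        final int posInc;
        final int posLength;

        Token(String term, int posInc, int posLength) {
            this.term = term;
            this.posInc = posInc;
            this.posLength = posLength;
        }
    }

    /** Hypothetical tokenizer output: a compound spanning 3 positions plus its 3 parts. */
    static List<Token> example() {
        return Arrays.asList(
            new Token("kansaikokusaikuukou", 1, 3), // whole compound
            new Token("kansai", 0, 1),              // same start position as the compound
            new Token("kokusai", 1, 1),
            new Token("kuukou", 1, 1));
    }

    /** Copies the tokens with every posLength forced to 1. */
    static List<Token> dropPosLength(List<Token> tokens) {
        List<Token> out = new ArrayList<>();
        for (Token t : tokens) {
            out.add(new Token(t.term, t.posInc, 1));
        }
        return out;
    }

    /** Enumerates every path from the first node to the last node as space-joined terms. */
    static List<String> paths(List<Token> tokens) {
        int n = tokens.size();
        int[] from = new int[n];
        int[] to = new int[n];
        int pos = -1; // positions start at -1 so the first posInc of 1 lands on 0
        int lastNode = 0;
        for (int i = 0; i < n; i++) {
            Token t = tokens.get(i);
            pos += t.posInc;
            from[i] = pos;
            to[i] = pos + t.posLength;
            lastNode = Math.max(lastNode, to[i]);
        }
        List<String> result = new ArrayList<>();
        walk(tokens, from, to, 0, lastNode, "", result);
        return result;
    }

    private static void walk(List<Token> tokens, int[] from, int[] to,
                             int node, int lastNode, String prefix, List<String> result) {
        if (node == lastNode) {
            result.add(prefix);
            return;
        }
        for (int i = 0; i < tokens.size(); i++) {
            if (from[i] == node) {
                String term = tokens.get(i).term;
                String next = prefix.isEmpty() ? term : prefix + " " + term;
                walk(tokens, from, to, to[i], lastNode, next, result);
            }
        }
    }

    public static void main(String[] args) {
        List<Token> graph = example();
        // prints [kansaikokusaikuukou, kansai kokusai kuukou]
        System.out.println(paths(graph));
        // prints [kansaikokusaikuukou kokusai kuukou, kansai kokusai kuukou]
        System.out.println(paths(dropPosLength(graph)));
    }
}
```

With posLength intact, the compound and its three parts are two separate paths. Once every posLength is 1, the compound appears to be directly followed by the second part, so a phrase query for that pair would match even though the sequence never occurred in the text. This is one way the false positives described above can arise. Saved as `TokenGraphSketch.java`, the sketch should run with `java TokenGraphSketch.java` on Java 11 or later.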
>
> Anyways, until we fix this, feeding a graph to SynonymGraphFilter will
> indeed mess up its output in weird ways.
>
> This problem has come up several times recently, e.g.
> https://issues.apache.org/jira/browse/LUCENE-9173 and
> https://issues.apache.org/jira/browse/LUCENE-9123.  There is also the
> more revolutionary https://issues.apache.org/jira/browse/LUCENE-5012 but
> that is too ambitious for this "small bug", I think.
>
> SynonymGraphFilter also struggles with holes, since they might break the
> token graph into two: https://issues.apache.org/jira/browse/LUCENE-8985
>
> For short term workarounds, some possible ideas:
>
>   * I think Kuromoji has an option to NOT produce the compounds/graph
> output?  It has an indexing and searching mode.  That might be one
> workaround, if you could then move the compounding into
> SynonymGraphFilter?  I'm not sure that is possible, in general, since
> Kuromoji is using more powerful information (dictionary) to make its graph
> choices.
>
>   * Use FlattenGraphFilter immediately before SynonymGraphFilter, and then
> again at the end of your analysis chain.  This loses information, since all
> tokens are "squashed" onto one another, and we could no longer tell which
> sequence of tokens corresponded to which compound word, and it might mean
> some synonyms fail to apply when they should have.
>
>   * Go back to SynonymFilter at indexing time.  It will also not fully
> handle an input graph correctly, and will necessarily miss some synonyms
> that should've applied, but it may produce a more "reasonable" bad output,
> and then you shouldn't need FlattenGraphFilter at all.  But test this
> carefully to understand what it is doing!
>
> But let's fix the issue for real!
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Tue, May 18, 2021 at 6:17 AM Geoffrey Lawson <
> geoffrey.laws...@gmail.com> wrote:
>
>> Hello,
>>
>> I'm working on a project that involves search in Japanese and uses
>> synonyms. The Japanese tokenizer creates an analysis graph, but the
>> SynonymGraphFilter states it cannot take a graph as input. After a few
>> tests I've seen it can create some unusual outputs if given a graph as
>> input. The SynonymFilter is marked deprecated, and has documentation
>> pointing out it doesn't handle multiple synonym paths correctly.
>>
>> My question is: what is the 'correct' way to handle synonyms with Japanese
>> in Lucene? Should the graph be flattened before the SynonymGraphFilter,
>> then flattened again after? This seems extra lossy. Is the correct answer
>> to make SynonymGraphFilter accept graphs as inputs? Is there another
>> option that I'm missing?
>>
>> thanks,
>> Geoff
>>
>
