[ 
https://issues.apache.org/jira/browse/LUCENE-6664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15668108#comment-15668108
 ] 

Michael McCandless commented on LUCENE-6664:
--------------------------------------------

I'm re-opening this issue: I think my original patch here is a good way to move 
forward.  It is a simple, backwards-compatible way for token streams to 
naturally produce graphs, and it empowers token filters to create new positions.

Existing token streams that produce posInc=0 or posInc=1 and posLength=1 
tokens work the way they do today with this change, producing 
"sausage" graphs.

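To make the posInc/posLength arithmetic concrete, here is a minimal, 
Lucene-free sketch of how a token sequence maps onto graph arcs.  The 
{{Token}} class and the example tokens are hypothetical stand-ins for the 
values carried by the position increment and position length attributes; this 
is not the patch's code, and the real filters may emit tokens in a different 
order.

```java
import java.util.ArrayList;
import java.util.List;

final class TokenGraphSketch {
    /** Hypothetical stand-in for a token's term plus its posInc and posLength values. */
    static final class Token {
        final String term;
        final int posInc;
        final int posLength;

        Token(String term, int posInc, int posLength) {
            this.term = term;
            this.posInc = posInc;
            this.posLength = posLength;
        }
    }

    /** Renders each token as an arc "from-term->to" between graph nodes. */
    static List<String> toArcs(List<Token> tokens) {
        List<String> arcs = new ArrayList<>();
        int pos = -1; // start at -1 so the first posInc=1 token leaves node 0
        for (Token t : tokens) {
            pos += t.posInc;            // posInc=0 stacks the token on the previous node
            int to = pos + t.posLength; // posLength is how many positions the arc spans
            arcs.add(pos + "-" + t.term + "->" + to);
        }
        return arcs;
    }

    public static void main(String[] args) {
        // posLength=1 everywhere: every arc spans one position, i.e. a "sausage".
        List<Token> sausage = List.of(
            new Token("fast", 1, 1),
            new Token("quick", 0, 1),
            new Token("wifi", 1, 1));
        System.out.println(toArcs(sausage));

        // A posLength=4 token spans the same nodes as a four-token side path.
        List<Token> graph = List.of(
            new Token("usa", 1, 4),
            new Token("united", 0, 1),
            new Token("states", 1, 1),
            new Token("of", 1, 1),
            new Token("america", 1, 1));
        System.out.println(toArcs(graph));
    }
}
```

In the first list every arc connects adjacent nodes, which is the sausage 
shape; in the second, the "usa" arc runs from node 0 to node 4 in parallel 
with the four single-position arcs.
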
Graph-aware token streams, like the new {{SynonymGraphFilter}} here, the 
Kuromoji {{JapaneseTokenizer}}, and {{WordDelimiterFilter}} if we improve it, 
can produce correct graphs which can be used at query time to make accurate 
queries.

Today, multi-word synonyms are buggy (see 
https://lucidworks.com/blog/2014/07/12/solution-for-multi-term-synonyms-in-lucenesolr-using-the-auto-phrasing-tokenfilter
 and 
http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html): 
queries that involve the synonyms miss hits that should match and incorrectly 
return hits that should not match.  Query-time synonym expansion with this 
change, combined with separate improvements to the query parser, would fix the 
bug.  The required changes to query parsing are surprisingly contained; see 
https://github.com/elastic/elasticsearch/pull/21517 for an example approach.

I am not proposing, here, that the Lucene index format be changed to support 
indexing a position graph.  Instead, I'm proposing that we make it possible for 
query-time position graphs to work correctly, so multi-token synonyms are no 
longer buggy, and I think this is a good way to make that happen.
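
One way a query parser could consume a query-time position graph is to 
enumerate every path through it and treat each path as its own phrase.  The 
sketch below is a rough, Lucene-free illustration under that assumption: the 
{{Token}} class, the {{allPaths}} helper and the example query are all 
hypothetical, and it does not describe what the linked pull request actually 
does.

```java
import java.util.ArrayList;
import java.util.List;

final class GraphPathsSketch {
    /** Hypothetical stand-in for a token's term plus its posInc and posLength values. */
    static final class Token {
        final String term;
        final int posInc;
        final int posLength;

        Token(String term, int posInc, int posLength) {
            this.term = term;
            this.posInc = posInc;
            this.posLength = posLength;
        }
    }

    /** Returns every start-to-end path through the token graph as a space-joined phrase. */
    static List<String> allPaths(List<Token> tokens) {
        List<String> out = new ArrayList<>();
        if (tokens.isEmpty()) {
            return out;
        }
        int n = tokens.size();
        int[] from = new int[n];
        int[] to = new int[n];
        int pos = -1; // start at -1 so the first posInc=1 token leaves node 0
        int end = 0;  // the final node is the largest arc destination
        for (int i = 0; i < n; i++) {
            pos += tokens.get(i).posInc;
            from[i] = pos;
            to[i] = pos + tokens.get(i).posLength;
            end = Math.max(end, to[i]);
        }
        walk(tokens, from, to, 0, end, new ArrayList<>(), out);
        return out;
    }

    /** Depth-first walk: follow every arc leaving `node` until the end node is reached. */
    private static void walk(List<Token> tokens, int[] from, int[] to, int node, int end,
                             List<String> prefix, List<String> out) {
        if (node == end) {
            out.add(String.join(" ", prefix));
            return;
        }
        for (int i = 0; i < tokens.size(); i++) {
            if (from[i] == node) {
                prefix.add(tokens.get(i).term);
                walk(tokens, from, to, to[i], end, prefix, out);
                prefix.remove(prefix.size() - 1);
            }
        }
    }

    public static void main(String[] args) {
        List<Token> query = List.of(
            new Token("visit", 1, 1),
            new Token("usa", 1, 4),
            new Token("united", 0, 1),
            new Token("states", 1, 1),
            new Token("of", 1, 1),
            new Token("america", 1, 1),
            new Token("today", 1, 1));
        allPaths(query).forEach(System.out::println);
    }
}
```

Each returned phrase could then become one clause of a disjunction.  The number 
of paths can grow exponentially when several multi-token synonyms appear in one 
query, so a real implementation would probably need a cap or an automaton-based 
query instead.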

> Replace SynonymFilter with SynonymGraphFilter
> ---------------------------------------------
>
>                 Key: LUCENE-6664
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6664
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>         Attachments: LUCENE-6664.patch, LUCENE-6664.patch, LUCENE-6664.patch, 
> LUCENE-6664.patch, usa.png, usa_flat.png
>
>
> Spinoff from LUCENE-6582.
> I created a new SynonymGraphFilter (to replace the current buggy
> SynonymFilter), that produces correct graphs (does no "graph
> flattening" itself).  I think this makes it simpler.
> This means you must add the FlattenGraphFilter yourself, if you are
> applying synonyms during indexing.
> Index-time syn expansion is a necessarily "lossy" graph transformation
> when multi-token (input or output) synonyms are applied, because the
> index does not store {{posLength}}, so there will always be phrase
> queries that should match but do not, and then phrase queries that
> should not match but do.
> http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
> goes into detail about this.
> However, with this new SynonymGraphFilter, if instead you do synonym
> expansion at query time (and don't do the flattening), and you use
> TermAutomatonQuery (future: somehow integrated into a query parser),
> or maybe just "enumerate all paths and make union of PhraseQuery", you
> should get 100% correct matches (not sure about "proper" scoring
> though...).
> This new syn filter still cannot consume an arbitrary graph.
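
The lossiness of index-time expansion described in the quoted text can be 
illustrated with a toy positional index that, like the real one, records where 
each term starts but not its {{posLength}}.  Everything below is a hypothetical 
model (the class, the document and the position assignments are assumptions 
made for illustration), not Lucene's postings format or its phrase matching 
code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class LossyIndexSketch {
    // term -> start positions; posLength is deliberately not recorded
    private final Map<String, List<Integer>> postings = new HashMap<>();

    void add(String term, int position) {
        postings.computeIfAbsent(term, k -> new ArrayList<>()).add(position);
    }

    /** Exact phrase match: term i must occur at position start + i. */
    boolean matchesPhrase(String... terms) {
        if (terms.length == 0) {
            return false;
        }
        for (int start : postings.getOrDefault(terms[0], List.of())) {
            boolean ok = true;
            for (int i = 1; i < terms.length && ok; i++) {
                ok = postings.getOrDefault(terms[i], List.of()).contains(start + i);
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    /** Document "united states of america today" with the synonym "usa" stacked at position 0. */
    static LossyIndexSketch example() {
        LossyIndexSketch idx = new LossyIndexSketch();
        idx.add("united", 0);
        idx.add("usa", 0); // should span four positions, but the span is lost
        idx.add("states", 1);
        idx.add("of", 2);
        idx.add("america", 3);
        idx.add("today", 4);
        return idx;
    }

    public static void main(String[] args) {
        LossyIndexSketch idx = example();
        // Should match, but "today" sits at position 4, not position 1.
        System.out.println(idx.matchesPhrase("usa", "today"));
        // Should not match, but "usa" looks like a one-position token at 0.
        System.out.println(idx.matchesPhrase("usa", "states", "of", "america"));
    }
}
```

In this model the phrase "usa today" is missed and the phrase "usa states of 
america" matches by accident, which are the two failure modes described above.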



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

