[
https://issues.apache.org/jira/browse/LUCENE-6664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14942705#comment-14942705
]
Michael McCandless commented on LUCENE-6664:
--------------------------------------------
bq. don't get discouraged,
I'm not discouraged.
Many issues have been blocked before because they are controversial. I
resolved this as "later" because that's the reality of what's happening: it
will be a long time until we can fix these bugs in {{SynonymFilter}}.
This is just how Apache open source works: it's inherently a conservative /
design by committee development model, and one veto to a change blocks it.
Only changes everyone agrees on are allowed. The more successful the project,
the more conservative its development becomes.
The few users who are affected by the buggy {{SynonymFilter}} we have today can
always test {{SynonymGraphFilter}} in this patch with query-time synonyms to
confirm it fixes their phrase query bugs (please report back!).
But users are definitely affected by these bugs today, e.g. see
https://lucidworks.com/blog/solution-for-multi-term-synonyms-in-lucenesolr-using-the-auto-phrasing-tokenfilter
where there's lots of exploration on how to work around the bugs that this
patch in fact fixes correctly, if you are willing/able to use query-time
synonyms.
My original blog post on this topic also explains the bugs:
http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
bq. I recognize that they just don't fit in cleanly with the existing flat
token stream architecture.
I disagree.
This is exactly why we added {{PosLengthAttribute}} originally, and e.g.
Kuromoji makes use of that very well: it produces a graph token stream. I
think there is an overblown/irrational fear of graph tokenizers at work here...
maybe we should remove all graph tokenizers/token filters, along with
{{PosLengthAttribute}}? To arbitrarily declare that only tokenizers (not token
filters) can create new positions makes no sense to me: the resulting output
from the tokenizer or the token filter is indistinguishable.
Furthermore, the graph flattening filter in this patch gives good index-time
back compat if you apply your synonyms during indexing, while also enabling
bug-free query time multi-token synonyms.
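To make the index-time trade-off concrete, here is a minimal sketch of a toy positional index that, like the Lucene index described in the issue text below, records only term positions and discards {{posLength}}. The class {{PositionOnlyIndex}}, its methods, and the "dns" / "domain name service" example are hypothetical illustrations; this is not Lucene's index format and not what the flattening filter in the patch emits.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model (assumption): an index that keeps term positions but no posLength.
final class PositionOnlyIndex {
    // term -> set of positions at which the term was indexed
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    /** Record a token at an absolute position; any posLength it had is dropped. */
    void add(String term, int position) {
        postings.computeIfAbsent(term, k -> new HashSet<>()).add(position);
    }

    /** Exact phrase match: terms[i] must occur at position start + i. Needs >= 1 term. */
    boolean matchesPhrase(String... terms) {
        Set<Integer> none = Collections.emptySet();
        for (int start : postings.getOrDefault(terms[0], none)) {
            boolean ok = true;
            for (int i = 1; i < terms.length && ok; i++) {
                ok = postings.getOrDefault(terms[i], none).contains(start + i);
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    /** Hypothetical document "dns lookup", with dns expanded to "domain name service". */
    static PositionOnlyIndex example() {
        PositionOnlyIndex idx = new PositionOnlyIndex();
        idx.add("dns", 0);      // spans 3 positions in the graph; that span is lost here
        idx.add("domain", 0);
        idx.add("name", 1);
        idx.add("service", 2);
        idx.add("lookup", 3);
        return idx;
    }

    public static void main(String[] args) {
        PositionOnlyIndex idx = example();
        System.out.println(idx.matchesPhrase("domain", "name", "service", "lookup"));
        System.out.println(idx.matchesPhrase("dns", "lookup")); // should match
        System.out.println(idx.matchesPhrase("dns", "name"));   // should not match
    }
}
```

In this toy model the phrase "dns lookup" is missed because "dns" sits at position 0 and "lookup" at position 3, while "dns name" matches even though it never occurred. Both kinds of error follow from dropping {{posLength}}.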
bq. enable graph support in the main Lucene query parser.
Right, this is a missing part now. We do have {{TermAutomatonQuery}}, to
execute the full token graph correctly, but we still have to fix the query
parser to somehow produce that query when it "sees" a graph while tokenizing a
phrase query. Maybe that's not so hard: e.g. we could always create a
{{TermAutomatonQuery}} but fix that query to rewrite to a simple
{{PhraseQuery}} or {{MultiPhraseQuery}} if it was an "easy" case?
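One possible reading of the "easy" case, sketched below under stated assumptions: if no token spans more than one position, the stream is a flat "sausage", so a {{PhraseQuery}} (one term per position) or a {{MultiPhraseQuery}} (several terms stacked at some position) would suffice, and only a true graph needs the automaton. The class {{GraphShape}}, the {{Tok}} holder and the {{classify}} method are hypothetical names, not Lucene APIs; {{posInc}} and {{posLen}} stand in for the position increment and position length attribute values. Position gaps ({{posInc}} > 1) are ignored here.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: decide which query shape a token stream would need.
final class GraphShape {
    enum Kind { PHRASE, MULTI_PHRASE, GRAPH }

    /** Plain holder for one token's term, position increment and position length. */
    static final class Tok {
        final String term;
        final int posInc;
        final int posLen;

        Tok(String term, int posInc, int posLen) {
            this.term = term;
            this.posInc = posInc;
            this.posLen = posLen;
        }
    }

    static Kind classify(List<Tok> toks) {
        boolean stacked = false;
        for (Tok t : toks) {
            if (t.posLen > 1) {
                return Kind.GRAPH;   // a token spans several positions
            }
            if (t.posInc == 0) {
                stacked = true;      // token shares a position with the previous one
            }
        }
        return stacked ? Kind.MULTI_PHRASE : Kind.PHRASE;
    }

    public static void main(String[] args) {
        List<Tok> flat = Arrays.asList(new Tok("fast", 1, 1), new Tok("car", 1, 1));
        List<Tok> stackedSyn = Arrays.asList(
            new Tok("fast", 1, 1), new Tok("quick", 0, 1), new Tok("car", 1, 1));
        List<Tok> graph = Arrays.asList(
            new Tok("dns", 1, 3), new Tok("domain", 0, 1), new Tok("name", 1, 1),
            new Tok("service", 1, 1), new Tok("lookup", 1, 1));
        System.out.println(classify(flat));
        System.out.println(classify(stackedSyn));
        System.out.println(classify(graph));
    }
}
```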
bq. (Alas, Solr, has its own fork of the Lucene query parser.)
Hmm, why? There are so many query parsers now ...
bq. It would also be good to address the issue with non-phrase terms being
analyzed separately
Hmm, what does this mean? I thought the query parsers analyze whole text chunks
between operators, so they could already apply multi-token synonyms outside of
a phrase query?
> Replace SynonymFilter with SynonymGraphFilter
> ---------------------------------------------
>
> Key: LUCENE-6664
> URL: https://issues.apache.org/jira/browse/LUCENE-6664
> Project: Lucene - Core
> Issue Type: New Feature
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Attachments: LUCENE-6664.patch, LUCENE-6664.patch, LUCENE-6664.patch,
> LUCENE-6664.patch, usa.png, usa_flat.png
>
>
> Spinoff from LUCENE-6582.
> I created a new SynonymGraphFilter (to replace the current buggy
> SynonymFilter), that produces correct graphs (does no "graph
> flattening" itself). I think this makes it simpler.
> This means you must add the FlattenGraphFilter yourself, if you are
> applying synonyms during indexing.
> Index-time syn expansion is a necessarily "lossy" graph transformation
> when multi-token (input or output) synonyms are applied, because the
> index does not store {{posLength}}, so there will always be phrase
> queries that should match but do not, and then phrase queries that
> should not match but do.
> http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
> goes into detail about this.
> However, with this new SynonymGraphFilter, if instead you do synonym
> expansion at query time (and don't do the flattening), and you use
> TermAutomatonQuery (future: somehow integrated into a query parser),
> or maybe just "enumerate all paths and make union of PhraseQuery", you
> should get 100% correct matches (not sure about "proper" scoring
> though...).
> This new syn filter still cannot consume an arbitrary graph.
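The "enumerate all paths and make union of PhraseQuery" idea in the description above can be sketched as a depth-first walk over the token graph. This is a minimal illustration, assuming each token is an arc from node {{from}} to node {{to}} (with {{to}} = {{from}} + {{posLength}}), that the graph starts at node 0, and that every arc moves forward. The class {{TokenGraphPaths}}, its methods and the example graph are hypothetical and do not use Lucene APIs; each returned string is one phrase that would become a clause of the union.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: list every start-to-end path of a token graph as a phrase.
final class TokenGraphPaths {
    /** One arc of the graph: a term leading from node `from` to node `to`. */
    static final class Tok {
        final String term;
        final int from;
        final int to;

        Tok(String term, int from, int to) {
            this.term = term;
            this.from = from;
            this.to = to;
        }
    }

    /** Example query "dns lookup" with dns expanded to "domain name service". */
    static List<Tok> example() {
        return Arrays.asList(
            new Tok("dns", 0, 3),      // posLength 3: spans the whole synonym
            new Tok("domain", 0, 1),
            new Tok("name", 1, 2),
            new Tok("service", 2, 3),
            new Tok("lookup", 3, 4));
    }

    static List<String> allPhrases(List<Tok> toks) {
        int end = 0;
        for (Tok t : toks) {
            end = Math.max(end, t.to);   // the final node is the largest `to`
        }
        List<String> out = new ArrayList<>();
        walk(toks, 0, end, new ArrayList<>(), out);
        return out;
    }

    private static void walk(List<Tok> toks, int node, int end,
                             List<String> prefix, List<String> out) {
        if (node == end) {
            out.add(String.join(" ", prefix));
            return;
        }
        for (Tok t : toks) {
            if (t.from == node) {
                prefix.add(t.term);
                walk(toks, t.to, end, prefix, out);
                prefix.remove(prefix.size() - 1);   // backtrack
            }
        }
    }

    public static void main(String[] args) {
        for (String phrase : allPhrases(example())) {
            System.out.println(phrase);
        }
    }
}
```

The number of paths can grow exponentially with the number of multi-token synonyms in a query, which is one reason a single automaton-based query may be preferable to the enumerated union.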