[
https://issues.apache.org/jira/browse/LUCENE-6582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14596490#comment-14596490
]
Michael McCandless commented on LUCENE-6582:
--------------------------------------------
bq. I really hadn't thought of using position lengths as "references", like
this!
It's hard to think about :) But it "just" means the positions become node IDs,
and you must number the nodes "properly" (so that any token always goes from
node X to Y where Y > X).
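
To make the "positions as node IDs" idea concrete, here is a minimal sketch in plain Java (no Lucene dependency). The {{Token}} class and {{toArcs}} helper are hypothetical names made up for illustration, not Lucene APIs; they only mimic position increment and position length. The example tokens assume "wtf" was expanded to "what the fudge" with a corrected position length.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TokenGraphSketch {
    /** Hypothetical stand-in for a token's term, position increment and position length. */
    static final class Token {
        final String term;
        final int posInc;
        final int posLen;

        Token(String term, int posInc, int posLen) {
            this.term = term;
            this.posInc = posInc;
            this.posLen = posLen;
        }
    }

    /** Reads each token as an arc "term: from->to" between node IDs. */
    static List<String> toArcs(List<Token> tokens) {
        List<String> arcs = new ArrayList<>();
        int pos = -1; // start at -1 so a first increment of 1 lands on node 0
        for (Token t : tokens) {
            if (t.posLen < 1) {
                // A position length below 1 would break the "Y > X" rule.
                throw new IllegalArgumentException("posLen must be >= 1: " + t.term);
            }
            pos += t.posInc;          // node X: where the token starts
            int to = pos + t.posLen;  // node Y: where the token ends, always > X
            arcs.add(t.term + ": " + pos + "->" + to);
        }
        return arcs;
    }

    public static void main(String[] args) {
        // Assumed example: "wtf" spans the same nodes as "what the fudge".
        List<Token> tokens = Arrays.asList(
            new Token("wtf", 1, 3),
            new Token("what", 0, 1),
            new Token("the", 1, 1),
            new Token("fudge", 1, 1));
        for (String arc : toArcs(tokens)) {
            System.out.println(arc);
        }
    }
}
```

Here "wtf" goes from node 0 to node 3, the same start and end nodes as the path what, the, fudge, which is what makes the graph "correct" in this sense.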
bq. One problem that I see is that I'll need more buffering
I think that's fine; better correctness trumps the added buffering cost.
bq. One other doubt I have is how this affects the indexer. I imagine it saves
position lengths on the index too, so this shouldn't be a problem, right?
The index does NOT record position length today... I think if we fix syn filter
here to produce the correct graph, we should also insert a "sausagizer" phase
that turns this graph back into a sausage for indexing? (So that "what the
fudge" and "wow that's funny" will in fact match a document that had "wtf").
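
One possible reading of such a "sausagizer", sketched in plain Java under stated assumptions: give every token the longest-path depth of its start node as its index position, and drop the position length. The {{Arc}} class, the {{sausagize}} method, and the node numbering in the example are all hypothetical; this is not the approach from the patch, just an illustration of the idea.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SausagizerSketch {
    /** Hypothetical graph arc: a term going from node 'from' to node 'to'. */
    static final class Arc {
        final String term;
        final int from;
        final int to;

        Arc(String term, int from, int to) {
            this.term = term;
            this.from = from;
            this.to = to;
        }
    }

    /** Maps each sausage position to the terms stacked at that position. */
    static Map<Integer, List<String>> sausagize(List<Arc> arcs) {
        List<Arc> sorted = new ArrayList<>(arcs);
        // Every arc has to > from, so ascending 'from' is a valid processing order:
        // a node's depth is final before any arc leaving it is visited.
        sorted.sort(Comparator.comparingInt((Arc a) -> a.from));
        Map<Integer, Integer> depth = new HashMap<>();
        Map<Integer, List<String>> sausage = new TreeMap<>();
        for (Arc a : sorted) {
            if (a.to <= a.from) {
                throw new IllegalArgumentException("arc must go forward: " + a.term);
            }
            int d = depth.getOrDefault(a.from, 0);
            depth.merge(a.to, d + 1, Math::max); // longest path into the end node
            sausage.computeIfAbsent(d, k -> new ArrayList<>()).add(a.term);
        }
        return sausage;
    }

    public static void main(String[] args) {
        // Assumed node numbering for "wtf" with two three-word synonyms.
        List<Arc> arcs = Arrays.asList(
            new Arc("wtf", 0, 5),
            new Arc("what", 0, 1), new Arc("the", 1, 2), new Arc("fudge", 2, 5),
            new Arc("wow", 0, 3), new Arc("that's", 3, 4), new Arc("funny", 4, 5));
        System.out.println(sausagize(arcs));
    }
}
```

With these positions both "what the fudge" and "wow that's funny" occupy consecutive positions 0, 1, 2, so either phrase would line up against a document that contained "wtf". The cost is that cross-path phrases such as "what that's funny" would line up too, which is the usual sausage trade-off.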
However, if you apply syn filter at search time, we could fix query parsers to
possibly "do the right thing" here, e.g. translating this graph into a union of
phrase queries, or using TermAutomatonQuery (in sandbox still), or something ...
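
As a rough illustration of the "union of phrase queries" option, the sketch below enumerates every path through a token graph; each resulting string would correspond to one phrase query in the union. It is plain Java with no Lucene classes, and {{Arc}}, {{phrases}} and the example node numbering are hypothetical. Note the number of paths can grow exponentially with stacked synonyms, so a real query parser would likely need a cap.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

class GraphPathsSketch {
    /** Hypothetical graph arc: a term going from node 'from' to node 'to' (to > from). */
    static final class Arc {
        final String term;
        final int from;
        final int to;

        Arc(String term, int from, int to) {
            this.term = term;
            this.from = from;
            this.to = to;
        }
    }

    /** Returns one space-joined phrase per path from 'start' to 'end'. */
    static List<String> phrases(List<Arc> arcs, int start, int end) {
        List<String> out = new ArrayList<>();
        walk(arcs, start, end, new ArrayDeque<String>(), out);
        return out;
    }

    // Depth-first walk; terminates because every arc moves to a higher node ID.
    private static void walk(List<Arc> arcs, int node, int end,
                             Deque<String> path, List<String> out) {
        if (node == end) {
            out.add(String.join(" ", path));
            return;
        }
        for (Arc a : arcs) {
            if (a.from == node && a.to > a.from) {
                path.addLast(a.term);
                walk(arcs, a.to, end, path, out);
                path.removeLast();
            }
        }
    }

    public static void main(String[] args) {
        // Assumed node numbering for "wtf" with two three-word synonyms.
        List<Arc> arcs = Arrays.asList(
            new Arc("wtf", 0, 5),
            new Arc("what", 0, 1), new Arc("the", 1, 2), new Arc("fudge", 2, 5),
            new Arc("wow", 0, 3), new Arc("that's", 3, 4), new Arc("funny", 4, 5));
        for (String phrase : phrases(arcs, 0, 5)) {
            System.out.println(phrase);
        }
    }
}
```

Unlike the sausage, this keeps the paths separate, so a cross-path phrase like "what that's funny" is never generated.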
> SynonymFilter should generate a correct (or, at least, better) graph
> --------------------------------------------------------------------
>
> Key: LUCENE-6582
> URL: https://issues.apache.org/jira/browse/LUCENE-6582
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Ian Ribas
> Attachments: LUCENE-6582.patch, LUCENE-6582.patch, after.png,
> after2.png, after3.png, before.png
>
>
> Some time ago, I had a problem with synonyms and phrase type queries
> (actually, it was elasticsearch and I was using a match query with multiple
> terms and the "and" operator, as better explained here:
> https://github.com/elastic/elasticsearch/issues/10394).
> That issue led to some work on Lucene: LUCENE-6400 (where I helped a little
> with tests) and LUCENE-6401. This issue is also related to LUCENE-3843.
> Starting from the discussion on LUCENE-6400, I'm attempting to implement a
> solution. Here is a patch with a first step - the implementation to fix
> "SynFilter to be able to 'make positions'" (as was mentioned on the
> [issue|https://issues.apache.org/jira/browse/LUCENE-6400?focusedCommentId=14498554&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14498554]).
> In this way, the synonym filter generates a correct (or, at least, better)
> graph.
> As the synonym matching is greedy, I only had to worry about fixing the
> position length of the rules of the current match: no future or past synonyms
> would "span" over this match (please correct me if I'm wrong!). It did
> require twice as much buffering.
> The new behavior I added is not active by default; a new parameter has to be
> passed in a new constructor for {{SynonymFilter}}. The changes I made do
> change the token stream generated by the synonym filter, and I thought it
> would be better to let that be a voluntary decision for now.
> I did some refactoring on the code, but mostly on what I had to change for
> my implementation, so that the patch was not too hard to read. I created
> specific unit tests for the new implementation
> ({{TestMultiWordSynonymFilter}}) that should show how things will be with the
> new behavior.