[jira] [Commented] (LUCENE-6582) SynonymFilter should generate a correct (or, at least, better) graph

Michael McCandless (JIRA) Sun, 21 Jun 2015 03:50:29 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-6582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14595017#comment-14595017 ]


Michael McCandless commented on LUCENE-6582:
--------------------------------------------

bq.  changed -10 to -1

Thank you!

bq. Unfortunately, I could not merge the tests into a single file because they 
have different base classes and the new tests depend on asserts on the base 
class.

Ugh, OK, thanks for trying.  It's fine to leave it separate...

bq. The absolute positions still have meaning.

OK but with this change we have now created new nodes (something SynFilter does 
not do today), because wtf now has posLen=3.  This is great (necessary!) for 
SynFilter to be correct...

bq. And there is no difference in the representation of the tokens 
"unexpensive" as a synonym for "cheap" from "wow" as a synonym for "what" on 
the previous rules, using the attributes.

Well, it is entirely possible for PosInc/PosLenAtt to express the correct 
graph; it's just hairy to implement. I think your patch is part way there 
(it creates new positions!).

E.g. here's a sequence of tokens that would be the fully correct graph output 
for the wtf example:

||token||posInc||posLen||
|wtf|1|5|
|what|0|1|
|wow|0|3|
|the|1|1|
|fudge|1|3|
|that's|1|1|
|funny|1|1|
|happened|1|1|

It corresponds to this graph:

!after3.png!
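
To make the mapping from the token table to the graph concrete, here is a minimal sketch in plain Python (not Lucene code). The function names and the node numbering starting at 0 are assumptions for illustration: each token becomes an arc from its position to position + posLen, where the position is the running sum of posInc starting from -1.

```python
# Token rows copied from the table above: (term, posInc, posLen).
TOKENS = [
    ("wtf", 1, 5),
    ("what", 0, 1),
    ("wow", 0, 3),
    ("the", 1, 1),
    ("fudge", 1, 3),
    ("that's", 1, 1),
    ("funny", 1, 1),
    ("happened", 1, 1),
]


def tokens_to_arcs(tokens):
    """Turn (term, posInc, posLen) rows into (from_node, to_node, term) arcs."""
    arcs = []
    pos = -1  # assumed convention: the first posInc of 1 lands on node 0
    for term, pos_inc, pos_len in tokens:
        pos += pos_inc
        arcs.append((pos, pos + pos_len, term))
    return arcs


def paths(arcs, start, end):
    """Enumerate every term sequence from start to end (the graph is acyclic)."""
    if start == end:
        return [[]]
    found = []
    for src, dst, term in arcs:
        if src == start:
            for rest in paths(arcs, dst, end):
                found.append([term] + rest)
    return found


ARCS = tokens_to_arcs(TOKENS)
LAST_NODE = max(dst for _, dst, _ in ARCS)
for path in paths(ARCS, 0, LAST_NODE):
    print(" ".join(path))
```

Under these assumptions the graph has three paths: "wtf happened", "what the fudge happened" and "wow that's funny happened".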

The token posInc/posLen is just a "rote" serialization of the arcs of the graph 
based on how the states are numbered. Other numberings are possible and would 
produce different token outputs, because there is inherent ambiguity in how 
you serialize a graph.  I think the only constraints are that 1) all arcs 
leaving a given state must be serialized one after another (exactly like 
Lucene's Automaton class!), and 2) an arc from node X must go to another node > X 
(i.e., arcs cannot go to an earlier numbered node).
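
The two constraints and the numbering ambiguity can be sketched as follows, again in plain Python rather than Lucene code. The function name, the arc tuples and the second numbering are hypothetical, chosen only to show that the same graph can serialize to different token rows.

```python
def arcs_to_tokens(arcs):
    """Serialize (from_node, to_node, term) arcs as (term, posInc, posLen) rows."""
    for src, dst, term in arcs:
        # Constraint 2: an arc may only go to a higher-numbered node.
        if dst <= src:
            raise ValueError(f"arc {term!r} goes from {src} to {dst}")
    tokens = []
    prev = -1  # assumed convention: node 0 is reached by a first posInc of 1
    # Constraint 1: a stable sort by source node keeps all arcs leaving
    # one state adjacent in the output.
    for src, dst, term in sorted(arcs, key=lambda arc: arc[0]):
        tokens.append((term, src - prev, dst - src))
        prev = src
    return tokens


# Numbering used by the table in this comment.
TABLE_NUMBERING = [
    (0, 5, "wtf"), (0, 1, "what"), (0, 3, "wow"), (1, 2, "the"),
    (2, 5, "fudge"), (3, 4, "that's"), (4, 5, "funny"), (5, 6, "happened"),
]

# Same graph with the interior nodes interleaved (hypothetical alternative).
OTHER_NUMBERING = [
    (0, 5, "wtf"), (0, 1, "what"), (0, 2, "wow"), (1, 3, "the"),
    (2, 4, "that's"), (3, 5, "fudge"), (4, 5, "funny"), (5, 6, "happened"),
]

for row in arcs_to_tokens(TABLE_NUMBERING):
    print(row)
for row in arcs_to_tokens(OTHER_NUMBERING):
    print(row)
```

The first numbering reproduces the rows of the table above; the second gives "wow" a posLen of 2 and "the", "that's" and "fudge" a posLen of 2 each, yet describes the same three paths.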

> SynonymFilter should generate a correct (or, at least, better) graph
> --------------------------------------------------------------------
>
>                 Key: LUCENE-6582
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6582
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Ian Ribas
>         Attachments: LUCENE-6582.patch, LUCENE-6582.patch, after.png, 
> after2.png, after3.png, before.png
>
>
> Some time ago, I had a problem with synonyms and phrase type queries 
> (actually, it was elasticsearch and I was using a match query with multiple 
> terms and the "and" operator, as better explained here: 
> https://github.com/elastic/elasticsearch/issues/10394).
> That issue led to some work on Lucene: LUCENE-6400 (where I helped a little 
> with tests) and  LUCENE-6401. This issue is also related to LUCENE-3843.
> Starting from the discussion on LUCENE-6400, I'm attempting to implement a 
> solution. Here is a patch with a first step - the implementation to fix 
> "SynFilter to be able to 'make positions'" (as was mentioned on the 
> [issue|https://issues.apache.org/jira/browse/LUCENE-6400?focusedCommentId=14498554&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14498554]).
>  In this way, the synonym filter generates a correct (or, at least, better) 
> graph.
> As the synonym matching is greedy, I only had to worry about fixing the 
> position length of the rules of the current match; no future or past synonyms 
> would "span" over this match (please correct me if I'm wrong!). It did 
> require more buffering, twice as much.
> The new behavior I added is not active by default; a new parameter has to be 
> passed in a new constructor for {{SynonymFilter}}. The changes I made do 
> change the token stream generated by the synonym filter, and I thought it 
> would be better to let that be a voluntary decision for now.
> I did some refactoring on the code, but mostly on what I had to change for 
> my implementation, so that the patch was not too hard to read. I created 
> specific unit tests for the new implementation 
> ({{TestMultiWordSynonymFilter}}) that should show how things will be with the 
> new behavior.


