[jira] [Commented] (LUCENE-6582) SynonymFilter should generate a correct (or, at least, better) graph

Michael McCandless (JIRA) Thu, 18 Jun 2015 13:14:43 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-6582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14592430#comment-14592430
 ]


Michael McCandless commented on LUCENE-6582:
--------------------------------------------

I will dig more into this patch; it is a nice (big!) change. But this TODO 
caught my eye:

{noformat}
    // TODO: Problems: In the substitution below, how to identify that the terms "united" "states" "of" "america"
    // are actually a phrase and not individual synonyms of "usa"? And how to differentiate that phrase from the
    // phrase "u" "s" "a"? We can do that adding position lengths ...
{noformat}

One approach could be to use TokenStreamToAutomaton, then enumerate all finite 
strings from the resulting automaton, and assert that the set is as expected, 
i.e. that things did not get unexpectedly "sausaged" (flattened so that tokens 
from different synonyms can mix along one path)?
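
To make the idea concrete, here is a minimal, self-contained sketch of that kind of check. It does not call Lucene at all: TokenGraphPaths, Tok, allPaths and the two example streams are hypothetical names made up for illustration, and the position increments and lengths are hand-picked rather than taken from the patch. Each token is treated as an arc from its start position to start + position length, and every path from the first position to the last is enumerated, which is roughly what enumerating the finite strings of the automaton would give.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Toy model of a token graph. Hypothetical helper, not part of Lucene. */
final class TokenGraphPaths {

    /** One token: term text, position increment, position length. */
    static final class Tok {
        final String term;
        final int posInc;
        final int posLen;

        Tok(String term, int posInc, int posLen) {
            if (posInc < 0 || posLen < 1) {
                throw new IllegalArgumentException("need posInc >= 0 and posLen >= 1");
            }
            this.term = term;
            this.posInc = posInc;
            this.posLen = posLen;
        }
    }

    /** Returns every term sequence leading from position 0 to the last position, sorted. */
    static Set<String> allPaths(List<Tok> stream) {
        int n = stream.size();
        int[] from = new int[n];
        int[] to = new int[n];
        int pos = -1; // the first token, with posInc=1, lands on position 0
        int end = 0;
        for (int i = 0; i < n; i++) {
            Tok t = stream.get(i);
            pos += t.posInc;
            from[i] = pos;
            to[i] = pos + t.posLen; // posLen >= 1, so arcs always move forward
            end = Math.max(end, to[i]);
        }
        Set<String> out = new TreeSet<>();
        walk(stream, from, to, 0, end, new ArrayList<>(), out);
        return out;
    }

    private static void walk(List<Tok> stream, int[] from, int[] to, int node, int end,
                             List<String> prefix, Set<String> out) {
        if (node == end) {
            out.add(String.join(" ", prefix));
            return;
        }
        for (int i = 0; i < from.length; i++) {
            if (from[i] == node) {
                prefix.add(stream.get(i).term);
                walk(stream, from, to, to[i], end, prefix, out);
                prefix.remove(prefix.size() - 1);
            }
        }
    }

    /** Made-up "sausage": every token has posLen=1 and synonyms are simply stacked. */
    static List<Tok> exampleSausage() {
        return Arrays.asList(
            new Tok("usa", 1, 1), new Tok("united", 0, 1), new Tok("u", 0, 1),
            new Tok("states", 1, 1), new Tok("s", 0, 1),
            new Tok("of", 1, 1), new Tok("a", 0, 1),
            new Tok("america", 1, 1));
    }

    /** Made-up graph in which position lengths keep the three alternatives apart. */
    static List<Tok> exampleGraph() {
        return Arrays.asList(
            new Tok("usa", 1, 6), new Tok("united", 0, 1), new Tok("u", 0, 4),
            new Tok("states", 1, 1),
            new Tok("of", 1, 1),
            new Tok("america", 1, 3),
            new Tok("s", 1, 1),
            new Tok("a", 1, 1));
    }
}
```

With these hand-built streams, allPaths(exampleGraph()) yields exactly the three phrases "usa", "united states of america" and "u s a", while allPaths(exampleSausage()) yields twelve strings, including mixed ones such as "usa s a america". A real test would presumably build the expected set by hand in the same way and compare it against what the automaton accepts.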

> SynonymFilter should generate a correct (or, at least, better) graph
> --------------------------------------------------------------------
>
>                 Key: LUCENE-6582
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6582
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Ian Ribas
>         Attachments: LUCENE-6582.patch
>
>
> Some time ago, I had a problem with synonyms and phrase type queries 
> (actually, it was elasticsearch and I was using a match query with multiple 
> terms and the "and" operator, as better explained here: 
> https://github.com/elastic/elasticsearch/issues/10394).
> That issue led to some work on Lucene: LUCENE-6400 (where I helped a little 
> with tests) and  LUCENE-6401. This issue is also related to LUCENE-3843.
> Starting from the discussion on LUCENE-6400, I'm attempting to implement a 
> solution. Here is a patch with a first step - the implementation to fix 
> "SynFilter to be able to 'make positions'" (as was mentioned on the 
> [issue|https://issues.apache.org/jira/browse/LUCENE-6400?focusedCommentId=14498554&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14498554]).
>  In this way, the synonym filter generates a correct (or, at least, better) 
> graph.
> As the synonym matching is greedy, I only had to worry about fixing the 
> position length of the rules of the current match; no future or past synonyms 
> would "span" over this match (please correct me if I'm wrong!). It did 
> require twice as much buffering.
> The new behavior I added is not active by default; a new parameter has to be 
> passed in a new constructor for {{SynonymFilter}}. The changes I made do 
> change the token stream generated by the synonym filter, and I thought it 
> would be better to let that be a voluntary decision for now.
> I did some refactoring on the code, but mostly on what I had to change for 
> my implementation, so that the patch was not too hard to read. I created 
> specific unit tests for the new implementation 
> ({{TestMultiWordSynonymFilter}}) that should show how things will be with the 
> new behavior.



