[jira] [Commented] (LUCENE-6582) SynonymFilter should generate a correct (or, at least, better) graph

Ian Ribas (JIRA) Thu, 18 Jun 2015 19:29:15 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-6582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14592896#comment-14592896 ]


Ian Ribas commented on LUCENE-6582:
-----------------------------------

> But after this change, the expanded synonyms become separate paths in
> the graph right? So it will look like this?
> Matching exactly the right phrases? 

Actually, I think it looks more like this:

!after2.png!

That means that "wtf the fudge" and "wow happened funny" will no longer match, 
but "wow the fudge" would.

The absolute positions still have meaning. I couldn't figure out how to clearly 
separate the two phrases from the rules (as was mentioned in the TODO comment 
above) using the token attributes. They will be stacked in the same order at 
each position, but that doesn't seem to be enough to make things unambiguous, 
especially with rules like:

{noformat}
pass, ticket
cheap, unexpensive
{noformat}

and tokenizing:

{noformat}
cheap pass
{noformat}

I would expect to match all the resulting combinations: "cheap pass", 
"unexpensive pass", "cheap ticket" and "unexpensive ticket". And, using the 
attributes, there is no difference between the representation of the token 
"unexpensive" as a synonym for "cheap" and that of "wow" as a synonym for 
"what" in the previous rules.
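
To make the expectation above concrete, here is a minimal sketch in plain Java. It is not Lucene code and not part of the patch: the class {{StackedSynonyms}} and its methods are hypothetical, and it only models tokens stacked at the same position, assuming single-word synonyms. It enumerates every phrase obtained by picking one token per position, which gives the four combinations listed above. The same per-position enumeration applied to tokens stacked from multi-word rules would also mix tokens across rules, which is the ambiguity described above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in Lucene.
class StackedSynonyms {

    // For each input token, build the list of tokens stacked at its position:
    // the original token first, then its single-word synonyms (if any).
    static List<List<String>> stack(List<String> tokens, Map<String, List<String>> synonyms) {
        List<List<String>> positions = new ArrayList<>();
        for (String token : tokens) {
            List<String> stacked = new ArrayList<>();
            stacked.add(token);
            stacked.addAll(synonyms.getOrDefault(token, List.of()));
            positions.add(stacked);
        }
        return positions;
    }

    // Enumerate every phrase formed by choosing one token at each position.
    static List<String> phrases(List<List<String>> positions) {
        List<String> result = new ArrayList<>();
        result.add("");
        for (List<String> stacked : positions) {
            List<String> next = new ArrayList<>();
            for (String prefix : result) {
                for (String token : stacked) {
                    next.add(prefix.isEmpty() ? token : prefix + " " + token);
                }
            }
            result = next;
        }
        return result;
    }

    public static void main(String[] args) {
        // Only the directions needed for the input "cheap pass" are modeled here.
        Map<String, List<String>> synonyms = new LinkedHashMap<>();
        synonyms.put("pass", List.of("ticket"));
        synonyms.put("cheap", List.of("unexpensive"));

        List<List<String>> positions = stack(List.of("cheap", "pass"), synonyms);
        // prints [cheap pass, cheap ticket, unexpensive pass, unexpensive ticket]
        System.out.println(phrases(positions));
    }
}
```

Since the enumeration looks only at what is stacked at each position, it cannot tell which stacked tokens came from the same rule; that information would have to come from somewhere else, such as position lengths.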

> Why -10?

I was unsure if -1 as an invalid value was clear enough and ended up using -10. 
It could probably just be -1. I'll check.

> Can we just add the new test cases into the existing (tiny) 
> TestMultiWordSynonyms.java?

Probably. Since all tests in the new file use the new constructor to force the 
new behavior, and TestMultiWordSynonyms tests the old behavior, I didn't want 
to mix things. But I'll just join them, with a comment on the test to make it 
clear it's the old behavior.

> In the case where a given input token matches no rules in the FST, are
> we still able to pass that through to the output without buffering
> (calling capture())?

I didn't test this specifically, but I would think so. The new behavior only 
handles matches differently: I didn't have to do any extra buffering before the 
match, nor did I change anything in the matching part of the code. The extra 
buffering is needed only when there was already buffering (lookahead for a 
partial match) and the synonym was longer than the match.

> SynonymFilter should generate a correct (or, at least, better) graph
> --------------------------------------------------------------------
>
>                 Key: LUCENE-6582
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6582
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Ian Ribas
>         Attachments: LUCENE-6582.patch, after.png, after2.png, before.png
>
>
> Some time ago, I had a problem with synonyms and phrase type queries 
> (actually, it was elasticsearch and I was using a match query with multiple 
> terms and the "and" operator, as better explained here: 
> https://github.com/elastic/elasticsearch/issues/10394).
> That issue led to some work on Lucene: LUCENE-6400 (where I helped a little 
> with tests) and  LUCENE-6401. This issue is also related to LUCENE-3843.
> Starting from the discussion on LUCENE-6400, I'm attempting to implement a 
> solution. Here is a patch with a first step - the implementation to fix 
> "SynFilter to be able to 'make positions'" (as was mentioned on the 
> [issue|https://issues.apache.org/jira/browse/LUCENE-6400?focusedCommentId=14498554&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14498554]).
>  In this way, the synonym filter generates a correct (or, at least, better) 
> graph.
> As the synonym matching is greedy, I only had to worry about fixing the 
> position length of the rules of the current match; no future or past synonyms 
> would "span" over this match (please correct me if I'm wrong!). It did 
> require more buffering, twice as much.
> The new behavior I added is not active by default; a new parameter has to be 
> passed in a new constructor for {{SynonymFilter}}. The changes I made do 
> change the token stream generated by the synonym filter, and I thought it 
> would be better to let that be a voluntary decision for now.
> I did some refactoring on the code, but mostly on what I had to change for 
> my implementation, so that the patch was not too hard to read. I created 
> specific unit tests for the new implementation 
> ({{TestMultiWordSynonymFilter}}) that should show how things will be with the 
> new behavior.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

