RE: Initial work on multi word synonyms and phrase queries

Ian Fri, 19 Jun 2015 14:11:32 -0700

The problem with the tests was actually caused by the IDE (IntelliJ). Running
the tests with ant directly works just fine. I just wanted to have this on
record.
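
For anyone who sees the same "missing postings format" errors only inside an
IDE, here is a minimal sketch of a classpath check. It rests on an assumption
not confirmed in this thread: that the IDE run was missing the lucene/codecs
resources, where the formats are registered in a META-INF/services file named
after org.apache.lucene.codecs.PostingsFormat. The class and method names
(SpiCheck, registeredNames) are made up for illustration; only JDK calls are
used.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: lists the class names registered in every copy of a
// META-INF/services file that the given class loader can see.
final class SpiCheck {
    static final String POSTINGS_FORMATS =
        "META-INF/services/org.apache.lucene.codecs.PostingsFormat";

    static List<String> registeredNames(ClassLoader loader, String resource) {
        List<String> names = new ArrayList<>();
        try {
            for (URL url : Collections.list(loader.getResources(resource))) {
                try (BufferedReader in = new BufferedReader(
                        new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        // Service files allow '#' comments and blank lines.
                        int hash = line.indexOf('#');
                        if (hash >= 0) {
                            line = line.substring(0, hash);
                        }
                        line = line.trim();
                        if (!line.isEmpty()) {
                            names.add(line);
                        }
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        List<String> names =
            registeredNames(SpiCheck.class.getClassLoader(), POSTINGS_FORMATS);
        if (names.isEmpty()) {
            System.out.println("No PostingsFormat registrations visible on this classpath");
        } else {
            names.forEach(System.out::println);
        }
    }
}
```

Run it from the same run configuration as the failing test. If the list is
empty, or lacks the FST postings format classes, the codecs resources are
probably not on that classpath.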


From: [email protected]
To: [email protected]
Subject: RE: Initial work on multi word synonyms and phrase queries
Date: Thu, 18 Jun 2015 11:53:23 +0000




Issue opened: https://issues.apache.org/jira/browse/LUCENE-6582.

@rcmuir, that change in the test is actually a leftover from one of my previous
solutions while exploring the problem. It is no longer necessary and I removed
it from the patch attached to the issue above.

To explain a little: in an earlier solution, the current inputs were always the
first tokens in the output, even if there were longer synonyms (in number of
terms). That created an inconsistency between position increments and position
lengths, as I wasn't sure I could have a position increment greater than 1. So I
changed it so that the first tokens, the ones that actually increment the
positions, come from the longer synonym. In this way, the token stream has the
same behavior as before: whenever the position increment is 1, the position
length is also 1. But that means that, when keepOriginal = true and there are
synonyms with more terms than the input, the original input (tokens with
type="word") will appear in the output stream "stacked" on top of the synonym
tokens. This seemed to me less likely to have an impact elsewhere.
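
To make the stacking concrete, here is a small self-contained sketch (plain
Java, no Lucene dependency) that resolves position increments and position
lengths into graph arcs. The rule "dns" -> "domain name system" and the exact
token order are assumptions chosen for illustration, not output taken from the
patch; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class TokenGraphSketch {
    // One token as a consumer sees it.
    static final class Token {
        final String term; final int posInc; final int posLength; final String type;
        Token(String term, int posInc, int posLength, String type) {
            this.term = term; this.posInc = posInc;
            this.posLength = posLength; this.type = type;
        }
    }

    // The same token resolved to graph positions: it spans from -> to.
    static final class Arc {
        final String term; final int from; final int to;
        Arc(String term, int from, int to) {
            this.term = term; this.from = from; this.to = to;
        }
    }

    // Assumed output for input "dns" with keepOriginal = true: the longer
    // synonym supplies the tokens that advance the position, and the original
    // word is stacked (posInc = 0) on the first of them.
    static List<Token> example() {
        List<Token> tokens = new ArrayList<>();
        tokens.add(new Token("domain", 1, 1, "SYNONYM"));
        tokens.add(new Token("dns", 0, 3, "word"));
        tokens.add(new Token("name", 1, 1, "SYNONYM"));
        tokens.add(new Token("system", 1, 1, "SYNONYM"));
        return tokens;
    }

    // Usual consumer-side rule: advance by posInc, then span posLength positions.
    static List<Arc> toArcs(List<Token> tokens) {
        List<Arc> arcs = new ArrayList<>();
        int pos = -1; // so that the first token with posInc = 1 lands on position 0
        for (Token t : tokens) {
            pos += t.posInc;
            arcs.add(new Arc(t.term, pos, pos + t.posLength));
        }
        return arcs;
    }

    public static void main(String[] args) {
        for (Arc a : toArcs(example())) {
            System.out.println(a.term + ": " + a.from + " -> " + a.to);
        }
    }
}
```

Under these assumptions the sketch prints domain: 0 -> 1, dns: 0 -> 3,
name: 1 -> 2 and system: 2 -> 3, so every token with a position increment of 1
also has a position length of 1, and only the stacked original spans the whole
synonym.
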

Glad to hear you also find that code complicated. I was assuming it was hard
for me because I'm new to the code base ;-)

About the failing tests: in my setup, they are flaky. They sometimes pass and
sometimes fail, and it is not always the same tests. But they always complain
about missing postings formats (last time it was 'FST50'). I'll look around a
little more to see if I can figure out what's wrong.

Ian
> From: [email protected]
> Date: Thu, 18 Jun 2015 06:02:09 -0400
> Subject: Re: Initial work on multi word synonyms and phrase queries
> To: [email protected]; [email protected]
> 
> +1 to opening an issue, thanks for exploring this!  It's hairy :)
> 
> Your windows test failures complaining about FSTOrd50 missing is
> curious ... I don't run Windows but maybe someone who does has an
> idea?  That postings format comes from lucene/codecs which should be
> on the class path during tests...
> 
> Mike McCandless
> 
> http://blog.mikemccandless.com
> 
> On Wed, Jun 17, 2015 at 10:21 PM, Robert Muir <[email protected]> wrote:
> > Hey, thanks for tackling this! That synonymfilter is a beast...
> >
> > Can you open a JIRA issue with your patch?
> >
> > To me the interesting part is this change in the test:
> >
> >           if (posInc > 0) {
> >             // This token increments position, so it is starting a new
> >             // position.
> >             // Its position is the last position plus the posLength of the
> >             // last token that started a position.
> >             pos += lastPosLength;
> >             lastPosLength = posLength;
> >           }
> >
> > This currently implies some change to how posInc/posLen are treated on
> > the consumer side: it would need changes to queryparsers and
> > indexwriter to work (which is fine, we could figure out those
> > semantics). But it's my understanding this logic might be based on some
> > properties specific to synonymfilter being greedy, and not really
> > general to all streams. So maybe synonymfilter or some other filter
> > needs to do this adjustment internally instead.
> >
> > Anyway, I think we should make an issue and investigate it.
> >
> > On Wed, Jun 17, 2015 at 9:56 PM, Ian <[email protected]> wrote:
> >> Hello,
> >>
> >> Some time ago, I had a problem with synonyms and phrase type queries
> >> (actually, it was elasticsearch and I was using a match query with multiple
> >> terms and the "and" operator, as better explained here:
> >> https://github.com/elastic/elasticsearch/issues/10394).
> >>
> >> That issue led to some work on Lucene:
> >> https://issues.apache.org/jira/browse/LUCENE-6400 (where I helped a little
> >> with tests) and  https://issues.apache.org/jira/browse/LUCENE-6401. This
> >> issue is also related to https://issues.apache.org/jira/browse/LUCENE-3843.
> >>
> >> Starting from the discussion on LUCENE-6400, I'm attempting to implement a
> >> solution. Here is a patch with a first step - the implementation to fix
> >> "SynFilter to be able to 'make positions'" (as was mentioned on the issue).
> >> In this way, the synonym filter generates a correct (or, at least, better)
> >> graph.
> >>
> >> As the synonym matching is greedy, I only had to worry about fixing the
> >> position length of the rules of the current match, no future or past
> >> synonyms would "span" over this match (please correct me if I'm wrong!). It
> >> did require more buffering, twice as much.
> >>
> >> The new behavior I added is not active by default; a new parameter has to be
> >> passed in a new constructor for SynonymFilter. The changes I made do change
> >> the token stream generated by the synonym filter, and I thought it would be
> >> better to let that be a voluntary decision for now.
> >>
> >> I did some refactoring on the code, but mostly on what I had to change for
> >> my implementation, so that the patch was not too hard to read. I created
> >> specific unit tests for the new implementation (TestMultiWordSynonymFilter)
> >> that should show how things will be with the new behavior.
> >>
> >> Speaking of tests, I ran "analysis-common" tests locally (windows 8, java
> >> 8), and had only 2 unrelated failures (as far as I can tell) complaining of
> >> missing PostingsFormat "FSTOrd50".
> >>
> >> Thanks for any help, comment, adjustment on the patch. I'll do my best to
> >> make the necessary adjustments.
> >>
> >> Please forgive me if I did not follow any rule, of the code or of the list,
> >> and I would be grateful to be able to learn from my mistakes.
> >>
> >> Regards,
> >> Ian
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: [email protected]
> >> For additional commands, e-mail: [email protected]