I just committed the edge-ngrams fix on the 4.3 release branch.

I will not -1 RC2 for this, but if we're respinning anyway for SOLR-4746, 
including the edge-ngrams fix in the respin shouldn't be a problem.

Steve

On Apr 22, 2013, at 9:27 AM, Robert Muir <rcm...@gmail.com> wrote:

> If I were the RM, I would not respin for this edge-ngrams filter.
> 
> We already have tests to find such bugs, but these tests are currently 
> disabled (!) because the filter is basically rotting.
> 
> So I can't see how something can be important enough to respin a release 
> candidate for, yet not important enough for anyone to care whether its unit 
> tests actually work.
> 
> On Mon, Apr 22, 2013 at 9:17 AM, Simon Willnauer <simon.willna...@gmail.com> 
> wrote:
> I think we can add this to 4.3; I can roll another RC for that.
> 
> simon
> 
> On Mon, Apr 22, 2013 at 3:11 PM, Jack Krupansky <j...@basetechnology.com> 
> wrote:
> > Is this a fix to 4.3 (RC3?) or for a 4.3.1?
> >
> > -- Jack Krupansky
> >
> > -----Original Message----- From: Steve Rowe
> > Sent: Monday, April 22, 2013 2:07 AM
> >
> > To: dev@lucene.apache.org
> > Subject: Re: "[VOTE] Lucene/Solr 4.3 Take 2 (RC2)"
> >
> > I've reopened LUCENE-4810 and attached a patch with a test and fix for this
> > problem. - Steve
> >
> > On Apr 22, 2013, at 1:09 AM, Steve Rowe <sar...@gmail.com> wrote:
> >
> >> Actually, Walter, I misspoke: Morfologik is a lemmatizer: it produces
> >> surface forms.  Not really so incompatible, I think.
> >>
> >> Regardless of the choice to use this particular sequence of filters,
> >> EdgeNGramTokenFilter shouldn't produce a bad stream.
> >>
> >> Steve
> >>
> >> On Apr 21, 2013, at 8:34 PM, Walter Underwood <wun...@wunderwood.org>
> >> wrote:
> >>
> >>> Don't use a stemmer with edge ngrams.
> >>>
> >>> Edge ngrams are a tool for matching the surface word. Stemmers are a tool
> >>> for matching the root. Those are logically incompatible transforms.
> >>>
> >>> wunder
> >>>
> >>> On Apr 21, 2013, at 5:21 PM, Steve Rowe wrote:
> >>>
> >>>> Karol has uncovered a bug introduced by LUCENE-4810
> >>>> <https://issues.apache.org/jira/browse/LUCENE-4810>, included in
> >>>> Lucene/Solr 4.3.0.
> >>>>
> >>>> The problem is an interaction between the Morfologik stemmer, which can
> >>>> produce multiple stems per input term, all but the first having a
> >>>> position increment of zero, and EdgeNGramTokenFilter, which only outputs
> >>>> ngrams for input terms that are at least as long as the minimum
> >>>> configured length, and passes through unchanged the position increment
> >>>> for the first ngram output for any given input term.
> >>>>
> >>>> So what happens in Karol's case is that "T." has the period stripped by
> >>>> StandardTokenizer, then is stemmed by Morfologik to produce terms "to",
> >>>> "tom" and "tona".  The first term "to" has a position increment of 1,
> >>>> but is not output by EdgeNGramTokenFilter, because its length is below
> >>>> the configured minimum of 3.  The second term "tom" is given a position
> >>>> increment of 0 by Morfologik, and meets EdgeNGramTokenFilter's minimum
> >>>> length, so gets output, and since it's the first output term for the
> >>>> input term "tom", the input position increment is left as-is in the
> >>>> output term: 0.  That's how the first output term gets a position
> >>>> increment of 0.
> >>>>
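An illustrative aside on the paragraph above: a toy Python model of the interaction, not Lucene code. The function name, the (term, position increment) tuple representation, and max_gram=4 are assumptions made for this sketch; the thread only states the minimum length of 3 and the three stems. Giving later ngrams of the same term an increment of 0 is also an assumption of the model.

```python
from typing import Iterable, Iterator, Tuple

Token = Tuple[str, int]  # (term text, position increment)


def edge_ngrams_passthrough(
    tokens: Iterable[Token], min_gram: int, max_gram: int
) -> Iterator[Token]:
    """Toy model of the behaviour described in the thread (hypothetical helper).

    Terms shorter than min_gram produce no output. The first ngram of each
    surviving term inherits that term's position increment unchanged; later
    ngrams of the same term get 0 (an assumption of this sketch).
    """
    for term, pos_inc in tokens:
        if len(term) < min_gram:
            continue  # dropped, and its position increment of 1 goes with it
        for n in range(min_gram, min(max_gram, len(term)) + 1):
            yield term[:n], (pos_inc if n == min_gram else 0)


# Stems reported in the thread for "T": the first has increment 1, the rest 0.
stems = [("to", 1), ("tom", 0), ("tona", 0)]
output = list(edge_ngrams_passthrough(stems, min_gram=3, max_gram=4))
print(output)
```

In this model "to" is skipped, so the first token that reaches the indexer is ("tom", 0), which matches the "first position increment must be > 0 (got 0)" failure Karol reported.
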
> >>>> Before LUCENE-4810 was committed and included in Lucene/Solr 4.3.0,
> >>>> EdgeNGramTokenFilter indiscriminately set all output terms' position
> >>>> increments to 1, so that explains why this behavior didn't occur with
> >>>> previously released versions.
> >>>>
> >>>> I think the fix is a check in EdgeNGramTokenFilter when outputting the
> >>>> first term, that the position increment is greater than 0, and if it's
> >>>> not, then it should be set to 1.
> >>>>
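A minimal sketch of the check proposed above, again as a toy Python model rather than the committed patch. It reads "first term" as the first token emitted for the whole stream, which is an interpretation; clamp_first_increment and the tuple representation are hypothetical names for illustration only.

```python
from typing import Iterable, Iterator, Tuple

Token = Tuple[str, int]  # (term text, position increment)


def clamp_first_increment(tokens: Iterable[Token]) -> Iterator[Token]:
    """Raise the first emitted token's position increment to at least 1.

    All later tokens pass through untouched, so stacked tokens (increment 0)
    after the first one keep their positions relative to it.
    """
    first = True
    for term, pos_inc in tokens:
        if first and pos_inc < 1:
            pos_inc = 1  # hypothetical fix: never start a stream at increment 0
        first = False
        yield term, pos_inc


# Output of a filter that dropped "to" and passed increments through as-is.
broken = [("tom", 0), ("ton", 0), ("tona", 0)]
fixed = list(clamp_first_increment(broken))
print(fixed)
```

Only the first token changes, so a stream that already starts with a positive increment comes out exactly as it went in.
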
> >>>> Does anybody know if this could also be an issue for other filters?
> >>>>
> >>>> I'll work on a patch for EdgeNGramTokenFilter.
> >>>>
> >>>> Steve
> >>>>
> >>>> On Apr 21, 2013, at 9:21 AM, Karol Sikora <karol.sik...@laboratorium.ee>
> >>>> wrote:
> >>>>
> >>>>> hi,
> >>>>>
> >>>>> I extracted a minimal failing example; the Solr configs (schema,
> >>>>> solrconfig.xml) and data are in the attached archive.
> >>>>> I am trying to import these simple documents:
> >>>>> [
> >>>>>   {
> >>>>>       "publisher": [
> >>>>>           "T. Gl\u00fccksberg"
> >>>>>       ],
> >>>>>       "uid": "1000881"
> >>>>>   },
> >>>>>   {
> >>>>>       "publisher": [
> >>>>>           "Ala a kota"
> >>>>>       ],
> >>>>>       "uid": "1000894"
> >>>>>   }
> >>>>> ]
> >>>>> The first fails on the copyField destination publisher_hl with an
> >>>>> exception (trace: https://gist.github.com/anonymous/5429558); the
> >>>>> second is added without any problems.
> >>>>> schema.xml is here: https://gist.github.com/anonymous/5429562
> >>>>>
> >>>>> Anyone trying to reproduce this behaviour should remember to copy the
> >>>>> libs for the Morfologik and ICU filters.
> >>>>>
> >>>>> This extracted example works fine with Solr 4.0 - 4.2.1.
> >>>>>
> >>>>> Regards,
> >>>>> Karol
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 21.04.2013 09:03, Simon Willnauer wrote:
> >>>>>>
> >>>>>> hey karol,
> >>>>>>
> >>>>>> can you reproduce this behaviour in a small test-case (curl command or
> >>>>>> something like this) that we can reproduce?
> >>>>>>
> >>>>>> @solr guys any idea what this could be?
> >>>>>>
> >>>>>> simon
> >>>>>>
> >>>>>> On Sun, Apr 21, 2013 at 1:52 AM, Karol Sikora
> >>>>>>
> >>>>>> <karol.sik...@laboratorium.ee>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Hi all,
> >>>>>>>
> >>>>>>> I have a problem with Solr 4.3 RC2 on the test data for a search
> >>>>>>> application I'm developing.
> >>>>>>> Importing many records fails with the exception
> >>>>>>> "java.lang.IllegalArgumentException: first position increment must be > 0 (got 0)".
> >>>>>>> On versions from early 4.0 to 4.2.1 all documents were added
> >>>>>>> successfully, so I think something is broken in the new release.
> >>>>>>> I'll try to examine tomorrow what is broken.
> >>>>>>>
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>> Karol
> >>>>>>>
> >>>>>>> On 20.04.2013 21:07, Andi Vajda wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>>> On Sat, 20 Apr 2013, Simon Willnauer wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> Here is the RC:
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> http://people.apache.org/~simonw/staging_area/lucene-solr-4.3.0-RC2-rev1470054
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> happy voting...
> >>>>>>>>>
> >>>>>>>>> here is my +1
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> PyLucene 4.3 builds and passes its tests.
> >>>>>>>>
> >>>>>>>> +1 !
> >>>>>>>>
> >>>>>>>> Andi..
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> ---------------------------------------------------------------------
> >>>>>>>> To unsubscribe, e-mail:
> >>>>>>>> dev-unsubscr...@lucene.apache.org
> >>>>>>>>
> >>>>>>>> For additional commands, e-mail:
> >>>>>>>> dev-h...@lucene.apache.org
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>> --
> >>>>>>> Karol Sikora
> >>>>>>> +48 781 493 788
> >>>>>>>
> >>>>>>> Laboratorium EE
> >>>>>>> ul. Mokotowska 46A/23 | 00-543 Warszawa |
> >>>>>>>
> >>>>>>> www.laboratorium.ee | www.laboratorium.ee/facebook
> >>>>>>>
> >>>>>
> >>>>> --
> >>>>>
> >>>>> Karol Sikora
> >>>>> IT Manager, CBN - Interfejs 2.0 project
> >>>>> +48 781 493 788
> >>>>>
> >>>>> Laboratorium EE
> >>>>> ul. Mokotowska 46A/23 | 00-543 Warszawa |
> >>>>>
> >>>>> www.laboratorium.ee | www.laboratorium.ee/facebook
> >>>
> >>> --
> >>> Walter Underwood
> >>> wun...@wunderwood.org
> >>>
> >>>
> >>>
> >>
> >
> 


