Actually, Walter, I misspoke: Morfologik is a lemmatizer, not a stemmer, so it produces surface forms. Not really so incompatible with edge ngrams, I think.
Regardless of the choice to use this particular sequence of filters, EdgeNGramTokenFilter shouldn't produce a bad stream.

Steve

On Apr 21, 2013, at 8:34 PM, Walter Underwood <wun...@wunderwood.org> wrote:

> Don't use a stemmer with edge ngrams.
>
> Edge ngrams are a tool for matching the surface word. Stemmers are a tool for
> matching the root. Those are logically incompatible transforms.
>
> wunder
>
> On Apr 21, 2013, at 5:21 PM, Steve Rowe wrote:
>
>> Karol has uncovered a bug introduced by LUCENE-4810
>> <https://issues.apache.org/jira/browse/LUCENE-4810>, included in Lucene/Solr
>> 4.3.0.
>>
>> The problem is an interaction between the Morfologik stemmer, which can
>> produce multiple stems per input term, all but the first having a position
>> increment of zero, and EdgeNGramTokenFilter, which only outputs ngrams for
>> input terms that are at least as long as the minimum configured length, and
>> passes through unchanged the position increment for the first ngram output
>> for any given input term.
>>
>> So what happens in Karol's case is that "T." has the period stripped by
>> StandardTokenizer, then is stemmed by Morfologik to produce terms "to",
>> "tom" and "tona". The first term "to" has a position increment of 1, but is
>> not output by EdgeNGramTokenFilter, because its length is below the
>> configured minimum of 3. The second term "tom" is given a position
>> increment of 0 by Morfologik, and meets EdgeNGramTokenFilter's minimum
>> length, so gets output, and since it's the first output term for the input
>> term "tom", the input position increment is left as-is in the output term:
>> 0. That's how the first output term gets a position increment of 0.
>>
>> Before LUCENE-4810 was committed and included in Lucene/Solr 4.3.0,
>> EdgeNGramTokenFilter indiscriminately set all output terms' position
>> increments to 1, so that explains why this behavior didn't occur with
>> previously released versions.
>>
>> I think the fix is a check in EdgeNGramTokenFilter when outputting the first
>> term, that the position increment is greater than 0, and if it's not, then
>> it should be set to 1.
>>
>> Does anybody know if this could also be an issue for other filters?
>>
>> I'll work on a patch for EdgeNGramTokenFilter.
>>
>> Steve
>>
>> On Apr 21, 2013, at 9:21 AM, Karol Sikora <karol.sik...@laboratorium.ee>
>> wrote:
>>
>>> Hi,
>>>
>>> I extracted a minimal failing example; the Solr configs (schema, solrconfig.xml)
>>> and data are in the attached archive.
>>> I try to import a simple document:
>>> [
>>>   {
>>>     "publisher": [
>>>       "T. Gl\u00fccksberg"
>>>     ],
>>>     "uid": "1000881"
>>>   },
>>>   {
>>>     "publisher": [
>>>       "Ala a kota"
>>>     ],
>>>     "uid": "1000894"
>>>   }
>>> ]
>>> The first fails on the copyfield destination publisher_hl with an exception (trace:
>>> https://gist.github.com/anonymous/5429558); the second is added without any
>>> problems.
>>> schema.xml is here: https://gist.github.com/anonymous/5429562
>>>
>>> Anyone trying to reproduce this behaviour should remember to copy the libs
>>> related to the morfologik and icu filters.
>>>
>>> This extracted example works fine with Solr 4.0 - 4.2.1.
>>>
>>> Regards,
>>> Karol
>>>
>>> On 21.04.2013 09:03, Simon Willnauer wrote:
>>>> hey karol,
>>>>
>>>> can you reproduce this behaviour in a small test-case (curl command or
>>>> something like this) that we can reproduce?
>>>>
>>>> @solr guys any idea what this could be?
>>>>
>>>> simon
>>>>
>>>> On Sun, Apr 21, 2013 at 1:52 AM, Karol Sikora
>>>> <karol.sik...@laboratorium.ee> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I have a problem with Solr 4.3 RC2 on my testing data for a search
>>>>> application which I'm developing.
>>>>> A lot of imported records fail with the exception
>>>>> "java.lang.IllegalArgumentException: first position increment must be > 0
>>>>> (got 0)".
>>>>> On versions from early 4.0 to 4.2.1 all documents were added
>>>>> successfully, so I think something is broken in the new release.
>>>>> I'll try to examine tomorrow what is broken.
>>>>>
>>>>> Regards,
>>>>> Karol
>>>>>
>>>>> On 20.04.2013 21:07, Andi Vajda wrote:
>>>>>
>>>>>> On Sat, 20 Apr 2013, Simon Willnauer wrote:
>>>>>>
>>>>>>> Here is the RC:
>>>>>>>
>>>>>>> http://people.apache.org/~simonw/staging_area/lucene-solr-4.3.0-RC2-rev1470054
>>>>>>>
>>>>>>> happy voting...
>>>>>>>
>>>>>>> here is my +1
>>>>>>
>>>>>> PyLucene 4.3 builds and passes its tests.
>>>>>>
>>>>>> +1 !
>>>>>>
>>>>>> Andi..
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>>>>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>>>>
>>>>> --
>>>>> Karol Sikora
>>>>> +48 781 493 788
>>>>>
>>>>> Laboratorium EE
>>>>> ul. Mokotowska 46A/23 | 00-543 Warszawa |
>>>>>
>>>>> www.laboratorium.ee | www.laboratorium.ee/facebook
>>>
>>> --
>>>
>>> Karol Sikora
>>> IT Manager of the CBN - Interfejs 2.0 project
>>> +48 781 493 788
>>>
>>> Laboratorium EE
>>> ul. Mokotowska 46A/23 | 00-543 Warszawa |
>>>
>>> www.laboratorium.ee | www.laboratorium.ee/facebook
>
> --
> Walter Underwood
> wun...@wunderwood.org
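
For anyone following along, the position-increment interaction described in the thread can be sketched with a small standalone model. This is only an illustration in Python, not Lucene's Java code: the `Token` tuple, the `edge_ngrams` function, the `fix_first` flag and the maximum gram size of 4 are all assumptions made for the example (the thread only states the minimum of 3). The rule that later ngrams of the same input term get an increment of 0 is also an assumption inferred from the description.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    term: str
    pos_inc: int  # position increment relative to the previous token


def edge_ngrams(tokens: List[Token], min_gram: int, max_gram: int,
                fix_first: bool = False) -> List[Token]:
    """Toy model of the behaviour described in the thread (not Lucene's code)."""
    out: List[Token] = []
    for tok in tokens:
        if len(tok.term) < min_gram:
            # Too short: nothing is emitted, so this term's increment is lost.
            continue
        for n in range(min_gram, min(max_gram, len(tok.term)) + 1):
            # The first ngram inherits the input increment unchanged;
            # later ngrams are assumed to stack at the same position.
            pos_inc = tok.pos_inc if n == min_gram else 0
            out.append(Token(tok.term[:n], pos_inc))
    # Hypothetical version of the proposed check: the first output term of
    # the stream must have an increment greater than 0.
    if fix_first and out and out[0].pos_inc == 0:
        out[0] = out[0]._replace(pos_inc=1)
    return out


# The three terms the thread says Morfologik produces for "T."
stems = [Token("to", 1), Token("tom", 0), Token("tona", 0)]

buggy = edge_ngrams(stems, min_gram=3, max_gram=4)
fixed = edge_ngrams(stems, min_gram=3, max_gram=4, fix_first=True)

print(buggy[0].pos_inc)  # 0: "to" was dropped, so "tom" leads with increment 0
print(fixed[0].pos_inc)  # 1: the first output term is bumped to 1
```

In this model the unfixed stream starts with an increment of 0, which is the condition behind the "first position increment must be > 0 (got 0)" exception. The `fix_first` branch corresponds to the check proposed for EdgeNGramTokenFilter; an actual patch may well look different.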