Actually, Walter, I misspoke: Morfologik is a lemmatizer, not a stemmer, so it 
produces surface forms.  Not really so incompatible with edge ngrams, I think.

Regardless of the choice to use this particular sequence of filters, 
EdgeNGramTokenFilter shouldn't produce a bad stream.
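
To make the failure concrete, here is a rough sketch in plain Python of the 
interaction described in my earlier message quoted below. It is a toy model, 
not Lucene code: the function names, the (term, increment) tuples, and the 
maximum gram size of 4 are assumptions for illustration, as is giving every 
ngram after the first for a term an increment of 0. Only the stems, the 
minimum gram size of 3, and the exception message come from Karol's case.

```python
from typing import List, Tuple

Token = Tuple[str, int]  # (term text, position increment)


def edge_ngrams(tokens: List[Token], min_gram: int, max_gram: int,
                fix_first_increment: bool = False) -> List[Token]:
    """Toy model of the described filter behavior (hypothetical, not Lucene)."""
    out: List[Token] = []
    for term, pos_inc in tokens:
        if len(term) < min_gram:
            # Too short: the term is dropped, and its increment is lost with it.
            continue
        for n in range(min_gram, min(max_gram, len(term)) + 1):
            # The first ngram of a term keeps the input increment;
            # later ngrams of the same term are assumed to stack at 0.
            inc = pos_inc if n == min_gram else 0
            # Proposed fix: the first token of the whole output stream
            # must have an increment greater than 0, so bump it to 1.
            if fix_first_increment and not out and inc <= 0:
                inc = 1
            out.append((term[:n], inc))
    return out


def check_stream(tokens: List[Token]) -> None:
    """Mimics the indexing-time check that rejected Karol's document."""
    if tokens and tokens[0][1] <= 0:
        raise ValueError(
            "first position increment must be > 0 (got %d)" % tokens[0][1])


# Stems reported for "T." in Karol's case: only the first has increment 1.
stems = [("to", 1), ("tom", 0), ("tona", 0)]

broken = edge_ngrams(stems, min_gram=3, max_gram=4)
fixed = edge_ngrams(stems, min_gram=3, max_gram=4, fix_first_increment=True)

print(broken)  # [('tom', 0), ('ton', 0), ('tona', 0)]
print(fixed)   # [('tom', 1), ('ton', 0), ('tona', 0)]
```

Passing the first list to check_stream raises the same message Karol saw, 
while the second passes. The point is that the bad stream comes from dropping 
"to", not from the choice of filters upstream.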

Steve

On Apr 21, 2013, at 8:34 PM, Walter Underwood <wun...@wunderwood.org> wrote:

> Don't use a stemmer with edge ngrams.
> 
> Edge ngrams are a tool for matching the surface word. Stemmers are a tool for 
> matching the root. Those are logically incompatible transforms. 
> 
> wunder
> 
> On Apr 21, 2013, at 5:21 PM, Steve Rowe wrote:
> 
>> Karol has uncovered a bug introduced by LUCENE-4810 
>> <https://issues.apache.org/jira/browse/LUCENE-4810>, included in Lucene/Solr 
>> 4.3.0.
>> 
>> The problem is an interaction between the Morfologik stemmer, which can 
>> produce multiple stems per input term, all but the first having a position 
>> increment of zero, and EdgeNGramTokenFilter, which only outputs ngrams for 
>> input terms that are at least as long as the minimum configured length, and 
>> passes through unchanged the position increment for the first ngram output 
>> for any given input term.
>> 
>> So what happens in Karol's case is that "T." has the period stripped by 
>> StandardTokenizer, then is stemmed by Morfologik to produce terms "to", 
>> "tom" and "tona".  The first term "to" has a position increment of 1, but is 
>> not output by EdgeNGramTokenFilter, because its length is below the 
>> configured minimum of 3.  The second term "tom" is given a position 
>> increment of 0 by Morfologik, and meets EdgeNGramTokenFilter's minimum 
>> length, so gets output, and since it's the first output term for the input 
>> term "tom", the input position increment is left as-is in the output term: 
>> 0.  That's how the first output term gets a position increment of 0.
>> 
>> Before LUCENE-4810 was committed and included in Lucene/Solr 4.3.0, 
>> EdgeNGramTokenFilter indiscriminately set all output terms' position 
>> increments to 1, so that explains why this behavior didn't occur with 
>> previously released versions.
>> 
>> I think the fix is a check in EdgeNGramTokenFilter, when outputting the 
>> first term, that the position increment is greater than 0; if it's not, 
>> then it should be set to 1.
>> 
>> Does anybody know if this could also be an issue for other filters?
>> 
>> I'll work on a patch for EdgeNGramTokenFilter.
>> 
>> Steve
>> 
>> On Apr 21, 2013, at 9:21 AM, Karol Sikora <karol.sik...@laboratorium.ee> 
>> wrote:
>> 
>>> Hi,
>>> 
>>> I extracted a minimal failing example; the Solr configs (schema, 
>>> solrconfig.xml) and data are in the attached archive.
>>> I am trying to import these simple documents:
>>> [
>>>    {
>>>        "publisher": [
>>>            "T. Gl\u00fccksberg"
>>>        ],
>>>        "uid": "1000881"
>>>    },
>>>    {
>>>        "publisher": [
>>>            "Ala a kota"
>>>        ],
>>>        "uid": "1000894"
>>>    }
>>> ]
>>> The first fails on the copyField destination publisher_hl with an 
>>> exception (trace: https://gist.github.com/anonymous/5429558); the second 
>>> is added without any problems.
>>> schema.xml is here: https://gist.github.com/anonymous/5429562
>>> 
>>> Anyone trying to reproduce this behaviour should remember to copy the libs 
>>> related to the morfologik and icu filters.
>>> 
>>> This extracted example works fine with solr 4.0 - 4.2.1.
>>> 
>>> Regards,
>>> Karol
>>> 
>>> 
>>> 
>>> On 21.04.2013 at 09:03, Simon Willnauer wrote:
>>>> hey karol,
>>>> 
>>>> can you reproduce this behaviour in a small test-case (curl command or
>>>> something like this) that we can run ourselves?
>>>> 
>>>> @solr guys any idea what this could be?
>>>> 
>>>> simon
>>>> 
>>>> On Sun, Apr 21, 2013 at 1:52 AM, Karol Sikora
>>>> <karol.sik...@laboratorium.ee> wrote:
>>>> 
>>>>> Hi all,
>>>>> 
>>>>> I have a problem with Solr 4.3 RC2 on the test data for a search
>>>>> application which I'm developing.
>>>>> A lot of imported records fail with the exception
>>>>> "java.lang.IllegalArgumentException: first position increment must be > 0
>>>>> (got 0)". On versions from early 4.0 to 4.2.1 all documents were added
>>>>> successfully, so I think that something is broken in the new release.
>>>>> I'll try to examine tomorrow what is broken.
>>>>> 
>>>>> 
>>>>> Regards,
>>>>> Karol
>>>>> 
>>>>> On 20.04.2013 at 21:07, Andi Vajda wrote:
>>>>> 
>>>>> 
>>>>>> On Sat, 20 Apr 2013, Simon Willnauer wrote:
>>>>>> 
>>>>>> 
>>>>>>> Here is the RC:
>>>>>>> 
>>>>>>> 
>>>>>>> http://people.apache.org/~simonw/staging_area/lucene-solr-4.3.0-RC2-rev1470054
>>>>>>> 
>>>>>>> 
>>>>>>> happy voting...
>>>>>>> 
>>>>>>> here is my +1
>>>>>>> 
>>>>>> 
>>>>>> PyLucene 4.3 builds and passes its tests.
>>>>>> 
>>>>>> +1 !
>>>>>> 
>>>>>> Andi..
>>>>>> 
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>>>>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>> --
>>>>> Karol Sikora
>>>>> +48 781 493 788
>>>>> 
>>>>> Laboratorium EE
>>>>> ul. Mokotowska 46A/23 | 00-543 Warszawa |
>>>>> 
>>>>> www.laboratorium.ee | www.laboratorium.ee/facebook
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>> 
>>> -- 
>>> 
>>> Karol Sikora
>>> IT Manager, CBN Project - Interfejs 2.0
>>> +48 781 493 788
>>> 
>>> Laboratorium EE
>>> ul. Mokotowska 46A/23 | 00-543 Warszawa |
>>> 
>>> www.laboratorium.ee | www.laboratorium.ee/facebook
>> 
>> 
>> 
> 
> --
> Walter Underwood
> wun...@wunderwood.org
> 
> 
> 


