Mark, I agree it could use some more tests in the future, like many things :)
On Thu, Aug 6, 2009 at 11:52 AM, Mark Miller<markrmil...@gmail.com> wrote:
> Test passes with this patch - thanks a lot Robert! I was going to ask you
> to create a solr issue, but I see you already have, thanks!
>
> No need to create a test I think - put in the new Lucene jars and it fails,
> so likely that's good enough. Though it is spooky that the test passed
> without the new jars, so perhaps a more targeted test is warranted after
> all.
>
> - Mark
>
> Robert Muir wrote:
>>
>> Index: src/java/org/apache/solr/analysis/CapitalizationFilterFactory.java
>> ===================================================================
>> --- src/java/org/apache/solr/analysis/CapitalizationFilterFactory.java (revision 778975)
>> +++ src/java/org/apache/solr/analysis/CapitalizationFilterFactory.java (working copy)
>> @@ -209,7 +209,7 @@
>>        //make a backup in case we exceed the word count
>>        System.arraycopy(termBuffer, 0, backup, 0, termBufferLength);
>>      }
>> -    if (termBuffer.length < factory.maxTokenLength) {
>> +    if (termBufferLength < factory.maxTokenLength) {
>>        int wordCount = 0;
>>
>>        int lastWordStart = 0;
>> @@ -226,8 +226,8 @@
>>      }
>>
>>      // process the last word
>> -    if (lastWordStart < termBuffer.length) {
>> -      factory.processWord(termBuffer, lastWordStart, termBuffer.length - lastWordStart, wordCount++);
>> +    if (lastWordStart < termBufferLength) {
>> +      factory.processWord(termBuffer, lastWordStart, termBufferLength - lastWordStart, wordCount++);
>>      }
>>
>>      if (wordCount > factory.maxWordCount) {
>>
>>
>> On Thu, Aug 6, 2009 at 10:58 AM, Robert Muir<rcm...@gmail.com> wrote:
>>>
>>> Mark, I looked at this and think it might be unrelated to tokenstreams.
>>>
>>> I think the length argument being provided to processWord(char[]
>>> buffer, int offset, int length, int wordCount) in that filter might be
>>> incorrectly calculated.
>>> This is the method that checks the keep list.
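[Editor's note] The one-character fix in the patch above swaps the array's capacity (termBuffer.length) for the token's actual length (termBufferLength). A minimal stand-alone sketch of why that distinction matters (hypothetical demo class, not the actual Solr filter code):

```java
// Sketch only: hypothetical stand-alone demo, not the actual Solr filter code.
// A reusable term buffer is often larger than the current token, so reading
// up to termBuffer.length (the array capacity) picks up stale characters
// ("trailing trash") left over from a previous, longer token.
public class TermBufferDemo {
    public static void main(String[] args) {
        // The buffer previously held the 8-char token "elephant"...
        char[] termBuffer = "elephant".toCharArray();
        // ...and is now reused for the 3-char token "the".
        "the".getChars(0, 3, termBuffer, 0);
        int termBufferLength = 3; // the token's real length

        // Bug: bounding by the array capacity yields "thephant"
        System.out.println(new String(termBuffer, 0, termBuffer.length));
        // Fix: bounding by the token length yields "the"
        System.out.println(new String(termBuffer, 0, termBufferLength));
    }
}
```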
>>>
>>> (There is trailing trash on the end of tokens, even with the previous
>>> version of lucene in Solr).
>>> It just so happens the tokens with trailing trash were ones that were
>>> keep words in the previous version, so the test didn't fail.
>>>
>>> Different tokens have trailing trash in the current version
>>> (specifically some of the "the" tokens), so it's failing now.
>>>
>>>
>>> On Thu, Aug 6, 2009 at 10:14 AM, Mark Miller<markrmil...@gmail.com> wrote:
>>>>
>>>> I think there is an issue here, but I didn't follow the TokenStream
>>>> improvements very closely.
>>>>
>>>> In Solr, CapitalizationFilterFactory has a CharArray set that it loads up
>>>> with keep words - it then checks (with the old TokenStream API) each token
>>>> (char array) to see if it should keep it. I think because of the cloning
>>>> going on in next, this breaks and you can't match anything in the keep set.
>>>> Does that make sense?
>>>>
>>>> --
>>>> - Mark
>>>>
>>>> http://www.lucidimagination.com
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
>>>> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>>>>
>>>
>>> --
>>> Robert Muir
>>> rcm...@gmail.com
>>>

--
Robert Muir
rcm...@gmail.com
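[Editor's note] Robert's diagnosis above - a stale tail on the reused buffer causing keep words to miss the keep set - can be sketched as follows. The real filter uses Lucene's CharArraySet, but a plain Set<String> (hypothetical demo, assumed names) shows the same effect:

```java
import java.util.Set;

// Hypothetical sketch of the keep-list check discussed in the thread.
// A miscomputed length argument turns the token "the" into "thephant",
// so it no longer matches any entry in the keep set.
public class KeepListDemo {
    // Stand-in for the filter's keep-list lookup on (buffer, offset, length)
    static boolean inKeepList(char[] buffer, int offset, int length, Set<String> keep) {
        return keep.contains(new String(buffer, offset, length));
    }

    public static void main(String[] args) {
        Set<String> keep = Set.of("the");
        // Reused buffer: the valid token is "the"; "phant" is stale trash.
        char[] buffer = "thephant".toCharArray();

        // Correct token length: the keep word matches
        System.out.println(inKeepList(buffer, 0, 3, keep));
        // Array capacity used as length: the lookup misses
        System.out.println(inKeepList(buffer, 0, buffer.length, keep));
    }
}
```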