Mark, I agree it could use some more tests in the future, like many things :)
On Thu, Aug 6, 2009 at 11:52 AM, Mark Miller<markrmil...@gmail.com> wrote:
> Test passes with this patch - thanks a lot Robert! I was going to ask you
> to create a solr issue, but I see you already have, thanks!
>
> No need to create a test I think - put in the new Lucene jars and it fails,
> so likely that's good enough. Though it is spooky that the test passed
> without the new jars, so perhaps a more targeted test is warranted after
> all.
>
> - Mark
>
> Robert Muir wrote:
>>
>> Index: src/java/org/apache/solr/analysis/CapitalizationFilterFactory.java
>> ===================================================================
>> --- src/java/org/apache/solr/analysis/CapitalizationFilterFactory.java (revision 778975)
>> +++ src/java/org/apache/solr/analysis/CapitalizationFilterFactory.java (working copy)
>> @@ -209,7 +209,7 @@
>>        //make a backup in case we exceed the word count
>>        System.arraycopy(termBuffer, 0, backup, 0, termBufferLength);
>>      }
>> -    if (termBuffer.length < factory.maxTokenLength) {
>> +    if (termBufferLength < factory.maxTokenLength) {
>>        int wordCount = 0;
>>
>>        int lastWordStart = 0;
>> @@ -226,8 +226,8 @@
>>      }
>>
>>      // process the last word
>> -    if (lastWordStart < termBuffer.length) {
>> -      factory.processWord(termBuffer, lastWordStart, termBuffer.length - lastWordStart, wordCount++);
>> +    if (lastWordStart < termBufferLength) {
>> +      factory.processWord(termBuffer, lastWordStart, termBufferLength - lastWordStart, wordCount++);
>>      }
>>
>>      if (wordCount > factory.maxWordCount) {
>>
>>
>> On Thu, Aug 6, 2009 at 10:58 AM, Robert Muir<rcm...@gmail.com> wrote:
>>>
>>> Mark, I looked at this and think it might be unrelated to tokenstreams.
>>>
>>> I think the length argument being provided to processWord(char[]
>>> buffer, int offset, int length, int wordCount) in that filter might be
>>> incorrectly calculated.
>>> This is the method that checks the keep list.
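[Editor's note] The one-character fix in the patch above swaps the array's capacity (termBuffer.length) for the token's actual length (termBufferLength). A minimal stand-alone sketch of why that distinction matters (hypothetical demo class, not the actual Solr filter code):

```java
// Sketch only: hypothetical stand-alone demo, not the actual Solr filter code.
// A reusable term buffer is often larger than the current token, so reading
// up to termBuffer.length (the array capacity) picks up stale characters
// ("trailing trash") left over from a previous, longer token.
public class TermBufferDemo {
    public static void main(String[] args) {
        // The buffer previously held the 8-char token "elephant"...
        char[] termBuffer = "elephant".toCharArray();
        // ...and is now reused for the 3-char token "the".
        "the".getChars(0, 3, termBuffer, 0);
        int termBufferLength = 3; // the token's real length

        // Bug: bounding by the array capacity yields "thephant"
        System.out.println(new String(termBuffer, 0, termBuffer.length));
        // Fix: bounding by the token length yields "the"
        System.out.println(new String(termBuffer, 0, termBufferLength));
    }
}
```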
>>>
>>> (There is trailing trash on the end of tokens, even with the previous
>>> version of lucene in Solr).
>>> It just so happens the tokens with trailing trash were ones that were
>>> keep words in the previous version, so the test didn't fail.
>>>
>>> Different tokens have trailing trash in the current version
>>> (specifically some of the "the" tokens), so it's failing now.
>>>
>>>
>>> On Thu, Aug 6, 2009 at 10:14 AM, Mark Miller<markrmil...@gmail.com> wrote:
>>>>
>>>> I think there is an issue here, but I didn't follow the TokenStream
>>>> improvements very closely.
>>>>
>>>> In Solr, CapitalizationFilterFactory has a CharArray set that it loads up
>>>> with keep words - it then checks (with the old TokenStream API) each token
>>>> (char array) to see if it should keep it. I think because of the cloning
>>>> going on in next, this breaks and you can't match anything in the keep set.
>>>> Does that make sense?
>>>>
>>>> --
>>>> - Mark
>>>>
>>>> http://www.lucidimagination.com
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
>>>> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>>>>
>>>
>>> --
>>> Robert Muir
>>> rcm...@gmail.com
>>>

--
Robert Muir
rcm...@gmail.com
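[Editor's note] Robert's diagnosis above - a stale tail on the reused buffer causing keep words to miss the keep set - can be sketched as follows. The real filter uses Lucene's CharArraySet, but a plain Set<String> (hypothetical demo, assumed names) shows the same effect:

```java
import java.util.Set;

// Hypothetical sketch of the keep-list check discussed in the thread.
// A miscomputed length argument turns the token "the" into "thephant",
// so it no longer matches any entry in the keep set.
public class KeepListDemo {
    // Stand-in for the filter's keep-list lookup on (buffer, offset, length)
    static boolean inKeepList(char[] buffer, int offset, int length, Set<String> keep) {
        return keep.contains(new String(buffer, offset, length));
    }

    public static void main(String[] args) {
        Set<String> keep = Set.of("the");
        // Reused buffer: the valid token is "the"; "phant" is stale trash.
        char[] buffer = "thephant".toCharArray();

        // Correct token length: the keep word matches
        System.out.println(inKeepList(buffer, 0, 3, keep));
        // Array capacity used as length: the lookup misses
        System.out.println(inKeepList(buffer, 0, buffer.length, keep));
    }
}
```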