Thanks, we are always here to help :-)

> Test passes with this patch - thanks a lot Robert! I was going to ask
> you to create a Solr issue, but I see you already have, thanks!
>
> No need to create a test I think - put in the new Lucene jars and it
> fails, so likely that's good enough. Though it is spooky that the test
> passed without the new jars
See LUCENE-1762; I think this problem comes from there. I would strongly suggest creating a testcase with better lists of terms of different lengths and so on.

> so perhaps a more targeted test is
> warranted after all.

More tests are always better :-) When I create some tests locally to test something (even when they are strange), I often simply add them to Lucene's testcases.

> - Mark
>
> Robert Muir wrote:
> > Index: src/java/org/apache/solr/analysis/CapitalizationFilterFactory.java
> > ===================================================================
> > --- src/java/org/apache/solr/analysis/CapitalizationFilterFactory.java (revision 778975)
> > +++ src/java/org/apache/solr/analysis/CapitalizationFilterFactory.java (working copy)
> > @@ -209,7 +209,7 @@
> >        //make a backup in case we exceed the word count
> >        System.arraycopy(termBuffer, 0, backup, 0, termBufferLength);
> >      }
> > -    if (termBuffer.length < factory.maxTokenLength) {
> > +    if (termBufferLength < factory.maxTokenLength) {
> >        int wordCount = 0;
> >
> >        int lastWordStart = 0;
> > @@ -226,8 +226,8 @@
> >      }
> >
> >      // process the last word
> > -    if (lastWordStart < termBuffer.length) {
> > -      factory.processWord(termBuffer, lastWordStart, termBuffer.length - lastWordStart, wordCount++);
> > +    if (lastWordStart < termBufferLength) {
> > +      factory.processWord(termBuffer, lastWordStart, termBufferLength - lastWordStart, wordCount++);
> >      }
> >
> >      if (wordCount > factory.maxWordCount) {
> >
> >
> > On Thu, Aug 6, 2009 at 10:58 AM, Robert Muir <rcm...@gmail.com> wrote:
> >
> >> Mark, I looked at this and think it might be unrelated to tokenstreams.
> >>
> >> I think the length argument being provided to processWord(char[]
> >> buffer, int offset, int length, int wordCount) in that filter might be
> >> incorrectly calculated.
> >> This is the method that checks the keep list.
> >>
> >> (There is trailing trash on the end of tokens, even with the previous
> >> version of Lucene in Solr.)
> >> It just so happens the tokens with trailing trash were ones that were
> >> keep words in the previous version, so the test didn't fail.
> >>
> >> Different tokens have trailing trash in the current version
> >> (specifically some of the "the" tokens), so it's failing now.
> >>
> >>
> >> On Thu, Aug 6, 2009 at 10:14 AM, Mark Miller <markrmil...@gmail.com> wrote:
> >>
> >>> I think there is an issue here, but I didn't follow the TokenStream
> >>> improvements very closely.
> >>>
> >>> In Solr, CapitalizationFilterFactory has a CharArraySet that it loads up
> >>> with keep words - it then checks (with the old TokenStream API) each token
> >>> (char array) to see if it should keep it. I think because of the cloning
> >>> going on in next(), this breaks and you can't match anything in the keep set.
> >>> Does that make sense?
> >>>
> >>> --
> >>> - Mark
> >>>
> >>> http://www.lucidimagination.com
> >>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> >>> For additional commands, e-mail: java-dev-h...@lucene.apache.org
> >>>
> >>
> >> --
> >> Robert Muir
> >> rcm...@gmail.com
>
> --
> - Mark
>
> http://www.lucidimagination.com
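[Editor's note] The bug the patch above fixes can be shown in a few lines of standalone Java. This is a sketch, not Solr's actual code: a plain Set<String> stands in for Solr's CharArraySet keep list, and the buffer handling is simplified. The point is that a token's char[] buffer is usually larger than the token it currently holds, so reading termBuffer.length (the array's capacity) instead of termBufferLength (the token's real length) picks up stale characters - the "trailing trash" Robert describes - and defeats the keep-list match.

```java
import java.util.Set;

public class TrailingTrashSketch {
    public static void main(String[] args) {
        // Stand-in for Solr's CharArraySet of keep words.
        Set<String> keepList = Set.of("the");

        char[] termBuffer = new char[16];
        // A previous, longer token leaves its characters in the buffer...
        "therefore".getChars(0, 9, termBuffer, 0);
        // ...and the current token "the" only overwrites the first 3 chars;
        // the rest of the buffer still holds the old token's tail.
        "the".getChars(0, 3, termBuffer, 0);
        int termBufferLength = 3;

        // Buggy: uses the array's capacity, reading past the token
        // into the stale tail (plus the buffer's unused '\0' chars).
        String buggy = new String(termBuffer, 0, termBuffer.length);
        // Fixed: respects the token's actual length.
        String fixed = new String(termBuffer, 0, termBufferLength);

        System.out.println(keepList.contains(buggy)); // false - stale tail defeats the match
        System.out.println(keepList.contains(fixed)); // true  - "the" matches the keep list
    }
}
```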