Mark, I looked at this and think it might be unrelated to TokenStreams.

I think the length argument passed to processWord(char[] buffer, int
offset, int length, int wordCount) in that filter may be calculated
incorrectly.
That is the method that checks the keep list.

(There is trailing trash at the end of tokens even with the previous
version of Lucene in Solr.)
It just so happens that the tokens with trailing trash were keep words
in the previous version, so the test didn't fail.

Different tokens have trailing trash in the current version
(specifically some of the "the" tokens), so it's failing now.
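To illustrate the kind of bug I mean, here is a minimal, hypothetical sketch (not the actual Lucene/Solr source): token filters reuse a char[] term buffer, so characters from a previous, longer token remain in the buffer past the current token's length. Any keep-list lookup keyed on (buffer, offset, length), like CharArraySet.contains(char[], int, int), only works if the length is right; a stale or miscalculated length reads the leftover "trash" and the match fails. The class and method names below are made up for the example.

```java
import java.util.Set;

public class KeepWordSketch {
    // Mimics a keep-list lookup keyed on (buffer, offset, length),
    // analogous to CharArraySet.contains(char[], int, int).
    public static boolean keep(Set<String> keepWords, char[] buf,
                               int offset, int length) {
        return keepWords.contains(new String(buf, offset, length));
    }

    public static void main(String[] args) {
        Set<String> keepWords = Set.of("the");

        // A reused term buffer, as a token filter would hold.
        char[] buffer = new char[16];

        // First token: "quickly" (7 chars) written into the buffer.
        "quickly".getChars(0, 7, buffer, 0);

        // Next token: "the" (3 chars) overwrites only the first 3 slots,
        // leaving "ckly" behind -- the buffer now reads "theckly...".
        "the".getChars(0, 3, buffer, 0);
        int length = 3;

        // Correct: bound the read by the current token's length.
        System.out.println(keep(keepWords, buffer, 0, length));

        // Buggy: a stale length of 7 reads the leftover "ckly" suffix,
        // so "theckly" is looked up and the keep word is not matched.
        System.out.println(keep(keepWords, buffer, 0, 7));
    }
}
```

Prints true then false, which is exactly the symptom: the token is really "the", but with the wrong length the keep-list check sees "theckly" and misses.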


On Thu, Aug 6, 2009 at 10:14 AM, Mark Miller<markrmil...@gmail.com> wrote:
> I think there is an issue here, but I didn't follow the TokenStream
> improvements very closely.
>
> In Solr, CapitalizationFilterFactory has a CharArray set that it loads up
> with keep words - it then checks (with the old TokenStream API) each token
> (char array) to see if it should keep it. I think because of the cloning
> going on in next, this breaks and you can't match anything in the keep set.
> Does that make sense?
>
> --
> - Mark
>
> http://www.lucidimagination.com
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>
>



-- 
Robert Muir
rcm...@gmail.com

