I have seen your mail, but this bug should not be related to the new Token API; it should occur with the old API, too.
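Just to make it clear why this is independent of the API: the physical
termBuffer is routinely larger than the term itself (the buffer is reused and
only ever grows), so any length calculation has to use termLength(), never the
array length of termBuffer(). A minimal, purely hypothetical sketch (made-up
class name, not the actual Solr code) of the kind of length mistake Robert
describes, written against the old next(Token) API:

import java.io.IOException;

import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// Illustration only -- not the real Solr CapitalizationFilter.
public final class KeepWordCheckFilter extends TokenFilter {
  private final CharArraySet keep;

  public KeepWordCheckFilter(TokenStream input, CharArraySet keep) {
    super(input);
    this.keep = keep;
  }

  @Override
  public Token next(final Token reusableToken) throws IOException {
    final Token token = input.next(reusableToken);
    if (token == null) {
      return null;
    }
    final char[] buffer = token.termBuffer();
    // Buggy: buffer.length is the buffer's capacity, not the term's length,
    // so the lookup fails as soon as the reused buffer is oversized:
    //   boolean keepIt = keep.contains(buffer, 0, buffer.length);
    // Correct: use the token's logical length.
    final boolean keepIt = keep.contains(buffer, 0, token.termLength());
    if (!keepIt) {
      // ... change the buffer contents in place, e.g. the capitalization ...
    }
    return token;
  }
}

The same mistake fails in exactly the same way with the new API, because
termBuffer()/termLength() on TermAttribute behave just like they do on Token.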
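And as I wrote in my earlier mail quoted below, such a filter is just as easy
with incrementToken() and no cloning at all. Again only a rough sketch with a
made-up class name and a simplified capitalization step, not the real
CapitalizationFilterFactory logic:

import java.io.IOException;

import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

// Illustration only: a keep-word-aware capitalization filter on the new API.
public final class SimpleCapitalizationFilter extends TokenFilter {
  private final CharArraySet keep;
  private final TermAttribute termAtt;

  public SimpleCapitalizationFilter(TokenStream input, CharArraySet keep) {
    super(input);
    this.keep = keep;
    this.termAtt = addAttribute(TermAttribute.class);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    final char[] buffer = termAtt.termBuffer();
    final int length = termAtt.termLength();
    // Keep words pass through unchanged; note termLength(), never buffer.length.
    if (length == 0 || keep.contains(buffer, 0, length)) {
      return true;
    }
    // Modify the term in place -- no Token cloning, no new char[] allocation.
    buffer[0] = Character.toUpperCase(buffer[0]);
    return true;
  }
}

With the keep check based on termLength(), it does not matter whether the
upstream stream reuses, clones, or caches its Tokens.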
I did not look very closely into the implementations; I only checked who
changes what, and in which way. As far as I can see, there is only one Token
instance whose termBuffer is changed. That is no problem at all for the new
API. It would even work with forceful cloning of Tokens inside
CachingTokenFilter.

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: Robert Muir [mailto:rcm...@gmail.com]
> Sent: Thursday, August 06, 2009 5:22 PM
> To: java-dev@lucene.apache.org
> Subject: Re: Issue with Solr TokenFilter and the new TokenStream API
>
> Uwe, look at the patch I pasted in haste (I have a delivery guy here,
> sorry).
>
> The filter had a bug all along: it was using termBuffer.length for
> some length calculations.
>
> On Thu, Aug 6, 2009 at 11:17 AM, Uwe Schindler <u...@thetaphi.de> wrote:
> > I looked into the code of this filter. It is very simple and should work
> > out of the box. There is no cloning done. When the indexer calls
> > incrementToken(), the delegation to next(Token) does not clone at all. It
> > just uses the encapsulated Token instance (inside the AttributeImpl
> > TokenWrapper) as the reusable token, calls next(reusable), and then
> > replaces the encapsulated instance with the return value of next() -- so
> > there is no cloning. As you do not change the Token instance at all and
> > return the reusable token, everything happens on one Token/Attribute
> > instance.
> >
> > In my opinion, this is the simplest TokenFilter that could occur; it just
> > changes the contents of the buffer. By the way, this one could easily be
> > rewritten to use incrementToken() without cloning -- just use
> > termAtt.setTermBuffer() and so on.
> >
> > Where do you see a problem? Does it simply not work, or do you think
> > there could be an issue?
> >
> > -----
> > Uwe Schindler
> > H.-H.-Meier-Allee 63, D-28213 Bremen
> > http://www.thetaphi.de
> > eMail: u...@thetaphi.de
> >
> >> -----Original Message-----
> >> From: Mark Miller [mailto:markrmil...@gmail.com]
> >> Sent: Thursday, August 06, 2009 4:14 PM
> >> To: java-dev@lucene.apache.org
> >> Subject: Issue with Solr TokenFilter and the new TokenStream API
> >>
> >> I think there is an issue here, but I didn't follow the TokenStream
> >> improvements very closely.
> >>
> >> In Solr, CapitalizationFilterFactory has a CharArraySet that it loads
> >> up with keep words; it then checks (with the old TokenStream API) each
> >> token (char array) to see if it should keep it. I think that because of
> >> the cloning going on in next(), this breaks and you can't match anything
> >> in the keep set. Does that make sense?
> >>
> >> --
> >> - Mark
> >>
> >> http://www.lucidimagination.com
>
> --
> Robert Muir
> rcm...@gmail.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org