Moved this conversation to the dev list...

Bah, you are right, wasn't thinking straight about which class I was subclassing.

I suppose org.apache.lucene.analysis.LowerCaseFilter and PorterStemFilter modify the Token termText property as an optimization. Their next() method will be called once for each token for each filter in the chain of filters during analysis. Creating a new Token for every modification could create a _lot_ of objects to be garbage collected.

Should Token be immutable, with these methods in LowerCaseFilter and PorterStemFilter modified to create new Tokens and Token modified to make its members private?

Should Token expose a setTermText method that TokenFilters can use? like...

    public final void setTermText(final String _termText) {
        termText = _termText;
    }

Would a Java compiler be able to do any optimization on a method like that so we don't have to try and hand optimize that code by exposing the termText property? (I don't know much about how the compilers optimize things)

Should it just be left as it is and people be aware that Token objects are not immutable?

Eric

-----Original Message-----
From: stephane vaucher [mailto:[EMAIL PROTECTED]]
Sent: Thursday, December 12, 2002 2:41 PM
To: Lucene Users List
Subject: Re: Accentuated characters

Fair enough, but a "protected" would only allow subclasses to access it. Personally, I would rather not have to use a subclass to implement my feature. I think the logic behind this is that it's an intrinsic property of a Term, thus it should be immutable, as any modifications to this object might have important side-effects.

Stephane

Eric Isakson wrote:

>That method works too. Putting your token filter in the org.apache.lucene.analysis
>package and replacing:
>
>    if ( !word.equals( token.termText() ) ) {
>        return new Token( word, token.startOffset(),
>                          token.endOffset(), token.type() );
>    }
>
>with
>
>    token.termText = word;
>    return token;
>
>will make your code operate more efficiently as you won't be creating a bunch of new
>Token objects that will have to be garbage collected.
>Would be nice if the termText
>was "protected" rather than package scoped.
>
>Eric
>
>-----Original Message-----
>From: stephane vaucher [mailto:[EMAIL PROTECTED]]
>Sent: Thursday, December 12, 2002 12:12 PM
>To: Lucene Users List
>Subject: Re: Accentuated characters
>
>
>There is no problem with package scopes:
>
>This is how I remove trailing 's' chars:
>
>    String word = token.termText();
>
>    if(word.endsWith("s")){
>        word = word.substring(0, word.length() - 1);
>    }
>
>    if ( !word.equals( token.termText() ) ) {
>        return new Token( word, token.startOffset(),
>                          token.endOffset(), token.type() );
>    }
>
>I'll take a look at how the Collator works to see if I can make a
>generic (maybe locale specific) string normaliser so I could specify the
>level of differences.
>
>Stephane
>
>Eric Isakson wrote:
>
>>If you really want to make your own TokenFilter, have a look at
>>org.apache.lucene.analysis.LowerCaseFilter.next()
>>
>>it does:
>>    public final Token next() throws java.io.IOException {
>>        Token t = input.next();
>>
>>        if (t == null)
>>            return null;
>>
>>        t.termText = t.termText.toLowerCase();
>>
>>        return t;
>>    }
>>
>>The termText member of the Token class is package scoped, so you will have to
>>implement your filter in the org.apache.lucene.analysis package. No worries about
>>encoding as the termText is already a java (unicode) string. You will just have to
>>provide the mechanism to get the accented characters converted to their non-accented
>>equivalents. java.text.Collator has some magic that does this for string comparisons
>>but I couldn't find any public methods that give you access to convert a string to
>>its non-accented equivalent.
>>
>>Eric
>>--
>>Eric D. Isakson             SAS Institute Inc.
>>Application Developer       SAS Campus Drive
>>XML Technologies            Cary, NC 27513
>>(919) 531-3639              http://www.sas.com
>>
>>
>>
>>-----Original Message-----
>>From: stephane vaucher [mailto:[EMAIL PROTECTED]]
>>Sent: Tuesday, December 10, 2002 2:58 PM
>>To: [EMAIL PROTECTED]
>>Subject: Accentuated characters
>>
>>
>>Hello everyone,
>>
>>I wish to implement a TokenFilter that will remove accented
>>characters so for example 'é' will become 'e'. As I would rather not
>>reinvent the wheel, I've tried to find something on the web and on the
>>mailing lists. I saw a mention of a contrib that could do this (see
>>http://www.mail-archive.com/lucene-user%40jakarta.apache.org/msg02146.html),
>>but I don't see anything applicable.
>>
>>Has anyone done this yet? If so, I would much appreciate some pointers
>>(or code); otherwise, I'll be happy to contribute whatever I produce
>>(but it might be very simple since I'll only need to deal with French).
>>
>>Cheers,
>>Stephane
>>
>>
>>--
>>To unsubscribe, e-mail: <mailto:[EMAIL PROTECTED]>
>>For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>
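
For the original question in this thread (turning 'é' into 'e'), here is a minimal sketch of the string conversion itself. It assumes java.text.Normalizer, which only arrived in Java 6 and so was not an option when this thread was written. The class name AccentStripper and the method stripAccents are made up for illustration and are not part of Lucene. The idea is to decompose each accented character into a base letter plus combining marks, then drop the marks.

```java
import java.text.Normalizer;

// Hypothetical helper for illustration only; not a Lucene class.
final class AccentStripper {

    private AccentStripper() {
    }

    // Decompose accented characters into base letter + combining marks (NFD),
    // then delete the combining marks so only the base letters remain.
    static String stripAccents(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    public static void main(String[] args) {
        // "\u00e9t\u00e9" is "été"; "gar\u00e7on" is "garçon".
        System.out.println(stripAccents("\u00e9t\u00e9")); // prints "ete"
        System.out.println(stripAccents("gar\u00e7on"));   // prints "garcon"
    }
}
```

A TokenFilter could apply something like stripAccents to the term text inside next(), either by assigning the result back to the token or by returning a new Token, which is the trade-off discussed in the thread. Letters that have no canonical decomposition (for example 'ø' or 'œ') pass through unchanged, so this probably covers French accents but is not a general transliteration.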