That method works too. However, if you put your token filter in the
org.apache.lucene.analysis package and replace:
    if ( !word.equals( token.termText() ) ) {
        return new Token( word, token.startOffset(),
                          token.endOffset(), token.type() );
    }
with
    token.termText = word;
    return token;
your code will run more efficiently, since it won't create a batch of new
Token objects that later have to be garbage collected. It would be nice if termText
were "protected" rather than package scoped.
Eric
-----Original Message-----
From: stephane vaucher [mailto:[EMAIL PROTECTED]]
Sent: Thursday, December 12, 2002 12:12 PM
To: Lucene Users List
Subject: Re: Accentuated characters
There is no problem with package scopes:
This is how I remove trailing 's' chars:
    String word = token.termText();
    if(word.endsWith("s")){
        word = word.substring(0, word.length() - 1);
    }
    if ( !word.equals( token.termText() ) ) {
        return new Token( word, token.startOffset(),
                          token.endOffset(), token.type() );
    }
I'll take a look at how the Collator works to see if I can make a
generic (maybe locale specific) string normaliser so I could specify the
level of differences.
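One possible shape for such a normaliser, as a sketch only. It assumes a JDK that ships java.text.Normalizer (Java 6 or later, so newer than the JDKs in use for this thread); the class and method names (AccentSketch, stripAccents, sameIgnoringAccents) are made up for illustration. The Collator half shows how a strength setting controls the "level of differences" for comparisons, while the Normalizer half actually produces the unaccented string.

```java
import java.text.Collator;
import java.text.Normalizer;
import java.util.Locale;

final class AccentSketch {
    // Decompose each accented character into base letter + combining marks,
    // then drop the marks (Unicode category M).
    static String stripAccents(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    // Locale-specific comparison that ignores accents (and case):
    // at PRIMARY strength only base-letter differences count.
    static boolean sameIgnoringAccents(String a, String b, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(Collator.PRIMARY);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        // \u00e9 is 'e' with an acute accent, escaped to avoid encoding trouble.
        String word = "\u00e9t\u00e9";
        System.out.println(stripAccents(word));
        System.out.println(sameIgnoringAccents(word, "ete", Locale.FRENCH));
    }
}
```

A filter could call stripAccents on the term text of each token; the Collator only compares strings and does not hand back the converted text.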
Stephane
Eric Isakson wrote:
>If you really want to make your own TokenFilter, have a look at
>org.apache.lucene.analysis.LowerCaseFilter.next()
>
>it does:
>    public final Token next() throws java.io.IOException {
>        Token t = input.next();
>
>        if (t == null)
>            return null;
>
>        t.termText = t.termText.toLowerCase();
>
>        return t;
>    }
>
>The termText member of the Token class is package scoped, so you will have to
>implement your filter in the org.apache.lucene.analysis package. No worries about
>encoding as the termText is already a Java (Unicode) string. You will just have to
>provide the mechanism to get the accented characters converted to their non-accented
>equivalents. java.text.Collator has some magic that does this for string comparisons
>but I couldn't find any public methods that give you access to convert a string to
>its non-accented equivalent.
>
>Eric
>--
>Eric D. Isakson SAS Institute Inc.
>Application Developer SAS Campus Drive
>XML Technologies Cary, NC 27513
>(919) 531-3639 http://www.sas.com
>
>
>
>-----Original Message-----
>From: stephane vaucher [mailto:[EMAIL PROTECTED]]
>Sent: Tuesday, December 10, 2002 2:58 PM
>To: [EMAIL PROTECTED]
>Subject: Accentuated characters
>
>
>Hello everyone,
>
>I wish to implement a TokenFilter that will remove accentuated
>characters so for example 'é' will become 'e'. As I would rather not
>reinvent the wheel, I've tried to find something on the web and on the
>mailing lists. I saw a mention of a contrib that could do this (see
>http://www.mail-archive.com/lucene-user%40jakarta.apache.org/msg02146.html),
>but I don't see anything applicable.
>
>Has anyone done this yet, if so I would much appreciate some pointers
>(or code), otherwise, I'll be happy to contribute whatever I produce
>(but it might be very simple since I'll only need to deal with French).
>
>Cheers,
>Stephane
>
>
>--
>To unsubscribe, e-mail: <mailto:[EMAIL PROTECTED]>
>For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>