RE: Accentuated characters

Eric Isakson Thu, 12 Dec 2002 10:32:29 -0800

That method works too. Putting your token filter in the org.apache.lucene.analysis 
package and replacing:


            if ( !word.equals( token.termText() ) ) {
                return new Token( word, token.startOffset(),
                                  token.endOffset(), token.type() );
            }

with

            token.termText = word;
            return token;

will make your code more efficient, since you won't be creating a bunch of new 
Token objects that then have to be garbage collected. It would be nice if termText 
were "protected" rather than package scoped.
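
To make the trade-off concrete, here is a minimal sketch using a stand-in token 
class. SimpleToken, filterByCopy and filterInPlace are hypothetical names made up 
for illustration, not Lucene's API. With the real Token the in-place version only 
compiles from inside org.apache.lucene.analysis, because termText is package scoped.

```java
// Hypothetical stand-in for Lucene's Token; NOT the real class.
final class TokenReuseSketch {

    static final class SimpleToken {
        String termText;          // package scoped, like the field discussed above
        final int startOffset;
        final int endOffset;
        final String type;

        SimpleToken(String termText, int startOffset, int endOffset, String type) {
            this.termText = termText;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
            this.type = type;
        }
    }

    // Same normalisation as in the thread: drop one trailing 's'.
    static String stripTrailingS(String word) {
        return word.endsWith("s") ? word.substring(0, word.length() - 1) : word;
    }

    // Allocates a new token whenever the text changes.
    static SimpleToken filterByCopy(SimpleToken token) {
        String word = stripTrailingS(token.termText);
        if (!word.equals(token.termText)) {
            return new SimpleToken(word, token.startOffset, token.endOffset, token.type);
        }
        return token;
    }

    // Overwrites the text on the existing token; no allocation.
    static SimpleToken filterInPlace(SimpleToken token) {
        token.termText = stripTrailingS(token.termText);
        return token;
    }
}
```

Note that the in-place version changes the token the caller passed in, so it is 
only safe when nothing upstream still needs the original text.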

Eric

-----Original Message-----
From: stephane vaucher [mailto:[EMAIL PROTECTED]]
Sent: Thursday, December 12, 2002 12:12 PM
To: Lucene Users List
Subject: Re: Accentuated characters


There is no problem with package scopes:

This is how I remove trailing 's' chars:

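            // Strip a single trailing 's' from the term text.
            String word = token.termText();

            if (word.endsWith("s")) {
                word = word.substring(0, word.length() - 1);
            }

            // Only allocate a new Token when the text actually changed.
            if ( !word.equals( token.termText() ) ) {
                return new Token( word, token.startOffset(),
                                  token.endOffset(), token.type() );
            }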

I'll take a look at how the Collator works to see if I can make a 
generic (maybe locale specific) string normaliser so I could specify the 
level of differences.
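
A rough sketch of what such a normaliser could look like with the JDK alone. 
It relies on java.text.Normalizer, which only arrived in Java 6 and so was not 
available when this thread was written. Collator.setStrength supplies the "level 
of differences" for comparisons. The class and method names are made up for 
illustration. Stripping only works for characters that decompose into a base 
letter plus combining marks, so it handles 'é' but not 'ø' or 'œ'.

```java
import java.text.Collator;
import java.text.Normalizer;
import java.util.Locale;

// Illustrative helper; names are assumptions, not part of Lucene or the JDK.
final class AccentSketch {

    // Decompose to NFD, then delete the combining marks (Unicode category M).
    static String stripAccents(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    // Compare two strings at a chosen Collator strength:
    //   Collator.PRIMARY   ignores accents and case
    //   Collator.SECONDARY distinguishes accents, ignores case
    //   Collator.TERTIARY  distinguishes accents and case
    static boolean equalAtStrength(String a, String b, Locale locale, int strength) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(strength);
        return collator.compare(a, b) == 0;
    }
}
```

For example, stripAccents("été") gives "ete", and equalAtStrength("été", "ete", 
Locale.FRENCH, Collator.PRIMARY) is true while the same call with 
Collator.SECONDARY is false. Inside a TokenFilter, stripAccents would be applied 
to the term text in the same place the trailing-'s' code above runs.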

Stephane

Eric Isakson wrote:

>If you really want to make your own TokenFilter, have a look at 
>org.apache.lucene.analysis.LowerCaseFilter.next()
>
>it does:
>  public final Token next() throws java.io.IOException {
>    Token t = input.next();
>
>    if (t == null)
>      return null;
>
>    t.termText = t.termText.toLowerCase();
>
>    return t;
>  }
>
>The termText member of the Token class is package scoped, so you will have to 
>implement your filter in the org.apache.lucene.analysis package. No worries about 
>encoding, as the termText is already a Java (Unicode) string. You will just have to 
>provide the mechanism to get the accented characters converted to their non-accented 
>equivalents. java.text.Collator has some magic that does this for string comparisons, 
>but I couldn't find any public methods that let you convert a string to 
>its non-accented equivalent.
>
>Eric
>--
>Eric D. Isakson        SAS Institute Inc.
>Application Developer  SAS Campus Drive
>XML Technologies       Cary, NC 27513
>(919) 531-3639         http://www.sas.com
>
>
>
>-----Original Message-----
>From: stephane vaucher [mailto:[EMAIL PROTECTED]]
>Sent: Tuesday, December 10, 2002 2:58 PM
>To: [EMAIL PROTECTED]
>Subject: Accentuated characters
>
>
>Hello everyone,
>
>I wish to implement a TokenFilter that will remove accentuated 
>characters so for example 'é' will become 'e'. As I would rather not 
>reinvent the wheel, I've tried to find something on the web and on the 
>mailing lists. I saw a mention of a contrib that could do this (see 
>http://www.mail-archive.com/lucene-user%40jakarta.apache.org/msg02146.html), 
>but I don't see anything applicable.
>
>Has anyone done this yet? If so, I would much appreciate some pointers 
>(or code); otherwise, I'll be happy to contribute whatever I produce 
>(but it might be very simple since I'll only need to deal with French).
>
>Cheers,
>Stephane
>
>
>--
>To unsubscribe, e-mail:   <mailto:[EMAIL PROTECTED]>
>For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>


