RE: Accentuated characters

Eric Isakson Tue, 10 Dec 2002 13:32:10 -0800

If you really want to make your own TokenFilter, have a look at 
org.apache.lucene.analysis.LowerCaseFilter.next()


it does:
  public final Token next() throws java.io.IOException {
    Token t = input.next();

    if (t == null)
      return null;

    t.termText = t.termText.toLowerCase();

    return t;
  }

The termText member of the Token class is package scoped, so you will have to 
implement your filter in the org.apache.lucene.analysis package. No worries about 
encoding as the termText is already a java (unicode) string. You will just have to 
provide the mechanism to get the accented characters converted to there non-accented 
equivalents. java.text.Collator has some magic that does this for string comparisons 
but I couldn't find any public methods that give you access to convert a string to its 
non-accented equivalent.

Eric
--
Eric D. Isakson        SAS Institute Inc.
Application Developer  SAS Campus Drive
XML Technologies       Cary, NC 27513
(919) 531-3639         http://www.sas.com



-----Original Message-----
From: stephane vaucher [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, December 10, 2002 2:58 PM
To: [EMAIL PROTECTED]
Subject: Accentuated characters


Hello everyone,

I wish to implement a TokenFilter that will remove accentuated 
characters so for example '�' will become 'e'. As I would rather not 
reinvent the wheel, I've tried to find something on the web and on the 
mailing lists. I saw a mention of a contrib that could do this (see 
http://www.mail-archive.com/lucene-user%40jakarta.apache.org/msg02146.html), 
but I don't see anything applicable.

Has anyone done this yet, if so I would much appreciate some pointers 
(or code), otherwise, I'll be happy to contribute whatever I produce 
(but it might be very simple since I'll only need to deal with french).

Cheers,
Stephane


--
To unsubscribe, e-mail:   <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>


--
To unsubscribe, e-mail:   <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>

RE: Accentuated characters

Reply via email to