IBM's ICU4J has a normalizer which should do what you need. It's a big library, but if you deal with multilingual text often, it might make your life easier.
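
For what it's worth, here is a minimal sketch of the decompose-then-strip idea. To stay dependency-free it uses the JDK's own java.text.Normalizer (available since Java 6) rather than ICU4J, so treat it as an illustration of the technique, not of the ICU4J API. The class and method names (AccentStripper, stripAccents) are made up for this example, and wrapping it in a Lucene TokenFilter is left out because that depends on the Lucene version in use.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of Lucene or ICU4J.
final class AccentStripper {
    private AccentStripper() {}

    // Removes diacritics by decomposing the text and dropping combining marks.
    static String stripAccents(String input) {
        // NFD splits a precomposed letter such as U+00E9 into 'e' + U+0301.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // \p{M} matches the combining marks left behind by the decomposition.
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // Unicode escapes keep the source file safe from encoding problems.
        System.out.println(stripAccents("\u00e9l\u00e8ve, gar\u00e7on, No\u00ebl"));
    }
}
```

Note that this only handles characters that have a canonical decomposition. Ligatures such as 'œ' are left untouched, so French text may still need a small explicit mapping for those.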
-- Alex Murzaku
___________________________________________
alex(at)lissus.com http://www.lissus.com

-----Original Message-----
From: stephane vaucher [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, December 10, 2002 2:58 PM
To: [EMAIL PROTECTED]
Subject: Accentuated characters

Hello everyone,

I wish to implement a TokenFilter that will remove accentuated characters so for example 'é' will become 'e'. As I would rather not reinvent the wheel, I've tried to find something on the web and on the mailing lists. I saw a mention of a contrib that could do this (see http://www.mail-archive.com/lucene-user%40jakarta.apache.org/msg02146.html), but I don't see anything applicable.

Has anyone done this yet, if so I would much appreciate some pointers (or code), otherwise, I'll be happy to contribute whatever I produce (but it might be very simple since I'll only need to deal with french).

Cheers,
Stephane

-- To unsubscribe, e-mail: <mailto:[EMAIL PROTECTED]> For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>
