We use ICU4J to do the filtering based on Unicode blocks. See http://icu.sourceforge.net/userguide/Transform.html for a sense of what you can do. It's worth it for us because we need to normalize cyrillic as well as roman text; it might be overkill for other situations. But it does good work. The first example on the page linked above shows accent-stripping: you normalize to NFD (decomposed unicode, where accents are represented as non-spacing characters), then delete all the non-spacing characters, and finally normalize back to composed unicode.
Peter -----Original Message----- From: Michael Imbeault [mailto:[EMAIL PROTECTED] Sent: Wednesday, September 13, 2006 9:34 PM To: java-user@lucene.apache.org Subject: Re: UTF8 accents & umlauts filter? Thanks Yonik & Ken for both answers; I think the explanations went a little over my head, but I think you understood what I was talking about! Basically, a better filter to remove all possible accents (& umlauts as a bonus, for completeness sake; I personally would have no use for it). I think it's way more work and way more complicated than I initially thought it would be. Anyone feels able to do this? Michael Imbeault CHUL Research Center (CHUQ) 2705 boul. Laurier Ste-Foy, QC, Canada, G1V 4G2 Tel: (418) 654-2705, Fax: (418) 654-2212 Yonik Seeley wrote: > Thanks for the links Michael... this one does look interesting: > http://dev.alt.textdrive.com/browser/lu/LUStringBasicLatin.txt > The challenge would be to make it fast... perhaps a custom hash table, > or look into the cost of a perfect hash function. > > Just to clear up some unicode/terminology issues: > > There are latin1 characters (the actual glyphs) represented by unicode > code points 0->255 There is also a latin1 encoding for unicode (which > can only represent unicode code points 0->255) > UTF8 is another encoding for unicode characters (or code points), but > that's not really relevant to a filter. > > So ISOLatin1AccentFilter removes accents from characters <= 255, and > it doesn't matter what the original encoding was (ascii, latin1, UTF8, > UTF16, etc) > > -Yonik > > > On 9/12/06, Michael Imbeault <[EMAIL PROTECTED]> wrote: >> Right now Lucene has an accent filter (ISOLatin1AccentFilter) that >> remove accents on ISO-8859-1 text. What about a UTF8AccentFilter? Is >> it planned to add such a filter (which would be very useful, as >> ISOLatin1AccentFilter isn't able to remove some complex accents on >> some languages encoded in UTF8. I would paste examples but I'm not >> sure that they would display correctly).? I think I saw a post long >> ago on this mailing list about something like that, but it has never >> been released officially. >> >> See >> >> 2001, first post about utf8 accents: >> http://www.gossamer-threads.com/lists/lucene/java-user/648?search_str >> ing=accent;#648 >> >> 2004, a good solution, but still incomplete : >> http://www.gossamer-threads.com/lists/lucene/java-user/10792?search_s >> tring=accent;#10792 >> >> 2006, best attempt yet, but sadly undelivered : >> http://www.gossamer-threads.com/lists/lucene/java-user/32142?search_s >> tring=accent;#32142 >> >> >> I think Lucene would benefit from a complete UTF8 accents remover... >> right now the best solution I have is to process everything in PHP >> before indexing and at query time (and its a little slow). > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]