Thanks for the links Michael... this one does look interesting: http://dev.alt.textdrive.com/browser/lu/LUStringBasicLatin.txt The challenge would be to make it fast... perhaps a custom hash table, or look into the cost of a perfect hash function.
Just to clear up some unicode/terminology issues: There are latin1 characters (the actual glyphs) represented by unicode code points 0->255 There is also a latin1 encoding for unicode (which can only represent unicode code points 0->255) UTF8 is another encoding for unicode characters (or code points), but that's not really relevant to a filter. So ISOLatin1AccentFilter removes accents from characters <= 255, and it doesn't matter what the original encoding was (ascii, latin1, UTF8, UTF16, etc) -Yonik On 9/12/06, Michael Imbeault <[EMAIL PROTECTED]> wrote:
Right now Lucene has an accent filter (ISOLatin1AccentFilter) that remove accents on ISO-8859-1 text. What about a UTF8AccentFilter? Is it planned to add such a filter (which would be very useful, as ISOLatin1AccentFilter isn't able to remove some complex accents on some languages encoded in UTF8. I would paste examples but I'm not sure that they would display correctly).? I think I saw a post long ago on this mailing list about something like that, but it has never been released officially. See 2001, first post about utf8 accents: http://www.gossamer-threads.com/lists/lucene/java-user/648?search_string=accent;#648 2004, a good solution, but still incomplete : http://www.gossamer-threads.com/lists/lucene/java-user/10792?search_string=accent;#10792 2006, best attempt yet, but sadly undelivered : http://www.gossamer-threads.com/lists/lucene/java-user/32142?search_string=accent;#32142 I think Lucene would benefit from a complete UTF8 accents remover... right now the best solution I have is to process everything in PHP before indexing and at query time (and its a little slow).
--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]