Re: UTF8 accents & umlauts filter?

Ken Krugler Tue, 12 Sep 2006 14:20:16 -0700

Thanks for the links Michael... this one does look interesting:
http://dev.alt.textdrive.com/browser/lu/LUStringBasicLatin.txt
The challenge would be to make it fast... perhaps a custom hash table,
or look into the cost of a perfect hash function.
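
As a rough illustration of the hash-table idea, here is a minimal sketch of table-driven folding. The class name FoldingTable and its four entries are hypothetical; a real table would be generated from a complete mapping such as the file linked above, and nothing here is taken from Lucene itself.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: fold accented characters through a lookup table. */
final class FoldingTable {
    // Illustrative subset only, not a complete mapping.
    private static final Map<Character, String> FOLD = new HashMap<>();
    static {
        FOLD.put('\u00E9', "e");   // e with acute
        FOLD.put('\u00FC', "u");   // u with diaeresis
        FOLD.put('\u00DF', "ss");  // sharp s expands to two characters
        FOLD.put('\u0142', "l");   // l with stroke (outside Latin-1)
    }

    static String fold(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c < 0x80) {
                // Fast path: ASCII never needs folding, so skip the lookup.
                out.append(c);
                continue;
            }
            String mapped = FOLD.get(c);
            // Characters without an entry pass through unchanged.
            out.append(mapped != null ? mapped : String.valueOf(c));
        }
        return out.toString();
    }
}
```

The ASCII fast path is where most of the speed would come from on typical text. Whether a perfect hash beats a plain HashMap for the remaining characters would need to be measured.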


Just to clear up some unicode/terminology issues:


Some additional clarification below:

There are latin1 characters (the actual glyphs) represented by unicode
code points 0->255


Just U+00A0 -> U+00FF would be considered Latin-1 by Unicode.

Unicode calls the block of Unicode code points from U+0000 -> U+007F "C0 Controls and Basic Latin".


From U+0080 to U+00FF is "C1 Controls and Latin-1 Supplement".

There is also a latin1 encoding for unicode (which can only represent
unicode code points 0->255)

There's an ISO 8859-1 charset (combination of character set, codepoints and encoding) that matches Unicode code points for 0x00 -> 0x7F and 0xA0 -> 0xFF. Or rather, the Unicode code points for these two ranges were selected to match ISO 8859-1.

UTF8 is another encoding for unicode characters (or code points), but
that's not really relevant to a filter.

So ISOLatin1AccentFilter removes accents from characters <= 255, and
it doesn't matter what the original encoding was (ascii, latin1, UTF8,
UTF16, etc)

This isn't really about the "original encoding" - by the time ISOLatin1AccentFilter is called, it's dealing with Java strings, which use the UTF-16 Unicode encoding.
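
To make that concrete, here is a small sketch showing the same character arriving in three different byte encodings and ending up as the identical Java string. The class name EncodingDemo is made up for illustration, and it assumes a JDK that provides java.nio.charset.StandardCharsets.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical demo: the source encoding is gone once text is a Java String. */
final class EncodingDemo {
    // U+00E9 (e with acute) as raw bytes in three encodings.
    static final byte[] LATIN1_BYTES = { (byte) 0xE9 };
    static final byte[] UTF8_BYTES = { (byte) 0xC3, (byte) 0xA9 };
    static final byte[] UTF16BE_BYTES = { 0x00, (byte) 0xE9 };

    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String a = decode(LATIN1_BYTES, StandardCharsets.ISO_8859_1);
        String b = decode(UTF8_BYTES, StandardCharsets.UTF_8);
        String c = decode(UTF16BE_BYTES, StandardCharsets.UTF_16BE);
        // All three are one UTF-16 code unit with value 0xE9.
        System.out.println(a.equals(b) && b.equals(c));
        System.out.println(Integer.toHexString(a.charAt(0)));
    }
}
```

A filter working on the decoded string only ever sees the code unit 0x00E9, regardless of which of the three byte sequences it came from.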

I think what Michael is asking for is the implementation of one of the Unicode-defined normalization forms (see http://www.unicode.org/reports/tr15/) along with diacritical stripping and other folding. Basically it's a way of mapping characters to a primary sort key.
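
A minimal sketch of the normalization-plus-stripping part, assuming a JDK that ships java.text.Normalizer (Java 6 or later). The class and method names are hypothetical, and this is not how ISOLatin1AccentFilter works. It decomposes to form NFD and then drops the combining marks.

```java
import java.text.Normalizer;

/** Hypothetical sketch: strip diacritics via canonical decomposition (NFD). */
final class DiacriticStripper {
    static String strip(String input) {
        // NFD splits a precomposed character into base letter + combining marks.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // \p{M} matches the Unicode "Mark" categories, i.e. the combining marks.
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        System.out.println(strip("r\u00E9sum\u00E9"));
        System.out.println(strip("\u00DCber"));
    }
}
```

Note that this only handles characters with a canonical decomposition. Letters such as U+0142 (l with stroke) or U+00DF (sharp s) have none and pass through unchanged, which is where the "other folding" comes in.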

This is pretty complex, especially when you start considering locale-specific details - we used ICU support for this in the past, which is where I'd probably start. ICU needs a lot of data to handle this properly across most locales, so it's not lightweight, but it would give you a general (albeit slower) solution.


-- Ken

On 9/12/06, Michael Imbeault <[EMAIL PROTECTED]> wrote:

Right now Lucene has an accent filter (ISOLatin1AccentFilter) that
removes accents on ISO-8859-1 text. What about a UTF8AccentFilter? Is it
planned to add such a filter? It would be very useful, as
ISOLatin1AccentFilter isn't able to remove some complex accents in some
languages encoded in UTF8 (I would paste examples, but I'm not sure that
they would display correctly). I think I saw a post long ago on this
mailing list about something like that, but it has never been released
officially.

See

2001, first post about utf8 accents:
http://www.gossamer-threads.com/lists/lucene/java-user/648?search_string=accent;#648
2004, a good solution, but still incomplete :
http://www.gossamer-threads.com/lists/lucene/java-user/10792?search_string=accent;#10792
2006, best attempt yet, but sadly undelivered :
http://www.gossamer-threads.com/lists/lucene/java-user/32142?search_string=accent;#32142

I think Lucene would benefit from a complete UTF8 accents remover...
right now the best solution I have is to process everything in PHP
before indexing and at query time (and it's a little slow).


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"
