RE: UTF8 accents & umlauts filter?

Binkley, Peter Thu, 14 Sep 2006 10:49:25 -0700

We use ICU4J to do the filtering based on Unicode blocks. See
http://icu.sourceforge.net/userguide/Transform.html for a sense of what
you can do. It's worth it for us because we need to normalize Cyrillic
as well as Roman text; it might be overkill for other situations. But it
does good work. The first example on the page linked above shows
accent-stripping: you normalize to NFD (decomposed Unicode, where
accents are represented as separate non-spacing characters), then delete
all the non-spacing characters, and finally normalize back to NFC
(composed Unicode).
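
A minimal sketch of those three steps, for illustration only. It assumes
the JDK's own java.text.Normalizer (Java 6 and later) as a stand-in for
ICU4J, so it runs without extra dependencies; the class and method names
(AccentStripper, stripAccents) are made up for this example and are not
part of Lucene or ICU4J.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

public class AccentStripper {
    // \p{Mn} matches Unicode non-spacing marks (combining accents, umlauts, ...).
    private static final Pattern NON_SPACING_MARKS = Pattern.compile("\\p{Mn}+");

    // Hypothetical helper: NFD -> delete non-spacing marks -> NFC.
    public static String stripAccents(String input) {
        // 1. Decompose: U+00E9 (e with acute) becomes "e" followed by U+0301.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // 2. Delete the non-spacing marks that step 1 split off.
        String stripped = NON_SPACING_MARKS.matcher(decomposed).replaceAll("");
        // 3. Recompose whatever is left.
        return Normalizer.normalize(stripped, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // "caf\u00e9 M\u00fcller" is "café Müller" written with escapes.
        System.out.println(stripAccents("caf\u00e9 M\u00fcller")); // prints "cafe Muller"
    }
}
```

Because this works on decoded characters, it is not limited to Latin-1:
Cyrillic U+0439 comes out as U+0438, for example. Characters that have
no canonical decomposition, such as U+00F8 (o with stroke), pass through
unchanged and would still need a mapping table.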


Peter

-----Original Message-----
From: Michael Imbeault [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, September 13, 2006 9:34 PM
To: java-user@lucene.apache.org
Subject: Re: UTF8 accents & umlauts filter?

Thanks Yonik & Ken for both answers; I think the explanations went a
little over my head, but I think you understood what I was talking
about! Basically, I want a better filter that removes all possible
accents (and umlauts as a bonus, for completeness' sake; I personally
would have no use for that).

I think it's way more work and way more complicated than I initially
thought it would be. Does anyone feel able to do this?

Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212



Yonik Seeley wrote:
> Thanks for the links Michael... this one does look interesting:
> http://dev.alt.textdrive.com/browser/lu/LUStringBasicLatin.txt
> The challenge would be to make it fast... perhaps a custom hash table,
> or look into the cost of a perfect hash function.
>
> Just to clear up some Unicode/terminology issues:
>
> There are latin1 characters (the actual glyphs) represented by Unicode
> code points 0->255. There is also a latin1 encoding for Unicode (which
> can only represent Unicode code points 0->255).
> UTF8 is another encoding for Unicode characters (or code points), but
> that's not really relevant to a filter.
>
> So ISOLatin1AccentFilter removes accents from characters <= 255, and
> it doesn't matter what the original encoding was (ascii, latin1, UTF8,
> UTF16, etc).
>
> -Yonik
>
>
> On 9/12/06, Michael Imbeault <[EMAIL PROTECTED]> wrote:
>> Right now Lucene has an accent filter (ISOLatin1AccentFilter) that
>> removes accents on ISO-8859-1 text. What about a UTF8AccentFilter? Is
>> it planned to add such a filter? It would be very useful, as
>> ISOLatin1AccentFilter isn't able to remove some complex accents in
>> some languages encoded in UTF8. (I would paste examples but I'm not
>> sure that they would display correctly.) I think I saw a post long
>> ago on this mailing list about something like that, but it has never
>> been released officially.
>>
>> See
>>
>> 2001, first post about UTF8 accents:
>> http://www.gossamer-threads.com/lists/lucene/java-user/648?search_string=accent;#648
>>
>> 2004, a good solution, but still incomplete:
>> http://www.gossamer-threads.com/lists/lucene/java-user/10792?search_string=accent;#10792
>>
>> 2006, best attempt yet, but sadly undelivered:
>> http://www.gossamer-threads.com/lists/lucene/java-user/32142?search_string=accent;#32142
>>
>>
>> I think Lucene would benefit from a complete UTF8 accents remover...
>> right now the best solution I have is to process everything in PHP
>> before indexing and at query time (and it's a little slow).
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
