Re: Matching accented with non-accented characters

John Haxby Tue, 25 Jul 2006 12:57:50 -0700

Rajan, Renuka wrote:

I am trying to match accented characters with non-accented characters in French/Spanish and other Western European languages. The use case is that the users may type letters without accents in error and we still want to be able to retrieve valid matches. The one idea, albeit naïve, is to normalize the data on the inbound side as well as the data in the database (prior to full text indexing) and retrieve matches.

Look back through the archives a bit for ISOLatin1AccentFilter. It almost does the job and works reasonably well for Western European characters. You'll also find a posting of mine that presents a somewhat more complete filter based on the Unicode decompositions. If you can't find it I'll dig out the stuff I wrote and re-post it (and then maybe some kind soul will add it alongside ISOLatin1AccentFilter).
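
For readers who want to see the decomposition idea in miniature, here is a rough sketch. It is not the filter posted to the list and it is not Lucene code; it is written in Python for brevity, and the function name strip_accents is made up for illustration. It assumes that "based on the Unicode decompositions" means: decompose each character canonically (NFD), then drop the combining marks.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Fold accented letters to their base letters (hypothetical helper).

    Assumption: canonical decomposition (NFD) splits a letter such as
    U+00FC into "u" plus a combining diaeresis, so dropping every
    combining mark leaves the unaccented base letter.
    """
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a non-zero class for combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    for word in ("Müller", "naïve", "encyclopædia"):
        print(word, "->", strip_accents(word))
```

Note that a character with no decomposition, such as "æ", passes through unchanged, so a filter built this way still needs an explicit table for ligatures and similar letters. The same function would be applied to both the indexed text and the query text.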

Eric Jain's comment about "ä" being converted to "a" instead of "ae" is a fair one, but it probably doesn't much matter. Although I have seen "Müller" written as both "Muller" and "Mueller" so you're not going to be able to please everyone all the time without injecting synonyms and being very clever. And if you're that clever you might catch both "encyclopedia" and "encyclopædia" -- the latter converted to "encyclopaedia" which isn't the same as "encyclopëdia"!
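
One way to picture "injecting synonyms" is to emit more than one folded form per term at index time. The sketch below is only an illustration in Python, not anything that exists in Lucene: the tables LIGATURES and UMLAUT_EXPANSIONS and the function index_variants are hypothetical, and the tables are deliberately tiny.

```python
import unicodedata

# Hypothetical, deliberately incomplete tables for illustration only.
LIGATURES = {"æ": "ae", "Æ": "AE", "œ": "oe", "Œ": "OE", "ß": "ss"}
UMLAUT_EXPANSIONS = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
}


def strip_accents(text: str) -> str:
    """Decompose (NFD) and drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def index_variants(term: str) -> set[str]:
    """Return the folded forms under which a term could be indexed."""
    # Compose first so that "u" + combining diaeresis matches the table keys.
    composed = unicodedata.normalize("NFC", term)
    # Ligatures have no Unicode decomposition, so they need a lookup table.
    expanded = "".join(LIGATURES.get(ch, ch) for ch in composed)
    # Variant 1: simply drop the accents ("ü" becomes "u").
    plain = strip_accents(expanded)
    # Variant 2: German-style expansion ("ü" becomes "ue"), then drop the rest.
    german = strip_accents(
        "".join(UMLAUT_EXPANSIONS.get(ch, ch) for ch in expanded)
    )
    return {plain, german}


if __name__ == "__main__":
    for word in ("Müller", "encyclopædia", "encyclopëdia"):
        print(word, "->", sorted(index_variants(word)))
```

With these tables, "Müller" yields both "Muller" and "Mueller", while "encyclopædia" folds to "encyclopaedia" and "encyclopëdia" folds to "encyclopedia", so the last two still do not match each other without a further synonym rule.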


jch
