Rajan, Renuka wrote:
I am trying to match accented characters with non-accented characters in French/Spanish and other Western European languages. The use case is that the users may type letters without accents in error and we still want to be able to retrieve valid matches. The one idea, albeit naïve, is to normalize the data on the inbound side as well as the data in the database (prior to full text indexing) and retrieve matches.
Look back through the archives a bit for ISOLatin1AccentFilter. It almost does the job and works reasonably well for Western European characters. You'll also find a posting of mine that presents a somewhat more complete filter based on the Unicode decompositions. If you can't find it, I'll dig out the stuff I wrote and re-post it (and then maybe some kind soul will add it alongside ISOLatin1AccentFilter).
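
In the meantime, here is a rough sketch of the decomposition idea. It is written in Python rather than as a Lucene filter, purely to show the technique, and it is not the filter from my earlier posting. The function names strip_accents and fold are made up for the illustration. It assumes that dropping every nonspacing mark after canonical decomposition (NFD) is acceptable for your data.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose to NFD, then drop nonspacing marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def fold(text: str) -> str:
    """Hypothetical helper: accent-strip and lowercase for matching."""
    return strip_accents(text).lower()


if __name__ == "__main__":
    for word in ["Müller", "café", "niño", "École"]:
        print(word, "->", fold(word))
```

The same function would have to run on both the indexed text and the query for the two sides to meet. Note that characters with no canonical decomposition, such as "æ", pass through unchanged.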

Eric Jain's comment about "ä" being converted to "a" instead of "ae" is a fair one, but it probably doesn't much matter. I have seen "Müller" written as both "Muller" and "Mueller", though, so you're not going to be able to please everyone all the time without injecting synonyms and being very clever. And if you're that clever you might catch both "encyclopedia" and "encyclopædia" -- the latter converted to "encyclopaedia", which isn't the same as "encyclopëdia"!
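
If you did want to go the synonym route, one way to sketch it is to emit more than one folded form per term and index (or query) all of them. The following is only an illustration in Python, not anything that exists in Lucene: the tables LIGATURES and UMLAUT_DIGRAPHS and the function search_variants are hypothetical, deliberately tiny, and the "ae"/"oe"/"ue" spellings follow the German convention, which will be wrong for other languages.

```python
import unicodedata

# Hypothetical, incomplete tables chosen for this example only.
LIGATURES = {"æ": "ae", "Æ": "AE", "œ": "oe", "Œ": "OE", "ß": "ss"}
UMLAUT_DIGRAPHS = {"ä": "ae", "ö": "oe", "ü": "ue",
                   "Ä": "Ae", "Ö": "Oe", "Ü": "Ue"}


def _strip_marks(text: str) -> str:
    """Decompose to NFD and drop nonspacing marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def search_variants(term: str) -> set:
    """Return the lowercase folded spellings to index or search for."""
    # Compose first so that "u" + combining diaeresis is seen as "ü".
    composed = unicodedata.normalize("NFC", term)
    # Ligatures have no canonical decomposition, so expand them by table.
    expanded = "".join(LIGATURES.get(ch, ch) for ch in composed)
    # Variant 1: accents simply dropped ("Müller" -> "Muller").
    plain = _strip_marks(expanded)
    # Variant 2: umlauts written as digraphs ("Müller" -> "Mueller").
    digraph = _strip_marks(
        "".join(UMLAUT_DIGRAPHS.get(ch, ch) for ch in expanded)
    )
    return {plain.lower(), digraph.lower()}


if __name__ == "__main__":
    for word in ["Müller", "encyclopædia", "encyclopedia"]:
        print(word, "->", sorted(search_variants(word)))
```

This still won't make "encyclopedia" match "encyclopaedia"; that kind of equivalence needs a real synonym list rather than character folding.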

jch
