Luis P. Mendes wrote:
> Richie Hindle wrote:
> > [Serge]
> >> def search_key(s):
> >>     de_str = unicodedata.normalize("NFD", s)
> >>     return ''.join(cp for cp in de_str if not
> >>                    unicodedata.category(cp).startswith('M'))
> >
> > Lovely bit of code - thanks for posting it!
> >
> > You might want to use "NFKD" to normalize things like LATIN SMALL
> > LIGATURE FI and subscript/superscript characters as well as diacritics.
>
> Thank you very much for your info. It's a very good approach.
>
> When I used the "NFD" option, I came across many errors on these and
> possibly other codes: \xba, \xc9, \xcd.

What errors? The normalize method is not supposed to raise any errors.
Do you mean it doesn't work as expected? Well, I have to admit that
using normalize is a far from perfect way to implement search. The most
advanced algorithm is the one published by the Unicode guys (the
Unicode Collation Algorithm):
<http://www.unicode.org/reports/tr10/>
If you read it, you'll understand it's not so easy.

> I tried to use "NFKD" instead, and the number of errors was only about
> half a dozen, for a universe of 600000+ names, on code \xbf.
> It looks like I have to do a search and substitute using regular
> expressions for these cases. Or is there a better way to do it?

Perhaps you can use the unicode translate method to map the characters
that still give you problems to whatever you want.
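
For what it's worth, here is a minimal sketch of the translate idea,
assuming Python 3 (where every str is Unicode) and assuming the
leftovers are characters that simply have no decomposition, such as
\xbf. The FALLBACK table and the replacements in it are hypothetical;
put in whatever suits your data.

```python
import unicodedata

# Hypothetical fallback table for characters that have no Unicode
# decomposition and so pass through normalize() unchanged.
# The replacement values are illustrative assumptions only.
FALLBACK = str.maketrans({
    "\xbf": None,   # INVERTED QUESTION MARK: dropped from the key
    "\xd8": "O",    # LATIN CAPITAL LETTER O WITH STROKE
    "\xf8": "o",    # LATIN SMALL LETTER O WITH STROKE
    "\xdf": "ss",   # LATIN SMALL LETTER SHARP S
})

def search_key(s):
    """Build a search key: NFKD, strip combining marks, then translate."""
    decomposed = unicodedata.normalize("NFKD", s)
    stripped = "".join(
        cp for cp in decomposed
        if not unicodedata.category(cp).startswith("M")
    )
    # Map whatever normalization could not handle.
    return stripped.translate(FALLBACK)
```

With this, \xc9 and \xcd come out as plain E and I, \xba becomes o
(only because of NFKD), and \xbf is handled by the table rather than
by a regular expression.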