-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

>> When I used the "NFD" option, I came across many errors on these and
>> possibly other codes: \xba, \xc9, \xcd.
>
> What errors? normalize method is not supposed to give any errors. You
> mean it doesn't work as expected? Well, I have to admit that using
> normalize is a far from perfect way to implement search. The most
> advanced algorithm is published by Unicode guys:
> <http://www.unicode.org/reports/tr10/> If you read it you'll understand
> it's not so easy.
>
>> I tried to use "NFKD" instead, and the number of errors was only about
>> half a dozen, for a universe of 600000+ names, on code \xbf.
>> It looks like I have to do a search and substitute using regular
>> expressions for these cases. Or is there a better way to do it?
>
> Perhaps you can use unicode translate method to map the characters that
> still give you problems to whatever you want.
>
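For reference, the translate suggestion quoted above might look roughly like the sketch below. It is written in Python 3 syntax, which is an assumption; the script in this thread is Python 2, where the same idea should apply to unicode objects. The names FALLBACK and to_ascii are hypothetical, and the replacement characters in the table are illustrative choices only.

```python
import unicodedata

# Hypothetical fallback table for characters that NFD leaves untouched.
# Keys are code points; the replacements are illustrative choices only.
FALLBACK = {
    0xBA: "o",  # masculine ordinal indicator, not decomposed by NFD
    0xBF: "?",  # inverted question mark, has no decomposition at all
}


def to_ascii(text):
    """Strip combining marks after NFD, then apply the fallback table."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")
    )
    # translate() maps each code point found in FALLBACK to its replacement
    # and leaves every other character unchanged.
    return stripped.translate(FALLBACK)


print(to_ascii("DAÇÃO"))             # DACAO
print(to_ascii("DA 1º DE MO Nº 2"))  # DA 1o DE MO No 2
```

Any character that is neither decomposed by NFD nor listed in the table would still fail a strict encode("ascii"), so the table has to grow as new problem characters turn up.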
Errors occur when I assign the result of
''.join(cp for cp in de_str if not unicodedata.category(cp).startswith('M'))
to a variable. The same happens with de_str. When I print the strings
everything is ok.

Here's a short example of data:

115448,DAÇÃO
117788,DA 1º DE MO Nº 2

I used the following script to convert the data:

# -*- coding: iso8859-15 -*-

class Latin1ToAscii:

    def abreFicheiro(self):
        import csv
        self.reader = csv.reader(open(self.input_file, "rb"))

    def converter(self):
        import unicodedata
        self.lista_csv = []
        for row in self.reader:
            s = unicode(row[1],"latin-1")
            de_str = unicodedata.normalize("NFD", s)
            nome = ''.join(cp for cp in de_str if not \
                    unicodedata.category(cp).startswith('M'))

            linha_ascii = row[0] + "," + nome    # *
            print linha_ascii.encode("ascii")
            self.lista_csv.append(linha_ascii)

    def __init__(self):
        self.input_file = 'nome_latin1.csv'
        self.output_file = 'nome_ascii.csv'

if __name__ == "__main__":
    f = Latin1ToAscii()
    f.abreFicheiro()
    f.converter()

And I got the following result:

$ python latin1_to_ascii.py
115448,DACAO
Traceback (most recent call last):
  File "latin1_to_ascii.py", line 44, in ?
    f.converter()
  File "latin1_to_ascii.py", line 22, in converter
    print linha_ascii.encode("ascii")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xba' in
position 11: ordinal not in range(128)

The script converted the ÇÃ from the first line, but not the º from the
second one.

Still at the line marked *, I also don't get a list such as [115448,DAÇÃO]
but a [u'115448,DAÇÃO'] element, which doesn't suit my needs.

Would you mind telling me what I should change?

Luis P. Mendes
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFEYN7+Hn4UHCY8rB8RAjcTAKCgEkZwCURgp/VrtthM1MBba+d7KACfY9dj
xcHVL1BuhyrPV8+9Z5Q2AJQ=
=+AO0
-----END PGP SIGNATURE-----