Re: ascii to latin1

Luis P. Mendes Tue, 09 May 2006 11:30:59 -0700

>> When I used the "NFD" option, I came across many errors on these and
>> possibly other codes: \xba, \xc9, \xcd.
> 
> What errors? normalize method is not supposed to give any errors. You
> mean it doesn't work as expected? Well, I have to admit that using
> normalize is a far from perfect way to implement search. The most
> advanced algorithm is published by Unicode guys:
> <http://www.unicode.org/reports/tr10/> If you read it you'll understand
> it's not so easy.
> 
>> I tried to use "NFKD" instead, and the number of errors was only about
>> half a dozen, for a universe of 600000+ names, on code \xbf.
>> It looks like I have to do a search and substitute using regular
>> expressions for these cases.  Or is there a better way to do it?
> 
> Perhaps you can use unicode translate method to map the characters that
> still give you problems to whatever you want.
>
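
As a rough sketch of the translate suggestion quoted above, written in
Python 3 syntax (an assumption, since the script below is Python 2). The
table name FALLBACK and the replacements chosen in it are purely
illustrative and do not come from this thread:

```python
# Hypothetical fallback table for characters that normalization leaves
# untouched. The replacements are illustrative guesses, not a standard.
FALLBACK = str.maketrans({
    "\xba": "o",   # º masculine ordinal indicator
    "\xaa": "a",   # ª feminine ordinal indicator
    "\xbf": "?",   # ¿ inverted question mark
})

name = "DA 1\xba DE MO N\xba 2"
cleaned = name.translate(FALLBACK)
print(cleaned)
```

Characters that are not in the table pass through unchanged, so this could
run after the normalize step rather than instead of it.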


Errors occur when I assign the result of ''.join(cp for cp in de_str if
not unicodedata.category(cp).startswith('M')) to a variable.  The same
happens with de_str.  When I print the strings everything is ok.

Here's a short example of data:
115448,DAÇÃO
117788,DA 1º DE MO Nº 2

I used the following script to convert the data:
# -*- coding: iso8859-15 -*-

class Latin1ToAscii:

        def abreFicheiro(self):
                import csv
                self.reader = csv.reader(open(self.input_file, "rb"))
                
        def converter(self):
                import unicodedata
                self.lista_csv = []
                for row in self.reader:
                        s = unicode(row[1],"latin-1")
                        de_str = unicodedata.normalize("NFD", s)
                        nome = ''.join(cp for cp in de_str if not \
                        unicodedata.category(cp).startswith('M'))

                        linha_ascii = row[0] + "," + nome  # *
                        print linha_ascii.encode("ascii")
                        self.lista_csv.append(linha_ascii)

        
        def __init__(self):
                self.input_file = 'nome_latin1.csv'
                self.output_file = 'nome_ascii.csv'

if __name__ == "__main__":
        f = Latin1ToAscii()
        f.abreFicheiro()
        f.converter()


And I got the following result:
$ python latin1_to_ascii.py
115448,DACAO
Traceback (most recent call last):
  File "latin1_to_ascii.py", line 44, in ?
    f.converter()
  File "latin1_to_ascii.py", line 22, in converter
    print linha_ascii.encode("ascii")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xba' in
position 11: ordinal not in range(128)


The script converted the ÇÃ in the first line, but not the º in the
second one.  Also, at the line marked *, I don't get a list such as
[115448,DAÇÃO] but a single [u'115448,DAÇÃO'] element, which doesn't
suit my needs.
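
A minimal sketch of what seems to be going on, again in Python 3 syntax
(an assumption; the script above is Python 2). The helper name strip_marks
is hypothetical. It compares "NFD" with "NFKD" on the second sample line
and keeps the row as a list of two fields instead of one joined string:

```python
import unicodedata

def strip_marks(text, form):
    """Normalize text with the given form, then drop combining marks."""
    decomposed = unicodedata.normalize(form, text)
    return "".join(ch for ch in decomposed
                   if not unicodedata.category(ch).startswith("M"))

sample = "DA 1\xba DE MO N\xba 2"

# º has only a compatibility decomposition (<super> 006F), so NFD leaves
# it alone while NFKD maps it to a plain "o".
nfd_name = strip_marks(sample, "NFD")
nfkd_name = strip_marks(sample, "NFKD")

# Keep the two fields separate so the row stays a list.
row = ["117788", nfkd_name]

# "replace" substitutes "?" for anything still outside ASCII instead of
# raising UnicodeEncodeError.
ascii_line = ",".join(row).encode("ascii", "replace")
print(row, ascii_line)
```

\xbf (¿) has no decomposition in either form, which would account for the
few leftovers seen with "NFKD"; with the "replace" handler such characters
come out as "?".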

Would you mind telling me what I should change?


Luis P. Mendes