Re: encoding problems (é and è)

Duncan Booth Fri, 24 Mar 2006 01:15:43 -0800

Peter Otten wrote:

>> You can replace ALL of this upshifting and accent removal in one blow
>> by using the string translate() method with a suitable table.
> 
> Only if you convert to unicode first or if your data maintains 1 byte
> == 1 character, in particular it is not UTF-8. 
>
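
To make that caveat concrete, here is a minimal sketch in Python 3 (where str is already unicode). The table contents and the names ACCENT_TABLE and strip_and_upper are my own assumptions for illustration, not anything from Bussiere's code, and the table only covers a handful of French letters:

```python
# Hypothetical translation table: accented letter -> plain letter.
# Escapes are used so the source file's own encoding does not matter.
ACCENT_TABLE = str.maketrans(
    "\u00e9\u00e8\u00ea\u00eb\u00e0\u00e2\u00ee\u00ef\u00f4\u00f9\u00fb\u00e7",
    "eeeeaaiiouuc",
)


def strip_and_upper(text):
    """Drop the accents listed in ACCENT_TABLE, then upper-case the result."""
    return text.translate(ACCENT_TABLE).upper()


raw = "\u00e9t\u00e9".encode("utf-8")  # three characters, five bytes in UTF-8
# A per-byte table cannot work on 'raw', so decode to text first.
print(strip_and_upper(raw.decode("utf-8")))
```

The point is only that the table is applied per character after decoding; applied to the UTF-8 bytes directly, each accented letter would show up as two separate bytes.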


There's a nice little codec from Skip Montanaro for removing accents from 
latin-1 encoded strings. It also has an error handler so you can convert 
from unicode to ascii and strip all the accents as you do so:

http://orca.mojam.com/~skip/python/latscii.py

>>> import latscii
>>> import htmlentitydefs
>>> print u'\u00c9'.encode('ascii','replacelatscii')
E
>>> 

So Bussiere could replace a large chunk of his code with:

    ligneA = ligneA.decode(INPUTENCODING).encode('ascii', 'replacelatscii')
    ligneA = ligneA.upper()

INPUTENCODING is 'utf8' unless (one possible explanation for his problem) 
his files are actually in some different encoding.
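
If the encoding really is in doubt, one rough way to check is to try a strict UTF-8 decode and fall back to something else. This is only a sketch in Python 3; decode_guess is a hypothetical helper, and the candidate list (utf-8, then latin-1) is my assumption about what the files might contain:

```python
def decode_guess(raw, encodings=("utf-8", "latin-1")):
    """Return (text, encoding_name) for the first candidate that decodes cleanly.

    latin-1 accepts every byte sequence, so it must come last; a match on it
    proves nothing beyond "this was not valid UTF-8".
    """
    for name in encodings:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings fit")


print(decode_guess("\u00e9".encode("utf-8"))[1])
print(decode_guess(b"\xe9")[1])
```

A lone byte 0xE9 is not valid UTF-8, so the second call falls through to latin-1.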

Unfortunately, just as I finished writing this I discovered that the 
latscii module isn't as robust as I thought: it blows up on consecutive 
accented characters.

 :(
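
For what it's worth, an error handler along the same lines can be sketched with the standard library alone. This is Python 3 and is not latscii's code: the handler name 'stripaccents' and the function names are made up for illustration, and it uses unicodedata decomposition instead of latscii's table. It replaces the whole failing span the encoder reports, which may be several characters long, so a run of accented characters is handled in one go:

```python
import codecs
import unicodedata


def strip_accents_handler(exc):
    """Encode-error handler: replace the failing span with unaccented ASCII."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]  # may cover several characters
    decomposed = unicodedata.normalize("NFKD", bad)
    replacement = "".join(
        ch if ord(ch) < 128 else "?"       # no ASCII base letter: use '?'
        for ch in decomposed
        if not unicodedata.combining(ch)   # drop the combining accent marks
    )
    return replacement, exc.end            # resume encoding after the span


# 'stripaccents' is a hypothetical name, chosen for this example only.
codecs.register_error("stripaccents", strip_accents_handler)


def to_ascii_upper(raw, input_encoding="utf-8"):
    """Decode bytes, strip accents via the handler, and upper-case."""
    text = raw.decode(input_encoding)
    return text.encode("ascii", "stripaccents").decode("ascii").upper()


print(to_ascii_upper("d\u00e9j\u00e0 \u00e9\u00e8".encode("utf-8")))
```

Characters with no ASCII base letter come out as '?', which may or may not be what you want.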

-- 
http://mail.python.org/mailman/listinfo/python-list
