mk wrote:
kj wrote:
I have read a *ton* of stuff on Unicode.  It doesn't even seem all
that hard.  Or so I think.  Then I start writing code, and WHAM:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

(There, see?  My Unicodephobia just went up a notch.)

Here's the thing: I don't even know how to *begin* debugging errors
like this.  This is where I could use some help.

 >>> a=u'\u0104'
 >>>
 >>> type(a)
<type 'unicode'>
 >>>
 >>> nu=a.encode('utf-8')
 >>>
 >>> type(nu)
<type 'str'>


See what I mean? You encode INTO string, and decode OUT OF string.
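To spell that direction out, here's a small sketch in Python 3 syntax (where the names are saner; the bytes shown match the UTF-8 session above):

```python
# Text encodes TO bytes; bytes decode BACK TO text.
text = '\u0104'               # LATIN CAPITAL LETTER A WITH OGONEK
data = text.encode('utf-8')   # text -> bytes
back = data.decode('utf-8')   # bytes -> text

assert data == b'\xc4\x84'    # same two bytes as in the session above
assert back == text           # round-trips cleanly
```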

Traditionally, strings were strings of byte-sized characters. Because they
were byte-sized they could also be used to hold binary data.

Then along came Unicode.

When working with Unicode in Python 2, you should use the 'unicode' type
for text (Unicode strings) and limit the 'str' type to binary data
(bytestrings, ie bytes) only.

In Python 3 they've been renamed to 'str' for Unicode _strings_ and
'bytes' for binary data (bytes!).
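So the Python 2 advice above translates directly; a quick Python 3 sanity check (nothing beyond the two built-in types):

```python
# Python 3: 'str' is Unicode text, 'bytes' is binary data.
s = 'hello'                   # text
b = b'hello'                  # binary data

assert type(s) is str
assert type(b) is bytes
assert s.encode('ascii') == b   # text -> bytes
assert b.decode('ascii') == s   # bytes -> text
```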

To make matters more complicated, calling encode() on a Python 2 str
internally DECODES the string to unicode first:

 >>> nu
'\xc4\x84'
 >>>
 >>> type(nu)
<type 'str'>
 >>> nu.encode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0: ordinal not in range(128)

There's logic to this, although it makes my brain want to explode. :-)

Strictly speaking, only Unicode can be encoded.

What Python 2 is doing here is trying to be helpful: if it's already a
bytestring then decode it first to Unicode and then re-encode it to a
bytestring.

Unfortunately, the default encoding is ASCII, and the bytestring isn't
valid ASCII. Python 2 is being 'helpful' in a bad way!
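What Python 2 does implicitly can be written out explicitly, which also shows the fix: decode with the codec the bytes are actually in. (Sketched in Python 3 syntax, where bytes objects don't even have an encode() method, so the trap can't occur:)

```python
# The bytes from the session above: UTF-8 for U+0104.
nu = b'\xc4\x84'

# Python 2's nu.encode() effectively did nu.decode('ascii') first,
# and 0xc4 is not valid ASCII, hence the UnicodeDecodeError.
try:
    nu.decode('ascii')
except UnicodeDecodeError:
    pass  # same failure the traceback above shows

# The fix: name the real codec instead of relying on the ASCII default.
assert nu.decode('utf-8') == '\u0104'
```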
--
http://mail.python.org/mailman/listinfo/python-list