Oleg Parashchenko wrote:

> On Mar 29, 4:53 pm, "Paul Boddie" <[EMAIL PROTECTED]> wrote:
> > On 29 Mar, 06:26, "Oleg Parashchenko" <[EMAIL PROTECTED]> wrote:
> > >
> > > I'm working on an unicode-aware application. I like to use
> > > "print" to debug programs, but in this case it was nightmare.
> > > The most popular result of "print" was:
> > >
> > > UnicodeDecodeError: 'ascii' codec can't decode byte 0xXX in
> > > position 0: ordinal not in range(128)
I think I've found the actual source of this, and it isn't the print
statement. UnicodeDecodeError relates to the construction of Unicode
objects, not the encoding of such objects as byte strings. The
terminology is explained using this simple diagram (which hopefully
won't be ruined in transmission):

  byte string in XYZ encoding
    | (decode from XYZ) --> possible UnicodeDecodeError
    |
    V
  Unicode object
    | (encode to ABC) --> possible UnicodeEncodeError
    |
    V
  byte string in ABC encoding

> > What does sys.stdout.encoding say?
>
> 'KOI8-R'

[...]

> > What do you get if you do this...?
> >
> >   import locale
> >   locale.setlocale(locale.LC_ALL, "")
> >   print locale.getlocale()
>
> ('ru_RU', 'koi8-r')

> > > What is your terminal encoding?
>
> koi8-r

Here's a transcript on my system answering the same questions:

Python 2.4.1 (#2, Oct  4 2006, 16:53:35)
[GCC 3.3.5 (Debian 1:3.3.5-8ubuntu2.1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> locale.getlocale()
(None, None)
>>> locale.setlocale(locale.LC_ALL, "")
'en_US.ISO-8859-15'
>>> locale.getlocale()
('en_US', 'iso-8859-15')

So Python knows about the locale. Note that neither of us uses UTF-8
as a system encoding.

>>> import sys
>>> sys.stdout.encoding
'ISO-8859-15'
>>> sys.stdin.encoding
'ISO-8859-15'

This tells us that Python could know things about writing Unicode
objects out in the appropriate encoding. I wasn't sure whether Python
was so smart about this, so let's see what happens...

>>> print unicode("æøå")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position
0: ordinal not in range(128)

Now, this isn't anything to do with the print operation: what's
happening here is that I'm explicitly making a Unicode object but
haven't said what the encoding of my byte string is. The default
encoding is 'ascii', as stated in the error message.
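As an aside, the decode/encode distinction in the diagram can be
reproduced without any terminal involvement at all. Here's a small
sketch using the same "æøå" bytes as above, written with explicit byte
literals so it also behaves the same on newer Python versions:

```python
# "æøå" as ISO-8859-15 bytes, the same bytes as in the transcript above.
latin_bytes = b"\xe6\xf8\xe5"

# decode: byte string --> Unicode object
text = latin_bytes.decode("iso-8859-15")
assert text == u"\u00e6\u00f8\u00e5"        # U+00E6, U+00F8, U+00E5

# encode: Unicode object --> byte string in another encoding
utf8_bytes = text.encode("utf-8")
assert utf8_bytes == b"\xc3\xa6\xc3\xb8\xc3\xa5"

# Decoding with a codec that can't represent these bytes fails with the
# same UnicodeDecodeError as the implicit ASCII decode shown above.
try:
    latin_bytes.decode("ascii")
except UnicodeDecodeError:
    pass                                    # 0xe6 is outside range(128)
```

Note how the failing decode reports byte 0xe6 in position 0, exactly as
in the traceback above.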
None of the characters provided belong to the ASCII character set. We
can check this by not printing anything out:

>>> s = unicode("æøå")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position
0: ordinal not in range(128)

So, let's try again and provide an encoding...

>>> print unicode("æøå", sys.stdin.encoding)
æøå

Here, we've mentioned the encoding, and even though the print
statement is acting on a Unicode object, it seems to be happy to work
out the resulting encoding.

>>> print u"æøå"
æøå

Here, we've skipped the explicit Unicode object construction by using
a Unicode literal, which works in this simple case.

Of course, if your system encoding (along with the terminal) isn't
capable of displaying every Unicode character, you'll experience
problems doing the above. Frequently, it's interesting to encode
things as UTF-8 and look at them in applications that are capable of
displaying the text. Thus, you'd do something like this:

  import unicodedata

(This gets an interesting function to help us look up characters in
the Unicode database.)

  somefile = open("somefile.txt", "wb")
  print >>somefile, unicodedata.lookup(
      "MONGOLIAN VOWEL SEPARATOR").encode("utf-8")

Or even this:

  import codecs

  somefile = codecs.open("somefile.txt", "wb", encoding="utf-8")
  print >>somefile, unicodedata.lookup("MONGOLIAN VOWEL SEPARATOR")

Here, we only specified the encoding once, when opening the file. The
file object accepts Unicode objects thereafter.

> > Usually, if I'm wanting to print Unicode objects, I explicitly
> > encode them into something I know the terminal will support. The
> > codecs module can help with writing Unicode to streams in
> > different encodings, too.
>
> As long as input/output is the only place for such need, it's ok to
> encode explicitly. But I also had problems, for example, with md5
> module, and I don't know the whole list of potential problematic
> places.
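Incidentally, going back to the codecs.open() example above for a
moment: the same declare-the-encoding-once pattern is also available
through io.open (present from Python 2.6 onwards). A sketch, where the
temporary file path is just an illustration:

```python
# Same idea as codecs.open(): declare the encoding once when opening
# the file, then write Unicode objects to it thereafter.
import io
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "somefile.txt")

with io.open(path, "w", encoding="utf-8") as somefile:
    somefile.write(u"\u00e6\u00f8\u00e5\n")   # "æøå" as a Unicode literal

# On disk, the text has become UTF-8-encoded bytes...
with io.open(path, "rb") as somefile:
    raw = somefile.read()
assert raw == b"\xc3\xa6\xc3\xb8\xc3\xa5\n"

# ...and decoding on input gives the Unicode text back.
assert raw.decode("utf-8") == u"\u00e6\u00f8\u00e5\n"
```

This is the decode-on-input, encode-on-output pattern with the
encode/decode steps hidden inside the file object.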
> Therefore, I'd better go with my brutal utf8ization.

It's best to decode (ie. construct Unicode objects) upon receiving
data as input, and to encode (ie. convert Unicode objects to byte
strings) upon producing output.

What may be the problem with the md5 module (you'd have to post
example code for us to help you out) is that it assumes byte strings
and doesn't work properly with Unicode objects. I can't say for sure,
because I'm usually presenting byte strings to md5 module functions on
the rare occasions I do anything with them. Note that one would
usually calculate MD5 checksums on raw data, so it doesn't necessarily
make much sense to present those functions with Unicode data, although
I can imagine a hypothetical (if perhaps unrealistic) need to checksum
Unicode text.

Paul
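P.S. The byte-strings-only point can be illustrated with hashlib,
which superseded the md5 module in Python 2.5. A sketch (the TypeError
behaviour shown is that of current Pythons; older versions fail
differently, by attempting an implicit ASCII encode):

```python
# MD5 operates on bytes: Unicode text must be encoded first, and the
# digest depends on which encoding you choose.
import hashlib

text = u"\u00e6\u00f8\u00e5"   # "æøå" as Unicode

# Hashing the Unicode object directly fails on modern Pythons: there
# is no single byte representation to checksum.
try:
    hashlib.md5(text)
except TypeError:
    pass

# Encode explicitly, then hash. Different encodings give different
# bytes, hence different checksums.
utf8_digest = hashlib.md5(text.encode("utf-8")).hexdigest()
latin_digest = hashlib.md5(text.encode("iso-8859-15")).hexdigest()
assert utf8_digest != latin_digest
```

So "brutal utf8ization" before hashing does work, but only as long as
every producer of the data agrees on that encoding.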
-- http://mail.python.org/mailman/listinfo/python-list