On Mon, Sep 29, 2008 at 5:12 AM, Antoine Pitrou <[EMAIL PROTECTED]> wrote:
> Adam Olsen <rhamph <at> gmail.com> writes:
>>
>> UTF-8b doesn't work as intended. It produces an invalid unicode
>> object (garbage surrogates) that cannot be used with external APIs or
>> libraries that require unicode.
>
> At least it works with all Python operations supported by the unicode type
> (methods, concatenation, etc.) without any bad surprise. That feeding it to
> e.g. PyGTK may give bogus results is another problem.
>
>> If you don't need unicode then your
>> code should state so explicitly, and 8859-1 is ideal there.
>
> But then you can say bye-bye to proper representation (e.g. using print()) of
> even valid filenames.

You can't print UTF-8b either. Printing requires converting the
unicode object to UTF-8 (or whatever output encoding), and the unicode
object isn't valid, so you'd get an exception[1].

The same applies to all other hacks (such as PUA scalars). Either the
scalar value already has an expected behaviour, in which case decoding
is lossy and reencoding replaces the correct behaviour, or it's not a
valid scalar value, which then can't be used with any external API
that requires conformant unicode.

There's no solution except to not decode, and 8859-1 is the way to do that.

[1] Python's UTF codecs are broken in a couple respects, including the
fact that python itself uses CESU-8(!). See
http://bugs.python.org/issue3297 and http://bugs.python.org/issue3672

--
Adam Olsen, aka Rhamphoryncus
_______________________________________________
Python-3000 mailing list
Python-3000@python.org
http://mail.python.org/mailman/listinfo/python-3000
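
For anyone who wants to poke at the argument above, here is a rough sketch. It assumes a current Python 3, where a UTF-8b style decode is available as the "surrogateescape" error handler; that handler is my stand-in, not what the 2008 interpreters in this thread provide, and the filename bytes and the helper name are made up for illustration.

```python
# Sketch only: "surrogateescape" in current Python 3 stands in for UTF-8b.
# The 2008 interpreters discussed in this thread may behave differently.
raw = b"caf\xe9.txt"  # hypothetical filename bytes, not valid UTF-8

# UTF-8b style decode: the undecodable byte becomes a lone surrogate.
utf8b = raw.decode("utf-8", "surrogateescape")


def strict_utf8_fails(text):
    """Return True if a strict UTF-8 encode of text raises (hypothetical helper)."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return True
    return False


# ISO 8859-1 maps every byte to exactly one code point and back again,
# so the resulting string holds only valid scalar values.
latin1 = raw.decode("iso-8859-1")
roundtrip = latin1.encode("iso-8859-1")

print(strict_utf8_fails(utf8b))   # lone surrogate: strict encode is refused
print(strict_utf8_fails(latin1))  # ordinary code points: encodes fine
print(roundtrip == raw)           # original bytes recovered
```

The point is only that the surrogate-carrying string cannot pass through a strict encoder, while the 8859-1 string can and still gives the original bytes back on request.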