2008/9/30 James Y Knight <[EMAIL PROTECTED]>:
> >>> u'\udc90\udc90'.encode('utf-8')
> '\xed\xb2\x90\xed\xb2\x90'
This is wrong: UTF-8 (like the other UTF-x forms) encodes Unicode scalar
values, not arbitrary Unicode code points, so surrogates as such are
unencodable. '\xed\xb2\x90' is invalid UTF-8.
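
To make the distinction concrete, here is a minimal sketch, assuming
Python 3, whose strict 'utf-8' codec refuses surrogates in both
directions (unlike the Python 2 session quoted above). The helper
names are hypothetical, chosen only for illustration.

```python
def is_scalar_value(cp: int) -> bool:
    """True if cp is a Unicode scalar value: any code point except a surrogate."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


def can_encode_utf8(text: str) -> bool:
    """True if the strict UTF-8 codec accepts every character in text."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


def is_valid_utf8(data: bytes) -> bool:
    """True if data is well-formed UTF-8 under the strict codec."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# U+DC90 is a low surrogate, so it is a code point but not a scalar value.
lone_surrogate_ok = can_encode_utf8("\udc90\udc90")
# The byte sequence from the quoted session is rejected on decoding too.
quoted_bytes_ok = is_valid_utf8(b"\xed\xb2\x90")
```

Both flags end up False here, while ordinary text such as "caf\xe9"
encodes and decodes without complaint.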
I've experimentally implemented (not for Python) a different escaping
scheme with a goal similar to that of UTF-8b: undecodable bytes are
prefixed with U+0000 instead of being converted to unpaired surrogates,
and a '\x00' byte decodes as U+0000 U+0000, so that the mapping stays
reversible.
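
A rough sketch of how such a scheme could look in Python 3 follows. It
is not the implementation mentioned above: the handler name
"nulescape", the function names, and the choice to represent an
undecodable byte 0xNN as U+0000 followed by U+00NN are all assumptions
made for illustration.

```python
import codecs


def _nul_escape(exc):
    """Decode error handler: emit U+0000 plus the offending byte as U+00NN."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start]
    # Resume right after the single bad byte so later bytes get their own chance.
    return "\x00" + chr(bad), exc.start + 1


# "nulescape" is a hypothetical handler name, not part of the standard library.
codecs.register_error("nulescape", _nul_escape)


def decode_nul_escaped(data: bytes) -> str:
    """Decode bytes as UTF-8, escaping undecodable bytes and doubling NULs."""
    parts = data.split(b"\x00")
    # Each real 0x00 byte becomes the pair U+0000 U+0000.
    return "\x00\x00".join(p.decode("utf-8", "nulescape") for p in parts)


def encode_nul_escaped(text: str) -> bytes:
    """Inverse of decode_nul_escaped; rejects strings it could not have produced."""
    out = bytearray()
    i = 0
    while i < len(text):
        ch = text[i]
        if ch != "\x00":
            out += ch.encode("utf-8")  # strict: lone surrogates still raise
            i += 1
            continue
        if i + 1 >= len(text):
            raise ValueError("dangling U+0000 escape at end of string")
        payload = ord(text[i + 1])
        if payload != 0 and not (0x80 <= payload <= 0xFF):
            raise ValueError("invalid escape payload U+%04X" % payload)
        out.append(payload)  # 0x00 for a real NUL, else the raw bad byte
        i += 2
    return bytes(out)
```

For example, decode_nul_escaped(b"a\x90\x00") gives 'a\x00\x90\x00\x00',
and encode_nul_escaped() turns that back into the original bytes.
Because every U+0000 in the decoded string starts a two-character
escape, the encoder can read the string left to right without
ambiguity.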
GLib provides some functions to convert filenames for display, in a
way which is not necessarily reversible (the result may include hex
escapes in ASCII).
--
Marcin Kowalczyk
[EMAIL PROTECTED]
http://qrnik.knm.org.pl/~qrczak/
_______________________________________________
Python-3000 mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-3000