2008/9/30 James Y Knight <[EMAIL PROTECTED]>:

>>>> u'\udc90\udc90'.encode('utf-8')
> '\xed\xb2\x90\xed\xb2\x90'

This is wrong: UTF-8 (like the other UTF-x encodings) encodes Unicode
scalar values, not arbitrary Unicode code points, so surrogate code
points (U+D800..U+DFFF) as such are unencodable. '\xed\xb2\x90' is
invalid UTF-8.
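
For comparison, here is a small check of how a strict UTF-8 codec
treats this case. It is only a sketch and assumes a current Python 3
interpreter, not the 2.x one from the quoted session; the helper names
are made up for illustration.

```python
# Assumes Python 3, whose default (strict) UTF-8 codec rejects surrogates.
# Both helper names are hypothetical, chosen for this example only.

def utf8_accepts_lone_surrogate() -> bool:
    """Return True if the codec agrees to encode an unpaired surrogate."""
    try:
        '\udc90'.encode('utf-8')
    except UnicodeEncodeError:
        return False
    return True


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes under the strict UTF-8 codec."""
    try:
        data.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True


if __name__ == '__main__':
    print(utf8_accepts_lone_surrogate())
    print(is_valid_utf8(b'\xed\xb2\x90'))
```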

I've experimentally implemented (not for Python) a different escaping
scheme with a goal similar to that of UTF-8b: each undecodable byte is
prefixed with U+0000 instead of being converted to an unpaired
surrogate, and the byte '\x00' itself decodes as U+0000 U+0000.
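
A rough Python 3 sketch of such a scheme follows. It is not the
implementation mentioned above. In particular, the description does not
say what follows the U+0000 prefix, so this sketch assumes the escaped
byte is represented by the code point with the same numeric value
(U+0080..U+00FF). The function names are hypothetical.

```python
# Sketch of a NUL-prefix escaping scheme (assumptions noted in comments).

def _seq_len(lead: int) -> int:
    """Length of a UTF-8 sequence with this lead byte, or 0 if invalid."""
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0


def decode_nul_escaped(data: bytes) -> str:
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b == 0:
            out.append('\x00\x00')      # a real NUL byte is doubled
            i += 1
        elif b < 0x80:
            out.append(chr(b))          # plain ASCII
            i += 1
        else:
            n = _seq_len(b)
            try:
                text = data[i:i + n].decode('utf-8') if n else ''
            except UnicodeDecodeError:
                text = ''
            if text:
                out.append(text)        # well-formed multi-byte sequence
                i += n
            else:
                # Assumption: the escaped byte becomes U+0000 followed by
                # the code point with the same value as the byte.
                out.append('\x00' + chr(b))
                i += 1
    return ''.join(out)


def encode_nul_escaped(text: str) -> bytes:
    out = bytearray()
    i = 0
    while i < len(text):
        if text[i] != '\x00':
            out += text[i].encode('utf-8')
            i += 1
            continue
        if i + 1 >= len(text):
            raise ValueError('dangling U+0000 escape')
        nxt = ord(text[i + 1])
        if nxt != 0 and not 0x80 <= nxt <= 0xFF:
            raise ValueError('invalid character after U+0000 escape')
        out.append(nxt)                 # NUL itself or the raw escaped byte
        i += 2
    return bytes(out)
```

With these two functions any byte string should round-trip, e.g.
encode_nul_escaped(decode_nul_escaped(raw)) == raw, and the decoded
text never contains surrogates. The price is that a decoded string may
contain U+0000, which many C APIs cannot carry.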

GLib provides some functions to convert filenames for display, in a
way that is not necessarily reversible (the result may include some
hex escapes in ASCII).

-- 
Marcin Kowalczyk
[EMAIL PROTECTED]
http://qrnik.knm.org.pl/~qrczak/