2008/9/30 James Y Knight <[EMAIL PROTECTED]>:

> >>> u'\udc90\udc90'.encode('utf-8')
> '\xed\xb2\x90\xed\xb2\x90'

This is wrong: UTF-8, like the other UTF-x encoding forms, encodes Unicode
scalar values, not arbitrary Unicode code points, so surrogates as such are
unencodable. '\xed\xb2\x90' is invalid UTF-8.

I've experimentally implemented (not for Python) a different escaping scheme
with a goal similar to that of UTF-8b: undecodable bytes are prefixed with
U+0000 instead of being converted to unpaired surrogates, and '\x00' decodes
as U+0000 U+0000.

GLib provides some functions to convert filenames for display, in a way that
is not necessarily reversible (the result may include hex escapes in ASCII).

-- 
Marcin Kowalczyk
[EMAIL PROTECTED]
http://qrnik.knm.org.pl/~qrczak/
_______________________________________________
Python-3000 mailing list
Python-3000@python.org
http://mail.python.org/mailman/listinfo/python-3000
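
The U+0000-prefix scheme described in the reply above could look roughly like this in Python 3. This is only a sketch, not the implementation mentioned in the post (which was not written for Python). The function names are hypothetical, and one detail is an assumption: the post says only that undecodable bytes are "prefixed with U+0000", so here an undecodable byte is represented as U+0000 followed by the code point with the same numeric value as the byte.

```python
def _decode_one(data: bytes, i: int):
    """Return (char, length) for a valid UTF-8 sequence starting at i, else None."""
    for length in (1, 2, 3, 4):
        chunk = data[i:i + length]
        try:
            # The strict codec rejects encoded surrogates and overlong forms.
            return chunk.decode("utf-8"), len(chunk)
        except UnicodeDecodeError:
            continue
    return None


def decode_escaped(data: bytes) -> str:
    """Decode bytes to str, escaping NUL and undecodable bytes with U+0000."""
    out = []
    i = 0
    while i < len(data):
        if data[i] == 0:
            # A literal NUL byte becomes U+0000 U+0000.
            out.append("\x00\x00")
            i += 1
            continue
        decoded = _decode_one(data, i)
        if decoded is None:
            # Assumption: U+0000 followed by the code point equal to the byte.
            out.append("\x00" + chr(data[i]))
            i += 1
        else:
            char, length = decoded
            out.append(char)
            i += length
    return "".join(out)


def encode_escaped(text: str) -> bytes:
    """Inverse of decode_escaped: turn U+0000 escapes back into raw bytes."""
    out = bytearray()
    i = 0
    while i < len(text):
        if text[i] == "\x00":
            if i + 1 >= len(text):
                raise ValueError("dangling U+0000 escape at end of string")
            code = ord(text[i + 1])
            if code != 0 and not 0x80 <= code <= 0xFF:
                raise ValueError("invalid escape after U+0000")
            out.append(code)
            i += 2
        else:
            # Raises UnicodeEncodeError for surrogates, which UTF-8 cannot encode.
            out += text[i].encode("utf-8")
            i += 1
    return bytes(out)


raw = b"ab\x90\x00\xc3\xa9"        # stray 0x90, a NUL byte, then a valid "é"
text = decode_escaped(raw)
assert encode_escaped(text) == raw  # the round trip is lossless
```

Unlike the unpaired-surrogate approach, the decoded string contains only Unicode scalar values, at the cost of U+0000 no longer standing for itself.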