Marcin 'Qrczak' Kowalczyk wrote:
> I've implemented a hack which allows simple programs to "just work" in
> case of UTF-8. It's a modified encoder/decoder which escapes malformed
> UTF-8 sequences with '\0' bytes, and thus allows arbitrary byte
> sequences to round-trip UTF-8 decoding and encoding. It's not used by
> default and it's never used when "UTF-8" is specified explicitly,
> because it's not the true UTF-8, but I have an environment variable
> which says "if the locale is UTF-8, use the modified UTF-8 as the
> default encoding".

Actually, I think there is a "better" (i.e. more Unicode-like) way: use
the private-use area. For "wide" Unicode, choose some "high" characters,
e.g. from plane 16 (say, U+1020xx). For "narrow" Unicode, choose some
from the "middle" (say, U+F4xx). There is a slight chance of ambiguity
here if the actual input also contains such PUA characters; if you worry
about this, you could escape those.

For Py3k, I would like to propose a standard "binary" codec, which is an
ASCII superset and decodes bytes 00..7F to ASCII, and bytes 80..FF to
U+EFxx. This would allow bytes to round-trip through text.

Regards,
Martin
_______________________________________________
Python-3000 mailing list
Python-3000@python.org
http://mail.python.org/mailman/listinfo/python-3000
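
A minimal sketch of the "binary" codec proposed above, not an actual
implementation. It assumes that "U+EFxx" means byte 0xNN maps to code
point U+EFNN; the names PUA_BASE, decode_binary and encode_binary are
made up for illustration, and the escaping of PUA characters that already
occur in real input is left out.

```python
# Assumption: byte 0xNN in 80..FF maps to U+EFNN (one reading of "U+EFxx").
PUA_BASE = 0xEF00


def decode_binary(data: bytes) -> str:
    """Decode bytes 00..7F as ASCII and bytes 80..FF as U+EF80..U+EFFF."""
    return "".join(chr(b) if b < 0x80 else chr(PUA_BASE + b) for b in data)


def encode_binary(text: str) -> bytes:
    """Inverse of decode_binary; rejects characters outside the mapping."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            # Plain ASCII passes through unchanged.
            out.append(cp)
        elif PUA_BASE + 0x80 <= cp <= PUA_BASE + 0xFF:
            # Private-use stand-in: recover the original high byte.
            out.append(cp - PUA_BASE)
        else:
            raise ValueError(f"U+{cp:04X} is not representable in this codec")
    return bytes(out)


if __name__ == "__main__":
    raw = bytes(range(256))
    # Every possible byte value survives a decode/encode round trip.
    assert encode_binary(decode_binary(raw)) == raw
```

Because the mapping is one-to-one on all 256 byte values, any byte string
survives the round trip; the ambiguity mentioned above only arises when
the text already contains characters in U+EF80..U+EFFF.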