On Wed, 18 Jan 2012 09:05:42 +0100, Peter Otten wrote:

>> Python has a special errorhandler, "surrogateescape" to deal with
>> bytes that are not valid UTF-8.

On Wed, 18 Jan 2012 11:16:27 +0100, Olive wrote:

> But is it safe even if the locale is not UTF-8?

Yes. Peter's reference to UTF-8 is misleading. The surrogateescape mechanism
is used to represent any byte which cannot be decoded according to the
locale's encoding. E.g. in the "C" locale, any byte >= 128 will be decoded
to a lone surrogate.

On Wed, 18 Jan 2012 09:05:42 +0100, Peter Otten wrote:

> It is still possible to get the original bytes:
>
> python3 -c'import sys; print(sys.argv[1].encode("utf-8", "surrogateescape"))'

Except, it isn't, because the Python devs can't make up their minds which
encoding sys.argv uses, or even document it.

AFAICT:

On Windows, there never was a bytes version of sys.argv to start with (the
OS supplies the command line using wide strings).

On Mac OS X, the command line is always decoded using UTF-8.

On Unix, the command line is decoded using mbstowcs(). There isn't a Python
function to query which encoding this used (if there even _is_ a
corresponding Python encoding).

Except on Windows (where OS APIs take wide string parameters), if a library
function needs to pass a Unicode string to an API function, it will normally
encode it using sys.getfilesystemencoding(), which isn't guaranteed to be
the encoding which was used to fabricate sys.argv in the first place.

In short: if you need to write "system" scripts on Unix, and you need them
to work reliably, you need to stick with Python 2.x.
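
To make the round-trip problem concrete, here is a minimal sketch. It does
not touch sys.argv at all; it only simulates decoding the same raw bytes
under different encodings. The helper names decode_like_locale() and
recover_bytes() are made up for this illustration and are not part of any
Python API.

```python
def decode_like_locale(raw: bytes, encoding: str) -> str:
    """Decode bytes with surrogateescape (hypothetical helper).

    Bytes that are invalid in `encoding` become lone surrogates in the
    range U+DC80..U+DCFF instead of raising UnicodeDecodeError.
    """
    return raw.decode(encoding, "surrogateescape")


def recover_bytes(text: str, encoding: str) -> bytes:
    """Encode text back to bytes, turning escaped surrogates into bytes."""
    return text.encode(encoding, "surrogateescape")


raw = b"caf\xe9"  # 0xE9 is not valid ASCII, and not valid UTF-8 in this position

# Simulate the "C" locale (ASCII): the byte >= 128 becomes a lone surrogate.
as_c = decode_like_locale(raw, "ascii")
assert as_c == "caf\udce9"

# Encoding with the same codec restores the original bytes.
assert recover_bytes(as_c, "ascii") == raw

# Encoding with UTF-8 also works here, but only because everything that
# was not escaped happens to be plain ASCII.
assert recover_bytes(as_c, "utf-8") == raw

# Simulate a Latin-1 locale: 0xE9 is a valid character, so nothing is escaped.
as_latin1 = decode_like_locale(raw, "latin-1")
assert as_latin1 == "caf\u00e9"

# Encoding with UTF-8 now silently produces different bytes.
assert recover_bytes(as_latin1, "utf-8") == b"caf\xc3\xa9"
assert recover_bytes(as_latin1, "utf-8") != raw
```

The last two assertions are the failure mode described above: encoding
with "utf-8" and "surrogateescape" gives back the original bytes only if
the string was decoded with a compatible encoding in the first place. On
Unix, os.fsencode() does the same thing using sys.getfilesystemencoding(),
so it carries the same caveat.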