On Mon, 29 Jun 2009 13:05:51 +0100, Paul Moore wrote:

>> As for a bytes version of sys.argv and os.environ, you're welcome to
>> propose a patch (this would be a separate issue on the aforementioned
>> issue tracker).
>
> But please be aware that such a proposal would have to consider:
>
> 1. That on Windows, the native form is the character version, and the
> bytes version would have to address all the same sorts of encoding
> issues that the OP is complaining about in the character versions. [1]

A bytes version doesn't make sense on Windows (at least, not on the
NT-based versions, and the DOS-based branch isn't worth bothering about,
IMHO).

Also, Windows *needs* to deal with characters due to the fact that
filenames, environment variables, etc are case-insensitive.

> 2. That the proposal address the question of how to write portable,
> robust, code (given that choosing argv vs argv_bytes based on
> sys.platform is unlikely to count as a good option...)

There is a tension here between robustness and portability. In my
situation, robustness means getting the "unadulterated" data. I can always
adulterate it myself if I need to.

> 3. Why defining your own argv_bytes as argv_bytes =
> [a.encode("iso-8859-1", "surrogateescape") for a in sys.argv] is
> insufficient (excluding issues with bugs, which will be fixed
> regardless) for the occasional cases where it's needed.

Other than the bug, it appears to be sufficient. I don't need to support a
locale where nl_langinfo(CODESET) is ISO-2022 (I *do* need to support
lossless round-trip of ISO-2022 filenames, possibly stored in argv and
maybe even in environ, but that's a different matter; the code only really
needs to run with LANG=C).

> [1] And my understanding, from the PEP, is that even on POSIX, the
> argv and environ data is intended to be character data, even though
> the native C APIs expose a byte-oriented interface. So conceptually,
> character format is "correct" on POSIX as well... (But I don't write
> code for POSIX systems, so I'll leave it to the POSIX users to debate
> this point further).

Even if it's "intended" to be character data, it isn't *required* to be.
In particular, it's not required to be in the locale's encoding.

A common example of what I need to handle is:

    find /www ... -print0 | xargs -0 myscript

where the filenames can be in a wide variety of different encodings
(sometimes even within a single directory).
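
For reference, the round trip discussed under point 3 can be sketched as below. This is only an illustration, assuming arguments were decoded as ASCII with the "surrogateescape" error handler (the LANG=C case); the helper names decode_arg and encode_arg and the sample byte strings are made up for the example and do not come from the thread or the PEP.

```python
def decode_arg(raw: bytes, encoding: str = "ascii") -> str:
    """Decode bytes the way argv is assumed to be decoded under LANG=C.

    Bytes that are not valid in `encoding` become lone surrogates
    in the range U+DC80..U+DCFF instead of raising an error.
    """
    return raw.decode(encoding, "surrogateescape")


def encode_arg(arg: str, encoding: str = "ascii") -> bytes:
    """Recover the original bytes from a surrogateescape-decoded string."""
    return arg.encode(encoding, "surrogateescape")


# Hypothetical filenames in mixed encodings, as a find | xargs pipeline
# might deliver them.
samples = [
    b"plain.txt",               # pure ASCII
    b"caf\xe9.txt",             # ISO-8859-1 byte 0xE9
    b"\xe3\x81\x93.txt",        # UTF-8 multi-byte sequence
    b"\x1b$B$3$s\x1b(B.html",   # ISO-2022-JP style escapes (all 7-bit)
]

for raw in samples:
    text = decode_arg(raw)
    # The round trip is lossless for every sample.
    assert encode_arg(text) == raw
    # The expression quoted in point 3 gives the same bytes here,
    # because the decode step was ASCII.
    assert text.encode("iso-8859-1", "surrogateescape") == raw
```

Both encodings give the same result only because the decode step used ASCII, so every byte above 0x7F is already an escaped surrogate. If argv had been decoded with some other locale encoding, the encode step would need to use that same encoding to get the original bytes back.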