On 02:08 am, [EMAIL PROTECTED] wrote:
> James Y Knight wrote:
>> On Dec 4, 2008, at 6:39 PM, Martin v. Löwis wrote:
>>> I'm in favour of a different, fifth solution:
>>> 5) represent all environment variables in Unicode strings,
>>> including the ones that currently fail to decode.
>>> (then do the same to file names, then drop the byte-oriented
>>> file operations again)
>> FWIW, I still agree with Martin that that's the most reasonable
>> solution.
> FWIW2, I have much the same feeling.
And I still disagree, but I re-read the old thread and didn't see much
of a clear argument there, so at least I'm not re-treading old ground
:).
The only strategy that would allow us to encode all inputs as unicode
(including the invalid ones) is to abuse NUL to mean "ha ha, this isn't
actually a unicode string, it's something I couldn't decode". This is
nice because it allows the type of the returned value to be the same, so
a Python program that expects a unicode object will be able to
manipulate this object (as long as it doesn't split it up too close to a
NUL).
It seems to me that this convenient, but clever-clever type distinction
will inevitably be a bug magnet. For the most basic example, see the
caveat above. But more realistically - not too much code splits
filenames on anything but "." or os.sep, after all - if you pass this to
an extension module which then wants to invoke a C library function
which passes the file name to open() and friends, what is the right
thing for the extension module to do? There would need to be a new API
which could get the "right" bytes out of a unicode string which
potentially has NULs in it. This can't just be an encoding, either,
because you might need to get the Shift-JIS bytes (some foreign system's
encoding) for some got-NULs-in-it filename even though your locale says
the encoding is UTF-8. And what if those bytes happen to be valid
Shift-JIS? Decoding bytes makes a lot more sense to me than transcoding
strings.
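To make the "decoding bytes" point concrete, here is a minimal sketch of an application making its own educated guess about raw filename bytes. The helper name decode_filename and the candidate encoding list are assumptions for illustration only; nothing like them is proposed in this thread.

```python
# Raw filename bytes as a foreign system might have written them.
# The sample is the Shift-JIS encoding of three Japanese characters plus ".txt".
raw_name = "\u65e5\u672c\u8a9e.txt".encode("shift_jis")


def decode_filename(raw, encodings=("utf-8", "shift_jis")):
    """Hypothetical helper: try each candidate encoding in order.

    Returns (text, encoding) for the first encoding that decodes the
    bytes cleanly, or (None, None) if none of them does.
    """
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    return None, None


# The bytes are not valid UTF-8, so the helper falls through to Shift-JIS.
text, used = decode_filename(raw_name)
```

The application still holds raw_name untouched, so a wrong guess costs nothing: the original bytes can be handed back to the operating system as they were.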
Filenames and environment variables would all need to be encoded or
decoded according to this magic encoding. And what happens if you get
some garbage data from elsewhere and pass it to a function that
*generates* a filename? Now, you get a pleasant error message,
"TypeError: file() argument 1 must be (encoded string without NULL
bytes), not str". In the future, I can only assume (if you're lucky)
that you'll get some weird thing out of the guts of an encoding module;
or, more likely, some crazy mojibake filename containing PUA code points
or whatever will silently get opened. You can make this less likely
(and harder to debug in the odd cases where it does happen) by making
the encoding more clever, but eventually your luck will run out, most
likely on the computer of somebody who doesn't speak English well enough
to report the problem clearly.
The scenario gets progressively more nightmarish as you start putting
more libraries into the mix. You pass some environment variable into
some library which knows all about unicode and happily handles it
correctly, but a second library which doesn't know about this proposed
tricky NUL convention gets the unicode filename and transcodes it
literally, causing an error return from open(). This puts the apparent
error very far away from the responsible code.
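A rough sketch of that failure mode, assuming a NUL-marked string shaped the way I imagine the convention (the marker layout and the function naive_library_open are made up for illustration, not part of any real proposal or library):

```python
def naive_library_open(name):
    """Hypothetical second library that knows nothing about the convention.

    It encodes the text literally as UTF-8 and hands the bytes to open().
    """
    return open(name.encode("utf-8"), "rb")


# Assumed shape of a "could not decode" filename: a NUL embedded in the text.
marked_name = "report\x00\xe9.txt"

try:
    naive_library_open(marked_name)
    error = None
except (ValueError, TypeError, OSError) as exc:
    # The failure surfaces here, inside the library, far from the code
    # that produced the marked string in the first place.
    error = exc
```

Whatever exception type the interpreter picks, the traceback points at the library's open() call rather than at the code that introduced the NUL.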
Ultimately it makes sense to expose the underlying bytes as bytes
without forcing everyone to pretend that they make sense as anything but
bytes, and allow different applications to make appropriately educated
guesses about their character format. In any case, programmers who
don't know about these kinds of issues are going to make mistakes in
handling invalid filenames on UNIXy systems, and some users won't be
able to open some files. If there is an easy and straightforward way to
get the bytes out, it's more likely that programmers who know what they
are doing will be able to get the correct behavior.
Of course, the NUL-encoding trick will make it *possible* to do the
right thing, but our hypothetically savvy programmer now needs to learn
about the bytes/unicode distinction between Windows and
Mac+Linux+everything else, and Python's special convention for
invalid data, and how to mix it with encoding/decoding/transcoding,
rather than just Python's distinct API for the distinct types that may
represent a filename. I think this is significantly harder to document
than just having two parallel APIs (environ, environb, open(str),
open(bytes), listdir(str), listdir(bytes)) to reflect the very subtle,
but nevertheless very real, distinction between the Windows and UNIX
worlds.
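For illustration, this is roughly what the two parallel filename APIs look like from the caller's side. It is a sketch assuming a Python 3 interpreter in which os.listdir() and open() accept either str or bytes paths; os.fsencode() is used only to get a bytes version of the temporary directory path and is not something named in this thread.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Create one ordinary file to list.
    with open(os.path.join(tmp, "example.txt"), "w") as f:
        f.write("hello")

    # str in, str out: the "it usually works" text view.
    str_names = os.listdir(tmp)

    # bytes in, bytes out: the raw view, with no decoding step at all.
    tmp_bytes = os.fsencode(tmp)
    bytes_names = os.listdir(tmp_bytes)

    # The bytes name can be passed straight back to open().
    with open(os.path.join(tmp_bytes, bytes_names[0]), "rb") as f:
        data = f.read()
```

The type of the argument selects the API, so a program that cares about undecodable names only has to stay in bytes from listdir() through open().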
This distinct API can still provide the same illusion of "it usually
works" portability that the encoding convention can: for Windows,
environb can be the representation of the environment in a particular
encoding; for UNIX, environ(u) can be all of the variables which
correctly decode. And so on for each other API.
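On the UNIX side, the relationship I have in mind can be sketched like this. The function decodable_environ and the sample mapping are hypothetical, and UTF-8 stands in for whatever the locale encoding would actually be.

```python
def decodable_environ(environb, encoding="utf-8"):
    """Hypothetical: build the text view from a bytes environment.

    Variables whose name or value fails to decode are simply left out
    of the text view; they remain available in the bytes mapping.
    """
    result = {}
    for key, value in environb.items():
        try:
            result[key.decode(encoding)] = value.decode(encoding)
        except UnicodeDecodeError:
            continue
    return result


# Illustrative bytes environment with one undecodable value.
sample_environb = {
    b"HOME": b"/home/user",
    b"BROKEN": b"\xff\xfe",
    b"LANG": b"en_US.UTF-8",
}

text_environ = decodable_environ(sample_environb)
```

Code that only wants text never sees BROKEN, and code that needs it reads sample_environb[b"BROKEN"] directly.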
At least this time I think I've encapsulated pretty much my entire
argument here, so if you don't buy it, we can probably just agree to
disagree :).
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev