Re: [Python-Dev] Python-3.0, unicode, and os.environ

glyph Thu, 04 Dec 2008 19:47:55 -0800

On 02:08 am, [EMAIL PROTECTED] wrote:

James Y Knight wrote:

On Dec 4, 2008, at 6:39 PM, Martin v. Löwis wrote:

I'm in favour of a different, fifth solution:


5) represent all environment variables in Unicode strings,
  including the ones that currently fail to decode.
  (then do the same to file names, then drop the byte-oriented
   file operations again)

FWIW, I still agree with Martin that that's the most reasonable solution.
FWIW2, I have much the same feeling.

And I still disagree, but I re-read the old thread and didn't see much of a clear argument there, so at least I'm not re-treading old ground :).

The only strategy that would allow us to encode all inputs as unicode (including the invalid ones) is to abuse NUL to mean "ha ha, this isn't actually a unicode string, it's something I couldn't decode". This is nice because it allows the type of the returned value to be the same, so a Python program that expects a unicode object will be able to manipulate this object (as long as it doesn't split it up too close to a NUL).

It seems to me that this convenient, but clever-clever type distinction will inevitably be a bug magnet. For the most basic example, see the caveat above. But more realistically - not too much code splits filenames on anything but "." or os.sep, after all - if you pass this to an extension module which then wants to invoke a C library function which passes the file name to open() and friends, what is the right thing for the extension module to do? There would need to be a new API which could get the "right" bytes out of a unicode string which potentially has NULs in it. This can't just be an encoding, either, because you might need to get the Shift-JIS bytes (some foreign system's encoding) for some got-NULs-in-it filename even though your locale says the encoding is UTF-8. And what if those bytes happen to be valid Shift-JIS? Decoding bytes makes a lot more sense to me than transcoding strings.
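
To make the Shift-JIS point concrete, here is a minimal sketch (not from the original thread). The sample filename and the helper try_decode are hypothetical, chosen only for illustration. It shows one byte string that is invalid under a UTF-8 locale yet decodes cleanly once the application knows the bytes came from a Shift-JIS system:

```python
# Hypothetical sample: a filename as raw bytes from a Shift-JIS system.
# The text is the three characters U+65E5 U+672C U+8A9E plus ".txt".
raw_name = "\u65e5\u672c\u8a9e.txt".encode("shift_jis")


def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# The locale's guess (UTF-8) fails on these bytes...
as_utf8 = try_decode(raw_name, "utf-8")
# ...but an application that holds the original bytes can try another guess.
as_sjis = try_decode(raw_name, "shift_jis")

print(as_utf8)
print(as_sjis)
```

The second decode is only possible because the original bytes are still in hand; nothing has to be recovered from an already-decoded string.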

Filenames and environment variables would all need to be encoded or decoded according to this magic encoding. And what happens if you get some garbage data from elsewhere and pass it to a function that *generates* a filename? Now, you get a pleasant error message, "TypeError: file() argument 1 must be (encoded string without NULL bytes), not str". In the future, I can only assume (if you're lucky) that you'll get some weird thing out of the guts of an encoding module; or, more likely, some crazy mojibake filename containing PUA code points or whatever will silently get opened. You can make this less likely (and harder to debug in the odd cases where it does happen) by making the encoding more clever, but eventually your luck will run out: most likely on somebody's computer who doesn't speak English well enough to report the problem clearly.

The scenario gets progressively more nightmarish as you start putting more libraries into the mix. You pass some environment variable into some library which knows all about unicode and happily handles it correctly, but a second library which doesn't know about this proposed tricky NUL convention gets the unicode filename and transcodes it literally, causing an error return from open(). This puts the apparent error very far away from the responsible code.

Ultimately it makes sense to expose the underlying bytes as bytes without forcing everyone to pretend that they make sense as anything but bytes, and allow different applications to make appropriately educated guesses about their character format. In any case, programmers who don't know about these kinds of issues are going to make mistakes in handling invalid filenames on UNIXy systems, and some users won't be able to open some files. If there is an easy and straightforward way to get the bytes out, it's more likely that programmers who know what they are doing will be able to get the correct behavior.

Of course, the NUL-encoding trick will make it *possible* to do the right thing, but our hypothetically savvy programmer now needs to learn about the bytes/unicode distinction between windows/mac+linux+everythingelse, and Python's special convention for invalid data, and how to mix it with encoding/decoding/transcoding, rather than just Python's distinct API for the distinct types that may represent a filename. I think this is significantly harder to document than just having two parallel APIs (environ, environb, open(str), open(bytes), listdir(str), listdir(bytes)) to reflect the very subtle, but nevertheless very real, distinction between the Windows and UNIX worlds.
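
As a rough illustration of the two parallel APIs, here is a small sketch. It assumes the str and bytes forms of os.listdir() and open() as they behave in current Python 3 releases, which is an assumption relative to this thread; the file name data.txt is made up for the example. The type you pass in is the type you get back:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Create one file so there is something to list.
    with open(os.path.join(tmp, "data.txt"), "w") as f:
        f.write("hello")

    # str path in, str names out.
    names_text = os.listdir(tmp)

    # bytes path in, bytes names out: no decoding is attempted.
    tmp_bytes = os.fsencode(tmp)
    names_bytes = os.listdir(tmp_bytes)

    # open() accepts the bytes form of the same path.
    with open(os.path.join(tmp_bytes, names_bytes[0]), "rb") as f:
        content = f.read()

# A bytes view of the environment is only offered where the platform's
# native environment is bytes; os.supports_bytes_environ reports that.
env_bytes = os.environb if os.supports_bytes_environ else None

print(names_text, names_bytes, content)
```

A caller who never needs the raw bytes can stay entirely on the str side and ignore the second API.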

This distinct API can still provide the same illusion of "it usually works" portability that the encoding convention can: for Windows, environb can be the representation of the environment in a particular encoding; for UNIX, environ(u) can be all of the variables which correctly decode. And so on for each other API.
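
A minimal sketch of the UNIX half of that idea follows. The helper name decodable_view, the sample data, and the default encoding are all hypothetical, and this is not a description of any actual implementation. The unicode view simply leaves out whatever does not decode, while the bytes mapping keeps everything:

```python
def decodable_view(env_bytes, encoding="utf-8"):
    """Build a str-to-str view holding only the entries that decode cleanly."""
    view = {}
    for key, value in env_bytes.items():
        try:
            view[key.decode(encoding)] = value.decode(encoding)
        except UnicodeDecodeError:
            # Undecodable entries stay reachable through the bytes mapping only.
            continue
    return view


# Hypothetical bytes environment with one entry that is not valid UTF-8.
sample = {b"HOME": b"/home/user", b"BROKEN": b"\xff\xfe"}
text_env = decodable_view(sample)

print(text_env)
print(sorted(sample))
```

Code that only reads text_env sees the usual "it usually works" behaviour, and code that cares about the broken entry can still find it in the bytes mapping.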

At least this time I think I've encapsulated pretty much my entire argument here, so if you don't buy it, we can probably just agree to disagree :).

