On Oct 1, 2008, at 3:03 PM, Glenn Linderman wrote:
On approximately 10/1/2008 11:30 AM, came the following characters
from the keyboard of James Y Knight:
BTW, Windows will cheerfully let you create and access files with
"garbage surrogates" in their names.
Try it yourself:
import os
open(u"\ud8fd", 'w').close()
os.listdir(u'.')
But Windows doesn't have the problem of non-Unicode sequences
needing to be translated to something else in the first place. So
this is mostly irrelevant to the problem at hand.
Well... either you consider lone surrogates to be valid Unicode
sequences, or else Windows *does* have the problem of non-Unicode
sequences needing to be translated to something else.
Currently, the answer is that lone surrogates are treated as valid
Unicode and allowed into Python via the Windows file APIs. Thus,
filename strings in Python are going to contain lone surrogates on
Windows anyway.
Therefore, any external library which freaks out upon seeing a lone
surrogate is already going to be broken for some filenames on Windows.
So, it seems to me, converting invalid UTF-8 sequences into lone
surrogates for Unix doesn't actually add any new form of brokenness.
So why not just do that?
So, I'm back to favoring the lone surrogate plan over the U+0000
plan. But either one seems better than the alternatives.
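
To make the lone surrogate plan concrete, here is a minimal sketch. It
assumes a later Python 3 release, where this idea is exposed as the
"surrogateescape" codec error handler (that handler postdates this
message; the sample filename is made up for illustration). Each byte
that is not valid UTF-8 is mapped to a lone low surrogate in the range
U+DC80..U+DCFF.

```python
# Sketch only: relies on the "surrogateescape" error handler from later
# Python 3 releases. The filename bytes are a made-up example.
raw = b"caf\xe9.txt"  # Latin-1 encoded name, so not valid UTF-8

# The undecodable byte 0xE9 becomes the lone surrogate U+DCE9.
name = raw.decode("utf-8", errors="surrogateescape")

print(ascii(name))  # 'caf\udce9.txt'

# Collect the code points that stand in for undecodable bytes.
escaped = [hex(ord(ch)) for ch in name if 0xDC80 <= ord(ch) <= 0xDCFF]
print(escaped)  # ['0xdce9']
```

Valid UTF-8 input decodes exactly as it would with a strict decoder, so
only the problem filenames are affected.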
The original byte string must be preserved for use in actually
opening files.
Or reversibly transformed.
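
As a rough illustration of "reversibly transformed": the sketch below
checks that bytes survive a trip through str and back. It assumes the
"surrogateescape" error handler found in later Python 3 releases; the
helper names to_text and to_bytes are hypothetical.

```python
# Sketch only: to_text/to_bytes are hypothetical helper names, and the
# "surrogateescape" handler is assumed to be available.
def to_text(raw: bytes) -> str:
    """Decode filename bytes, escaping undecodable bytes as lone surrogates."""
    return raw.decode("utf-8", "surrogateescape")


def to_bytes(text: str) -> bytes:
    """Encode back, turning escaped surrogates into the original bytes."""
    return text.encode("utf-8", "surrogateescape")


samples = [
    b"plain.txt",                    # pure ASCII
    "na\u00efve.txt".encode("utf-8"),  # valid UTF-8
    b"caf\xe9.txt",                  # invalid UTF-8 (Latin-1 bytes)
    bytes(range(1, 256)),            # every nonzero byte value
]

# Every byte string must come back unchanged.
round_trips = [to_bytes(to_text(raw)) == raw for raw in samples]
print(all(round_trips))
```

The guarantee runs in the bytes -> str -> bytes direction, which is the
one that matters when the string is later used to open the file.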
How it is displayed is another question. Doing something that works
for both Unicode display and access to the file is basically
impossible in all cases. Providing an encapsulation of the byte
string that has display methods, together with new methods to
transform the file path, and use parts of it to create other file
paths, is the solution I described earlier.
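
One possible reading of that encapsulation, sketched under loud
assumptions: the class name BytePath and its methods are invented here
and are not the API that was proposed. The object keeps the original
bytes for file access and offers a separate, lossy form for display.

```python
# Sketch only: BytePath, display() and join() are hypothetical names.
class BytePath:
    """Wrap the raw filename bytes; never lose them."""

    def __init__(self, raw: bytes, encoding: str = "utf-8"):
        self.raw = raw
        self.encoding = encoding

    def display(self) -> str:
        # Lossy, for humans only: undecodable bytes become U+FFFD.
        return self.raw.decode(self.encoding, "replace")

    def join(self, other: "BytePath") -> "BytePath":
        # Build a new path from parts, still working on bytes.
        return BytePath(self.raw.rstrip(b"/") + b"/" + other.raw, self.encoding)

    def __fspath__(self) -> bytes:
        # Lets open() and os functions receive the exact original bytes.
        return self.raw


p = BytePath(b"dir").join(BytePath(b"caf\xe9"))
print(ascii(p.display()))  # 'dir/caf\ufffd'
print(p.raw)               # b'dir/caf\xe9'
```

The display string cannot be used to reopen the file, which is the
trade-off this design accepts.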
This sounds like a fine solution. And it would work just as well with
a UTF-8b base API as with a dual string/byte string base API. The only
difference is the default behavior for people who don't use your
fancy new API. In the UTF-8b case, most things would work, even
with invalidly-encoded filenames.
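
A small sketch of what "most things would work" could look like,
assuming the UTF-8b behavior matches the "surrogateescape" error
handler of later Python 3 releases (the path below is made up).
Ordinary string operations on the path succeed; a strict encode, such
as one needed to print the name, is what fails.

```python
# Sketch only: assumes "surrogateescape" stands in for UTF-8b.
import posixpath

raw = b"/tmp/caf\xe9.txt"  # invalidly-encoded filename (made-up example)
name = raw.decode("utf-8", "surrogateescape")

# Plain str manipulation works without knowing about the bad byte.
base, ext = posixpath.splitext(posixpath.basename(name))
backup = posixpath.join(posixpath.dirname(name), base + ".bak")

# Converting back yields bytes that still contain the original 0xE9.
backup_bytes = backup.encode("utf-8", "surrogateescape")
print(backup_bytes)  # b'/tmp/caf\xe9.bak'

# What does not work: strict encoding rejects the lone surrogate.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
print(strict_ok)  # False
```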
James
_______________________________________________
Python-3000 mailing list
Python-3000@python.org
http://mail.python.org/mailman/listinfo/python-3000