On Oct 1, 2008, at 3:03 PM, Glenn Linderman wrote:

On approximately 10/1/2008 11:30 AM, came the following characters from the keyboard of James Y Knight:
BTW, Windows will cheerfully let you create and access files with "garbage surrogates" in their names.
Try it yourself:

import os

open(u"\ud8fd", 'w').close()  # file name is a single lone high surrogate
os.listdir(u'.')              # on Windows, the listing includes that name

But Windows doesn't have the problem of non-Unicode sequences needing to be translated to something else in the first place. So this is mostly irrelevant to the problem at hand.


Well...either you consider lone surrogates as valid Unicode sequences, or else Windows *does* have the problem of non-Unicode sequences needing to be translated to something else.

Currently, the answer is that lone surrogates are treated as valid Unicode, and allowed into Python via the Windows file APIs. Thus, filename strings in Python are going to have lone surrogates anyway on Windows.

Therefore, any external library which freaks out upon seeing a lone surrogate is already going to be broken for some filenames on Windows. So, it seems to me, converting invalid UTF-8 sequences into lone surrogates for Unix doesn't actually add any new form of brokenness. So why not just do that?
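To make the lone surrogate plan concrete, here is a minimal sketch. It assumes the `surrogateescape` error handler found in later Python 3 releases, which maps each undecodable byte to a lone surrogate in the range U+DC80..U+DCFF. The filename is made up for illustration.

```python
# Sketch only: a hypothetical Latin-1 filename that is not valid UTF-8.
raw = b"caf\xe9.txt"

# Each undecodable byte becomes a lone low surrogate (0xE9 -> U+DCE9).
name = raw.decode("utf-8", errors="surrogateescape")
print(ascii(name))

# A strict consumer (standing in for an "external library") rejects
# the lone surrogate, just as it would for such a filename on Windows.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
print(strict_ok)
```

The second half shows the kind of breakage in question: code that insists on strictly valid Unicode fails on this string regardless of which platform produced it.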

So, I'm back to favoring the lone surrogate plan over the U+0000 plan. But either one seems better than the alternatives.

The original byte string must be preserved for use in actually opening files.

Or reversibly transformed.
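A rough illustration of what "reversibly transformed" could mean, again assuming the `surrogateescape` error handler of later Python 3 releases. The helper names `to_text` and `to_bytes` are invented for this sketch.

```python
def to_text(raw: bytes) -> str:
    """Decode a filename, escaping invalid bytes as lone surrogates."""
    return raw.decode("utf-8", errors="surrogateescape")


def to_bytes(text: str) -> bytes:
    """Turn escaped surrogates back into the original bytes."""
    return text.encode("utf-8", errors="surrogateescape")


# Hypothetical filenames: plain ASCII, valid UTF-8, Latin-1, and pure garbage.
samples = [b"plain.txt", b"caf\xc3\xa9.txt", b"caf\xe9.txt", b"\xff\xfe\x80"]

# Every byte string survives the trip through str unchanged.
round_trips = all(to_bytes(to_text(s)) == s for s in samples)
print(round_trips)
```

Since the string form can always be turned back into the exact bytes, the original byte string does not have to be carried alongside it in order to open the file.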

How it is displayed is another question. Doing something that works for both Unicode display and access to the file is basically impossible in all cases. Providing an encapsulation of the byte string that has display methods, together with new methods to transform the file path, and use parts of it to create other file paths, is the solution I described earlier.
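One possible shape for such an encapsulation, as a sketch only: the class name `RawPath` and its methods are hypothetical and not taken from any concrete proposal in this thread. It keeps the original bytes for file access and offers a separate, lossy form for display.

```python
import os


class RawPath:
    """Hypothetical wrapper around the original filename bytes."""

    def __init__(self, raw: bytes):
        self.raw = raw  # exact bytes, used when actually opening the file

    def display(self) -> str:
        # Lossy, for humans only: invalid bytes become U+FFFD.
        return self.raw.decode("utf-8", errors="replace")

    def join(self, *parts: "RawPath") -> "RawPath":
        # Build new paths from the bytes, never from the display form.
        return RawPath(os.path.join(self.raw, *(p.raw for p in parts)))


p = RawPath(b"dir").join(RawPath(b"caf\xe9.txt"))
print(p.display())
```

The display string cannot be used to reopen the file, which is the trade-off: display and access are served by different methods on the same object.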

This sounds like a fine solution. And it would work just as well with a UTF-8b base API as with a dual string/byte string base API. The only difference is the default behavior for people who don't use your fancy new API. In the UTF-8b case, most things would work, even with invalidly-encoded filenames.

James
_______________________________________________
Python-3000 mailing list
Python-3000@python.org
http://mail.python.org/mailman/listinfo/python-3000