On Wed, Oct 1, 2008 at 4:14 PM, James Y Knight <[EMAIL PROTECTED]> wrote:
> On Oct 1, 2008, at 3:03 PM, Glenn Linderman wrote:
>> On approximately 10/1/2008 11:30 AM, came the following characters from
>> the keyboard of James Y Knight:
>>>
>>> BTW, Windows will cheerfully let you create and access files with
>>> "garbage surrogates" in it.
>>> Try it yourself:
>>>
>>> open(u"\ud8fd", 'w').close()
>>> os.listdir(u'.')
>>
>> But Windows doesn't have the problem of non-Unicode sequences needing to
>> be translated to something else in the first place. So this is mostly
>> irrelevant to the problem at hand.
>
>
> Well...either you consider lone surrogates as valid Unicode sequences, or
> else Windows *does* have the problem of non-Unicode sequences needing to be
> translated to something else.
>
> Currently, the answer is that lone surrogates are treated as valid Unicode,
> and allowed into Python via the windows file APIs. Thus, filename strings in
> Python are going to have lone surrogates, anyways, on Windows.
We allow lone surrogates into our unicode objects, but they aren't valid
Unicode. They'll fail for any APIs that expect only valid Unicode.

> Therefore, any external library which freaks out upon seeing a lone
> surrogate is already going to be broken for some filenames on Windows. So,
> it seems to me, converting invalid UTF-8 sequences into lone surrogates for
> Unix doesn't actually add any new form of brokenness. So why not just do
> that?

I see it the opposite: lone surrogates on Windows should be rejected from
unicode APIs, just as we want to do for invalid UTF-8 on Linux. But since
the same rationale for having a "raw" API applies, maybe the Windows byte
APIs should expose raw UTF-16, rather than letting it be translated?

--
Adam Olsen, aka Rhamphoryncus
_______________________________________________
Python-3000 mailing list
Python-3000@python.org
http://mail.python.org/mailman/listinfo/python-3000
Unsubscribe: http://mail.python.org/mailman/options/python-3000/archive%40mail-archive.com
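
A rough illustration of the two positions in this thread (not code from the
thread itself). It assumes Python 3, where the strict UTF-8 codec refuses lone
surrogates, and it uses the `surrogateescape` error handler, which was added to
Python later (PEP 383, Python 3.1) and maps undecodable bytes to lone
surrogates much as James suggests. The helper name `encodes_as_strict_utf8`
and the sample filename bytes are made up for the example.

```python
# A lone high surrogate, the same code point as in the open() example above.
lone = "\ud8fd"


def encodes_as_strict_utf8(text):
    """Return True if text survives the strict UTF-8 codec (hypothetical helper)."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


# Bytes that are not valid UTF-8: 0xff never occurs in well-formed UTF-8.
raw_name = b"caf\xff.txt"

# surrogateescape turns each undecodable byte 0xNN into the lone surrogate U+DCNN,
# so the str object now carries a lone surrogate, as it can on Windows.
decoded = raw_name.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_tripped = decoded.encode("utf-8", errors="surrogateescape")

print(encodes_as_strict_utf8(lone))     # strict codec rejects the lone surrogate
print(encodes_as_strict_utf8(decoded))  # same for the escaped filename
print(round_tripped == raw_name)
```

The first two results are Adam's point: any API that insists on valid Unicode
will choke on such strings. The round trip is James's point: the original
bytes are not lost, so the filename can still be handed back to the OS.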