Re: [Python-ideas] Fix default encodings on Windows

Steve Dower Wed, 10 Aug 2016 16:49:48 -0700

On 10Aug2016 1630, Random832 wrote:
> On Wed, Aug 10, 2016, at 19:04, eryk sun wrote:
>> Using 'mbcs' doesn't work reliably with arbitrary bytes paths in
>> locales that use a DBCS codepage such as 932.
>
> Er... utf-8 doesn't work reliably with arbitrary bytes paths either,
> unless you intend to use surrogateescape (which you could also do with
> mbcs).
>
> Is there any particular reason to expect all bytes paths in this
> scenario to be valid UTF-8?

On Windows, all paths are effectively UCS-2 (they are defined as UTF-16, but surrogate pairs don't seem to be validated, which IIUC means it's really UCS-2), so while the majority can be encoded as valid UTF-8, there are some paths which cannot. (These paths are going to break many other tools though, such as PowerShell, so we won't be in bad company if we can't handle them properly in edge cases.)
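
To make the edge case concrete, here is a minimal sketch of a name that a UCS-2 style file system could hold but that strict UTF-8 cannot encode. The file names and the helper encodes_as_strict_utf8 are made up for illustration; only the str.encode behaviour comes from Python itself.

```python
# A name containing a lone (unpaired) surrogate code unit. Hypothetical
# example name; a file system that does not validate surrogate pairs
# could store something like this.
odd_name = "report_\udc80.txt"

# An ordinary non-ASCII name for comparison.
normal_name = "données_日本語.txt"


def encodes_as_strict_utf8(name):
    """Return True if the name can be encoded with the default (strict) handler."""
    try:
        name.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


normal_ok = encodes_as_strict_utf8(normal_name)  # well-formed text encodes fine
odd_ok = encodes_as_strict_utf8(odd_name)        # the lone surrogate is rejected
```

The majority case (normal_name) goes through unchanged; only names with unpaired surrogates fall into the problematic group described above.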

surrogateescape is irrelevant because it's only for decoding from bytes. An alternative approach would be to replace mbcs with a ucs-2 encoding that is basically just a blob of the path that was returned from Windows (using the Unicode APIs). None of the manipulation functions would work on this though, since nearly every second character would be \x00, but it's the only way (besides using str) to maintain full fidelity for every possible path name.
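
A rough sketch of why such a blob would be awkward to manipulate, using Python's utf-16-le codec as a stand-in for the hypothetical ucs-2 encoding (the path and variable names are illustrative only):

```python
path = "C:\\temp"

# Stand-in for the raw wide-character blob Windows would hand back.
blob = path.encode("utf-16-le")

# For ASCII-range names every second byte is zero.
zero_bytes = blob.count(0)

# Naive bytes manipulation cuts through the 2-byte code units:
# splitting on a single backslash byte leaves the separator's
# trailing \x00 stuck to the front of the next component.
parts = blob.split(b"\\")
misaligned = parts[1].startswith(b"\x00")

# Full fidelity is kept as long as the blob is left untouched.
restored = blob.decode("utf-16-le")
```

Here blob is 14 bytes for a 7-character path, and the second element of parts no longer decodes as a sequence of whole code units, which is the kind of breakage that byte-oriented path functions would run into.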

Compromising on UTF-8 is going to increase consistency across platforms and across different Windows installations without increasing the rate of errors above what we currently see (given that invalid characters are currently replaced with '?'). It's not a 100% solution, but it's a 99% solution where the 1% is not handled well by anyone.
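
As an illustration of the '?' replacement, the sketch below uses cp1252 with errors="replace" as a portable stand-in for what an ANSI code page does to characters it cannot represent. This is an assumption for demonstration: the real mbcs conversion happens inside Windows and may differ in detail. The file names are invented.

```python
name_a = "日本語.txt"
name_b = "中文字.txt"

# Lossy: characters outside the code page collapse to '?',
# so two different names become indistinguishable.
lossy_a = name_a.encode("cp1252", errors="replace")
lossy_b = name_b.encode("cp1252", errors="replace")
collide = lossy_a == lossy_b

# UTF-8 keeps the names distinct and round-trips them exactly.
utf8_a = name_a.encode("utf-8")
utf8_b = name_b.encode("utf-8")
round_trip_ok = (utf8_a.decode("utf-8") == name_a
                 and utf8_b.decode("utf-8") == name_b)
```

Both lossy results are b"???.txt", whereas the UTF-8 forms stay distinct, which is the sense in which UTF-8 does not add errors beyond what is already seen.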


Cheers,
Steve

