On 09Feb2016 2017, Stephen J. Turnbull wrote:
> The problem here is the protocol that Python uses to return bytes paths,
> and that protocol is inconsistent between APIs and information is lost.
No, the problem is that the necessary information simply isn't always
available. Not even today: think removable media, especially archival
content. Also network file systems: I don't know if it still happens,
but I've seen Shift JIS, GB2312, and KOI8-R all in the same directory,
and sometimes two of those in the *same path*. (Don't ask me how
non-malicious users managed to do the latter!)
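
To make the mixed-encoding case concrete, here is a rough Python sketch. The directory and file names are invented for illustration, as is the assumption about which client wrote which component. It only aims to show that no single codec recovers both parts of such a path.

```python
# Hypothetical path whose components were written under different code pages.
dir_name = "日本"      # assumed to have been created by a Shift JIS client
file_name = "питон"    # assumed to have been created by a KOI8-R client
raw_path = dir_name.encode("shift_jis") + b"/" + file_name.encode("koi8_r")

intended = dir_name + "/" + file_name

# Decoding the whole path with either codec gets at most one component right.
as_sjis = raw_path.decode("shift_jis", errors="replace")
as_koi8 = raw_path.decode("koi8_r", errors="replace")

# Only per-component decoding recovers the text, and that needs knowledge
# (which codec applies to which component) that the file system never recorded.
head, tail = raw_path.split(b"/")
recovered = head.decode("shift_jis") + "/" + tail.decode("koi8_r")
```

Neither whole-path decode matches the intended text, which is why guessing one encoding for the entire path cannot be made reliable.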
But if we return bytes paths and the user passes them back in unchanged,
the encoding should be irrelevant. The earlier issue was that this round
trip doesn't work (e.g. a bytes path from os.scandir couldn't be passed
back into open()).
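
The round trip I have in mind looks roughly like the sketch below. The helper name roundtrip_bytes_names and the file name example.txt are made up for the example; it uses only os.scandir, os.fsencode and open from the standard library, and it shows the behaviour we want rather than what every platform did at the time of this thread.

```python
import os
import tempfile

def roundtrip_bytes_names(directory: bytes) -> list:
    """List a directory via a bytes path and reopen each file unchanged."""
    reopened = []
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file():
                # entry.path is bytes because `directory` was bytes; it is
                # handed straight back to open() with no decoding step.
                with open(entry.path, "rb") as fh:
                    fh.read()
                reopened.append(entry.name)
    return reopened

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file used only to have something to list.
    with open(os.path.join(tmp, "example.txt"), "w") as fh:
        fh.write("hello")
    names = roundtrip_bytes_names(os.fsencode(tmp))
```

Nothing in the helper needs to know the encoding of the names, which is the property the bytes interface is supposed to give us.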
> It really requires going through all the OS calls and either (a) making
> them consistently decode bytes to str using the declared FS encoding
> (currently 'mbcs', but I see no reason we can't make it 'utf_8'),
If it were that easy, it would have been done two decades ago. I'm no
fan of Windows[1], but it's obvious that Microsoft has devoted
enormous amounts of brainpower to the problem of encoding
rationalization since the early 90s. I don't think they would have
missed this idea.
I meant with Python's calls into the API. Anywhere Python does the
conversion from bytes to LPCWSTR (the UTF-16 type) there's a chance
it'll be wrong.
Your earlier comments (regarding encoding/decoding to/from Unicode,
which I didn't have anything valuable to add to) basically reflect the
fact that developers need to treat bytes paths as blobs on all platforms
and the core Python runtime needs to obtain and use them consistently.
Which means *always* using the Win32 *A APIs and never doing a
conversion ourselves.
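
As a Python-level analogy only (this is not what the Win32 *A APIs do internally), the sketch below contrasts a lossy conversion with a reversible one. The file name is made up, and the surrogateescape error handler is used here simply because it is a reversible decode that Python already provides.

```python
raw_name = b"caf\xe9.txt"   # hypothetical name: Latin-1 bytes, not valid UTF-8

# Lossy conversion: the undecodable byte becomes U+FFFD, so the original
# byte can no longer be recovered when encoding back.
lossy = raw_name.decode("utf-8", errors="replace")
lossy_back = lossy.encode("utf-8")

# Reversible conversion: surrogateescape smuggles the raw byte through str
# as a lone surrogate, so encoding with the same handler restores it.
escaped = raw_name.decode("utf-8", errors="surrogateescape")
restored = escaped.encode("utf-8", errors="surrogateescape")
```

Once a conversion like the first one has happened anywhere between the caller and the OS, the path that comes back is no longer the path that went in.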
Microsoft's solution here is the user's active code page, much like
*nix's solution as I understand it, except that where *nix will convert
_to_ the encoding as a normalized form, Windows will convert _from_ the
encoding to its UTF-16 "normalized" form. Back-compat concerns have
prevented any significant changes being made here, otherwise there
wouldn't be a 'bytes' interface at all. (Or more likely, everything
would be UTF-8 based, but back-compat is king in Windows-land.)
Cheers,
Steve
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev