Steve Dower writes:

 > On 09Feb2016 1801, Andrew Barnert wrote:
 > > On Feb 9, 2016, at 17:37, Steve Dower <pyt...@stevedower.id.au> wrote:
 > >
 > >> Could we perhaps redefine bytes paths on Windows as utf8 and use
 > >> Unicode everywhere internally?
 > >
 > > When you receive bytes from argv, stdin, a text file, a GUI, a named
 > > pipe, etc., and then use them as a path, Python treating them as UTF-8
 > > would break everything.
 >
 > Sure, but that's already broken today if you're communicating bytes via
 > some protocol without manually managing the encoding, in which case you
 > should be decoding it (and potentially re-encoding to
 > sys.getfilesystemencoding()).

The problem is that treating them as UTF-8 in Python will raise errors
on any file name that isn't valid UTF-8, or corrupt the file name if
you use one of the error handlers available in Python 2. If you use
Latin-1, that (1) handles all 256 byte values, and (2) roundtrips
through Unicode. The decoded name is semantically useless to users,
but if it is just read from a directory and then used to manipulate
file contents, that is no problem.

Of course, if you then edit a multibyte file name as Unicode, it is
likely that all hell will break loose. But you can take that sentence
and s/Unicode/bytes/, too. :-/

 > The problem here is the protocol that Python uses to return bytes paths,
 > and that protocol is inconsistent between APIs and information is lost.

No, the problem is that the necessary information simply isn't always
available. Not even today: think removable media, especially archival
content. Also network file systems: I don't know if it still happens,
but I've seen Shift JIS, GB2312, and KOI8-R all in the same directory,
and sometimes two of those in the *same path*. (Don't ask me how
non-malicious users managed to do the latter!)

 > It really requires going through all the OS calls and either (a) making
 > them consistently decode bytes to str using the declared FS encoding
 > (currently 'mbcs', but I see no reason we can't make it 'utf_8'),

If it were that easy, it would have been done two decades ago. I'm no
fan of Windows[1], but it's obvious that Microsoft has devoted
enormous amounts of brainpower to the problem of encoding
rationalization since the early 90s. I don't think they would have
missed this idea.

Footnotes:
[1]  Its complete inability to DTRT for mixed English and Japanese was
what drove me to Unix-like OSes in the early 90s.
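
To make the UTF-8 versus Latin-1 point above concrete, here is a minimal Python 3 sketch. The file names are invented for illustration, and it only exercises the codecs; it does not touch any real file system or reproduce what any OS call actually does.

```python
# Hypothetical file name stored in a legacy encoding (Shift JIS).
raw = "日本語.txt".encode("shift_jis")

# Strict UTF-8 decoding raises on bytes that are not valid UTF-8.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# A lossy error handler avoids the exception but corrupts the name:
# re-encoding does not give back the original bytes.
replaced = raw.decode("utf-8", errors="replace")

# Latin-1 maps each of the 256 byte values to one code point, so the
# decoded string is unreadable but round-trips exactly.
roundtrip = raw.decode("latin-1").encode("latin-1")

# Two encodings in the *same path* (hypothetical components):
# no single codec recovers the intended text for both parts.
mixed = b"/".join(["日本語".encode("shift_jis"), "файл".encode("koi8_r")])
mixed_roundtrip = mixed.decode("latin-1").encode("latin-1")
mixed_as_sjis = mixed.decode("shift_jis", errors="replace")

print(utf8_ok, roundtrip == raw, mixed_roundtrip == mixed)
```

The Latin-1 round trip preserves the bytes in both cases, which is all that is needed to open the file again. It says nothing about what the name means, and that missing information is the real problem.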