Eryk Sun <eryk...@gmail.com> added the comment:

> Vice versa, using bytes objects cannot represent all file names 
> on Windows (in the standard mbcs encoding), hence Windows 
> applications should use string objects to access all files.

This is outdated advice that should be removed, or at least reworded to 
emphasize that the 'mbcs' encoding is only used in legacy mode, with a link to 
the documentation of sys._enablelegacywindowsfsencoding [1].

Starting in Python 3.6 (PEP 529), the default filesystem encoding in Windows 
is UTF-8. Internally, a UTF-8 byte string is translated to UTF-16 (2 or 4 
bytes per character), the native Unicode encoding of the Windows API.
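
As a rough illustration of that translation, the sketch below decodes a UTF-8 
byte string and measures how many bytes each character takes in UTF-16. The 
sample name and the variable names are made up for the example, and the 
explicit codec calls only mimic what happens internally; they are not the 
actual code path CPython uses.

```python
# Hypothetical file name: U+00E9 (in the BMP) followed by U+1F600 (outside it).
name_utf8 = '\xe9\U0001f600'.encode('utf-8')

# Decode the UTF-8 bytes, then encode each character as UTF-16 (little
# endian, no BOM) to see how many bytes it occupies.
text = name_utf8.decode('utf-8')
utf16_sizes = [len(ch.encode('utf-16-le')) for ch in text]

print(utf16_sizes)  # BMP character: 2 bytes; non-BMP character: 4 bytes
```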

A caveat is that Windows filesystems use 16-bit characters that are not 
restricted to valid Unicode. In particular, code units in the range 
U+D800-U+DFFF are not required to form valid surrogate pairs, so a file name 
may contain a lone surrogate. This is "Wobbly" Unicode, and the filesystem 
encoding thus needs to be "Wobbly Transformation Format, 8-bit" (WTF-8). 
Python implements this by setting the filesystem error handler to 
"surrogatepass" in Windows, in contrast to "surrogateescape" in POSIX. For 
example, os.fsencode('\ud800') succeeds in Windows but fails in POSIX, while 
os.fsdecode(b'\x80') fails in Windows but succeeds in POSIX. The latter case 
is not a practical problem since Windows filesystem functions will never 
return an invalid WTF-8 byte string.
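
The difference between the two error handlers can be reproduced on any 
platform by passing them to the UTF-8 codec explicitly, instead of relying on 
os.fsencode and os.fsdecode. This is only a sketch: it assumes a UTF-8 
filesystem encoding on the POSIX side, and the variable names are 
illustrative.

```python
lone = '\ud800'  # lone high surrogate, legal in a Windows file name

# Windows behavior: "surrogatepass" encodes the surrogate as WTF-8.
wtf8 = lone.encode('utf-8', 'surrogatepass')
assert wtf8.decode('utf-8', 'surrogatepass') == lone  # round-trips

# POSIX behavior: "surrogateescape" only handles U+DC80-U+DCFF, so this fails.
try:
    lone.encode('utf-8', 'surrogateescape')
    posix_encode_ok = True
except UnicodeEncodeError:
    posix_encode_ok = False

# POSIX behavior: an undecodable byte is smuggled through as a low surrogate.
escaped = b'\x80'.decode('utf-8', 'surrogateescape')

# Windows behavior: b'\x80' is not an encoded surrogate, so decoding fails.
try:
    b'\x80'.decode('utf-8', 'surrogatepass')
    windows_decode_ok = True
except UnicodeDecodeError:
    windows_decode_ok = False

print(wtf8, posix_encode_ok, ascii(escaped), windows_decode_ok)
```

On a real system, sys.getfilesystemencodeerrors() reports which of the two 
handlers is in effect.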

---
[1] 
https://docs.python.org/3/library/sys.html#sys._enablelegacywindowsfsencoding

----------
components: +Unicode, Windows
nosy: +eryksun, ezio.melotti, paul.moore, steve.dower, tim.golden, vstinner, 
zach.ware

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue43395>
_______________________________________