Eryk Sun <eryk...@gmail.com> added the comment:

The test assumes that Unix filesystems store names as arbitrary sequences of 
bytes, with only ASCII slash and null reserved. Windows NTFS stores names as 
arbitrary sequences of 16-bit words, with many reserved ASCII characters 
including \/:*?<>"| and control characters 0x00-0x1F. WSL implements a UTF-8 
filesystem encoding over this by transcoding bytes from UTF-8 to UTF-16LE and 
escaping reserved characters (excepting slash and null) as sequences that begin 
with "#" (e.g. "<#" -> "#003C#0023"). The escaped names are visible only from 
Windows, in the distro's "LocalState\rootfs" tree.
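
As a rough illustration of the escaping described above, here is a minimal 
sketch in Python. It is not WSL's actual code: the names RESERVED, 
escape_name and unescape_name are hypothetical, and the "#" plus four 
uppercase hex digits format is inferred only from the "<#" example.

```python
# Hypothetical sketch of the "#" escaping; inferred from the "<#" example,
# not taken from WSL's implementation.
# Characters reserved on NTFS but legal in Unix names (slash and null are
# excluded because they are reserved on both sides), plus "#" itself so that
# the escaping stays reversible.
RESERVED = set('\\:*?<>"|#') | {chr(c) for c in range(0x01, 0x20)}


def escape_name(name: str) -> str:
    """Replace each reserved character with "#" and four uppercase hex digits."""
    return ''.join(f'#{ord(ch):04X}' if ch in RESERVED else ch for ch in name)


def unescape_name(escaped: str) -> str:
    """Reverse escape_name by decoding every "#XXXX" sequence."""
    out = []
    i = 0
    while i < len(escaped):
        if escaped[i] == '#':
            out.append(chr(int(escaped[i + 1:i + 5], 16)))
            i += 5
        else:
            out.append(escaped[i])
            i += 1
    return ''.join(out)


print(escape_name('<#'))  # #003C#0023
```

Because "#" is escaped as well, any name made of valid characters round-trips. 
The problem described next is with bytes that never decode in the first place.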

This scheme fails for TESTFN_UNDECODABLE. Bytes that aren't valid UTF-8, and so 
can't be transcoded to UTF-16LE, are replaced by the replacement character 
U+FFFD, which loses the original name. For example:

    >>> import os
    >>> n = b'\xff'
    >>> open(n, 'w').close()
    >>> os.listdir(b'.')
    [b'\xef\xbf\xbd']
    >>> hex(ord(os.listdir('.')[0]))
    '0xfffd'

WSL could address this by replacing its current "#" escaping with a 
translation of all reserved and undecodable bytes to the U+DC00-U+DCFF 
surrogate range, like Python's "surrogateescape" error handler. The Windows API 
could even support this with a new flag for MultiByteToWideChar and 
WideCharToMultiByte.
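
For comparison, this is roughly how the two error handlers behave in Python 
itself. It only demonstrates Python's codecs, not anything WSL or the Windows 
API does today. Note that Python's handler maps only bytes 0x80-0xFF (to 
U+DC80-U+DCFF); extending it to reserved ASCII characters, as suggested above, 
is not shown here.

```python
raw = b'\xff'  # a single byte that is not valid UTF-8

# 'replace' is lossy: every undecodable byte becomes U+FFFD, so distinct
# byte strings can collapse to the same name.
lossy = raw.decode('utf-8', errors='replace')
collides = b'\xfe'.decode('utf-8', errors='replace') == lossy

# 'surrogateescape' maps the byte to a lone low surrogate (0xDC00 + byte),
# which encodes back to the original byte.
escaped = raw.decode('utf-8', errors='surrogateescape')
restored = escaped.encode('utf-8', errors='surrogateescape')

print(hex(ord(lossy)))    # 0xfffd
print(collides)           # True
print(hex(ord(escaped)))  # 0xdcff
print(restored == raw)    # True
```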

----------
nosy: +eryksun

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue38454>
_______________________________________