On Fri, Dec 9, 2016 at 7:41 AM, Steve D'Aprano <steve+pyt...@pearwood.info> wrote:
> Frankly, I think that Apple HFS+ is the only modern file system that gets
> Unicode right. Not only does it restrict file systems to valid UTF-8
> sequences, but it forces them to a canonical form to avoid the é é gotcha,
> and treats file names as case preserving but case insensitive.

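For anyone who hasn't run into the "é é gotcha" quoted above, here is a minimal sketch of it using only the standard library. The helper name same_name is made up for illustration, and the choice of NFC as the comparison form is an assumption of this sketch, not a statement about what any particular file system does internally.

```python
import unicodedata

# Two spellings of what a reader sees as the same name.
composed = '\u00e9'      # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = 'e\u0301'   # 'e' followed by COMBINING ACUTE ACCENT

def same_name(a, b, form='NFC'):
    """Hypothetical helper: compare two names after normalizing both.

    'form' is an assumption for this sketch; any single canonical form
    works as long as both names are normalized to the same one.
    """
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

# Without normalization the two strings are different, so a file system
# that compares names code point by code point keeps both.
print(composed == decomposed)
print(same_name(composed, decomposed))
```

A file system that forces names to one canonical form effectively applies a comparison like same_name to every lookup, which is why the two spellings cannot coexist there.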
Windows NTFS doesn't normalize names to a canonical form. It also allows
lone surrogate codes, which are invalid in UTF-16. For case-insensitive
matches it converts to upper case, but the conversion table it uses is
extremely conservative. Here's a simple function to convert a string to
upper case using NT's runtime library function RtlUpcaseUnicodeChar:

    import ctypes
    import os

    ntdll = ctypes.WinDLL('ntdll')
    # The function takes and returns one WCHAR (a UTF-16 code unit).
    ntdll.RtlUpcaseUnicodeChar.argtypes = (ctypes.c_ushort,)
    ntdll.RtlUpcaseUnicodeChar.restype = ctypes.c_ushort

    def upcase(s):
        up = []
        for char in s:
            b = bytearray()
            # A non-BMP character encodes as two UTF-16 code units.
            for unit in memoryview(char.encode('utf-16le')).cast('H'):
                unit_up = ntdll.RtlUpcaseUnicodeChar(unit)
                b += unit_up.to_bytes(2, 'little')
            up.append(b.decode('utf-16le'))
        return ''.join(up)

For example:

    >>> upcase('abcd')
    'ABCD'
    >>> upcase('αβψδ')
    'ΑΒΨΔ'
    >>> upcase('ßẞıİÅσςσ')
    'ßẞıİÅΣςΣ'

In the last example only σ changes; ß, ı and ς are left as they are.

Attempting to create two files named 'ßẞıİÅσςσ' and 'ßẞıİÅΣςΣ' in the
same NTFS directory fails, as expected:

    >>> s = 'ßẞıİÅσςσ'
    >>> open(s, 'x').close()
    >>> open(upcase(s), 'x').close()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    FileExistsError: [Errno 17] File exists: 'ßẞıİÅΣςΣ'

Note that Windows treats Python's standard case conversions of this name
as distinct file names:

    >>> open(s.upper(), 'x').close()
    >>> open(s.lower(), 'x').close()
    >>> open(s.casefold(), 'x').close()
    >>> os.listdir()
    ['ssssıi̇åσσσ', 'SSẞIİÅΣΣΣ', 'ßßıi̇åσςσ', 'ßẞıİÅσςσ']
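The directory listing above has four entries because Python's own conversions already disagree with each other before NTFS gets involved. Here is a small, platform-independent sketch of that. The helper name case_variants is made up for illustration; it only calls str.upper, str.lower and str.casefold and says nothing about how NTFS compares names.

```python
def case_variants(name):
    """Hypothetical helper: the distinct strings produced by Python's
    standard case conversions of 'name', plus the name itself."""
    return {name, name.upper(), name.lower(), name.casefold()}

s = 'ßẞıİÅσςσ'

# Full case mappings can change the length: upper() expands 'ß' to 'SS',
# and casefold() expands both 'ß' and 'ẞ' to 'ss'.
for variant in sorted(case_variants(s)):
    print(len(variant), variant)

# casefold() also maps final sigma 'ς' to 'σ', which lower() leaves alone.
print('ς' in s.lower(), 'ς' in s.casefold())
```

A conservative one-code-unit-to-one-code-unit table cannot produce expansions like 'ß' to 'SS', which is consistent with all four names being accepted as separate files.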