On Sat, 19 Mar 2016 08:08 am, Chris Angelico wrote:

> On Sat, Mar 19, 2016 at 8:02 AM, Marko Rauhamaa <ma...@pacujo.net> wrote:
>> Chris Angelico <ros...@gmail.com>:
>>> On Sat, Mar 19, 2016 at 2:26 AM, Marko Rauhamaa <ma...@pacujo.net>
>>> wrote:
>>>> It may be that Python's Unicode abstraction is an untenable illusion
>>>> because the underlying reality is 8-bit and there's no way to hide it
>>>> completely.
>>>
>>> The underlying reality is 1-bit. Or maybe the underlying reality is
>>> actually electrical signals that don't even have a clear definition of
>>> "bits" and bounce between two states for a few fractions of a second
>>> before settling. And maybe someone's implementing Python on the George
>>> Banks Kite CPU, which consists of two cents' worth of paper and
>>> string, on which text is actually represented by glyph. They're all
>>> equally valid notions of "underlying reality".
>>>
>>> Text is an abstract concept, just as numbers are.
>>
>> The question is how tenable the illusion is. If the OS gave the
>> appropriate guarantees (say, all pathnames are encoded Unicode strings),
>> the abstraction could be maintained. Unfortunately, the legacy shines
>> through making you wonder if Python has overreached prematurely with its
>> Unicode HAL.
>
> The problem is not Python's Unicode strings, then. The problem is the
> notion that path names are text. If they're text, they should be
> exclusively text (although, for low-level efficiency, they're more
> likely to be defined as "valid UTF-8 sequences" rather than "sequences
> of Unicode codepoints"); since they're not, they are fundamentally
> bytes. But that's not a problem with Python - it's a problem with the
> file system.

One thing that NTFS gets right is that all path names are guaranteed to be
well-formed, valid Unicode. I believe that they are stored in UTF-16, and
unlike the ext file systems used on Linux, they are not arbitrary bytes.

I believe that HFS+ on Apple Macs goes one step further and guarantees that
paths are always fully normalised, so that it's impossible to have (e.g.)
two files ã (U+00E3 LATIN SMALL LETTER A WITH TILDE) and ã (U+0061 LATIN
SMALL LETTER A + U+0303 COMBINING TILDE) in the same directory.

Unfortunately, backwards compatibility is holding Linux file systems back...


-- 
Steven
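
P.S. A minimal sketch of the two points above, in case it helps. The helper
name same_name() and the choice of NFC as the comparison form are mine, purely
for illustration; this says nothing about how any particular file system
implements its checks. The last part uses the surrogateescape error handler
from PEP 383 to show one way arbitrary bytes in a name can be round-tripped
through a Python str.

```python
import unicodedata

# The two spellings of "a with tilde" from the example above.
composed = "\u00e3"      # U+00E3 LATIN SMALL LETTER A WITH TILDE
decomposed = "a\u0303"   # U+0061 followed by U+0303 COMBINING TILDE


def same_name(a, b, form="NFC"):
    """Compare two names after normalising both to the same form.

    Hypothetical helper: a normalising file system would need some
    check along these lines, but this is not taken from any real one.
    """
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


print(composed == decomposed)           # False: different code points
print(same_name(composed, decomposed))  # True: equal once normalised

# A byte string that is not valid UTF-8, as a Linux file name may be.
raw = b"caf\xff"

# surrogateescape maps each undecodable byte to a lone surrogate, so the
# original bytes can be recovered exactly on the way back out.
as_text = raw.decode("utf-8", "surrogateescape")
round_tripped = as_text.encode("utf-8", "surrogateescape")
print(round_tripped == raw)             # True
```

Note that as_text is not well-formed Unicode (it contains a lone surrogate),
which is exactly the leak in the abstraction being discussed.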