2009/4/28 Glenn Linderman <v+pyt...@g.nevcal.com>:
> The switch from PUA to half-surrogates does not resolve the issues with the
> encoding not being a 1-to-1 mapping, though. The very fact that you think
> you can get away with use of lone surrogates means that other people might,
> accidentally or intentionally, also use lone surrogates for some other
> purpose. Even in file names.

It does solve this issue, because (unlike e.g. U+F01FF) '\udcff' is not a
valid Unicode character (not a character at all, really) and the only way
you can put this in a POSIX filename is if you use a very lenient UTF-8
encoder that gives you b'\xed\xb3\xbf'.

Since this byte sequence doesn't represent a valid character when decoded
with UTF-8, it should simply be considered an invalid UTF-8 sequence of
three bytes and decoded to '\udced\udcb3\udcbf' (*not* '\udcff').

Martin: maybe the PEP should say this explicitly?

Note that the round-trip works without ambiguities between '\udcff' in the
filename:

  b'\xed\xb3\xbf' -> '\udced\udcb3\udcbf' -> b'\xed\xb3\xbf'

and b'\xff' in the filename, decoded by Python to '\udcff':

  b'\xff' -> '\udcff' -> b'\xff'

-- 
Lino Mastrodomenico
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
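
For illustration, here is a minimal sketch of the two round-trips described
in the message. It assumes a Python 3 interpreter in which the proposed error
handler is available under the name 'surrogateescape', and it uses the
'surrogatepass' handler as a stand-in for the "very lenient UTF-8 encoder".
The helper name round_trip is made up for this example.

```python
def round_trip(raw: bytes):
    """Decode filename bytes to str and encode them back (hypothetical helper)."""
    # Undecodable bytes 0xNN become lone surrogates U+DCNN
    # (assumes the handler is registered as "surrogateescape").
    text = raw.decode("utf-8", "surrogateescape")
    # Encoding with the same handler turns each U+DCNN back into byte 0xNN.
    return text, text.encode("utf-8", "surrogateescape")


# Stand-in for a "very lenient" encoder: writes U+DCFF as three UTF-8 bytes.
lenient = "\udcff".encode("utf-8", "surrogatepass")

for raw in (lenient, b"\xff"):
    text, back = round_trip(raw)
    # ascii() is used because lone surrogates cannot be printed directly.
    print(ascii(raw), "->", ascii(text), "->", ascii(back))
```

The two inputs decode to different strings, so neither filename can be
mistaken for the other, and each one encodes back to its original bytes.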