On approximately 4/28/2009 6:01 AM, came the following characters from the keyboard of Lino Mastrodomenico:
2009/4/28 Glenn Linderman <v+pyt...@g.nevcal.com>:
The switch from PUA to half-surrogates does not resolve the issues with the
encoding not being a 1-to-1 mapping, though.  The very fact that you  think
you can get away with use of lone surrogates means that other people might,
accidentally or intentionally, also use lone surrogates for some other
purpose.  Even in file names.

It does solve this issue, because (unlike e.g. U+F01FF) '\udcff' is
not a valid Unicode character (not a character at all, really) and the
only way you can put this in a POSIX filename is if you use a very
lenient  UTF-8 encoder that gives you b'\xed\xb3\xbf'.


Wrong.

An 8859-1 locale allows any byte sequence to placed into a POSIX filename.

And while U+DCFF is illegal alone in Unicode, it is not illegal in Python str values. And from my testing, Python 3's current UTF-8 encoder will happily provide exactly the bytes value you mention when given U+DCFF.


Since this byte sequence doesn't represent a valid character when
decoded with UTF-8, it should simply be considered an invalid UTF-8
sequence of three bytes and decoded to '\udced\udcb3\udcbf' (*not*
'\udcff').

Martin: maybe the PEP should say this explicitly?

Note that the round-trip works without ambiguities between '\udcff' in
the filename:

b'\xed\xb3\xbf' -> '\udced\udcb3\udcbf' -> b'\xed\xb3\xbf'

and b'\xff' in the filename, decoded by Python to '\udcff':

b'\xff' -> '\udcff' -> b'\xff'


Others have made this suggestion, and it is helpful to the PEP, but not sufficient. As implemented as an error handler, I'm not sure that the b'\xed\xb3\xbf' sequence would trigger the error handler, if the UTF-8 decoder is happy with it. Which, in my testing, it is.


--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Reply via email to