Hrvoje Niksic <hrvoje.nik...@avl.com> wrote:
> Assume a UTF-8 locale. A file named b'\xff', being an invalid UTF-8
> sequence, will be converted to the half-surrogate '\udcff'. However,
> a file named b'\xed\xb3\xbf', a valid[1] UTF-8 sequence, will also be
> converted to '\udcff'. Those are quite different POSIX pathnames; how
> will Python know which one it was when I later pass '\udcff' to
> open()?
>
> [1]
> I'm assuming that it's valid UTF8 because it passes through Python
> 2.5's '\xed\xb3\xbf'.decode('utf-8'). I don't claim to be a UTF-8
> expert.

I'm not a UTF-8 expert either, but I got bitten by this yesterday. I was
uploading a file to a Google Search Appliance and it was rejected as
invalid UTF-8, despite having been encoded into UTF-8 by Python. The cause
was a byte sequence that decoded to a half surrogate, similar to your
example above. Python will happily decode and encode such sequences but,
as I found to my cost, other systems reject them.

Reading Wikipedia implies that Python is wrong to accept these sequences,
and I think (though I'm not a lawyer) that RFC 3629 also implies this:

  "The definition of UTF-8 prohibits encoding character numbers between
  U+D800 and U+DFFF, which are reserved for use with the UTF-16 encoding
  form (as surrogate pairs) and do not directly represent characters."

and

  "Implementations of the decoding algorithm above MUST protect against
  decoding invalid sequences."
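
For anyone who wants to check this on their own interpreter, here is a minimal sketch. It assumes Python 3, whose UTF-8 codec rejects surrogate code points, so it will not reproduce the Python 2.5 behaviour described in this thread. The names INVALID, SURROGATE, strict_decode_fails and roundtrip are made up for illustration. It checks whether a strict decode refuses each byte string, and whether the "surrogateescape" error handler keeps the two file names distinct.

```python
# Sketch only: assumes Python 3. Python 2.5, as discussed in the thread,
# accepted b'\xed\xb3\xbf' and decoded it to a lone surrogate.

INVALID = b"\xff"            # never valid as UTF-8
SURROGATE = b"\xed\xb3\xbf"  # three-byte form of U+DCFF, prohibited by RFC 3629


def strict_decode_fails(raw: bytes) -> bool:
    """Return True if a strict UTF-8 decode rejects the byte string."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


def roundtrip(raw: bytes) -> bytes:
    """Decode with surrogateescape, then encode back with the same handler."""
    text = raw.decode("utf-8", "surrogateescape")
    return text.encode("utf-8", "surrogateescape")


if __name__ == "__main__":
    for name in (INVALID, SURROGATE):
        escaped = name.decode("utf-8", "surrogateescape")
        print(name, "strict decode fails:", strict_decode_fails(name))
        print(name, "escaped form:", ascii(escaped))
        print(name, "round trip intact:", roundtrip(name) == name)
```

If both byte strings are refused by the strict decode and both survive the round trip, then the two pathnames in the question above no longer collapse into the same string on that interpreter.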