Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

Glenn Linderman Tue, 28 Apr 2009 13:37:41 -0700

On approximately 4/28/2009 1:25 PM, came the following characters from the keyboard of Martin v. Löwis:

The UTF-8b representation suffers from the same potential ambiguities as the PUA characters...


Not at all the same ambiguities. Here, again, the two choices:

A. use PUA characters to represent undecodable bytes, in particular for
   UTF-8 (the PEP actually never proposed this to happen).
   This introduces an ambiguity: two different files in the same
   directory may decode to the same string name, if one has the PUA
   character, and the other has a non-decodable byte that gets decoded
   to the same PUA character.

B. use UTF-8b, representing the byte with ill-formed surrogate codes.
   The same ambiguity does *NOT* exist. If a file on disk already
   contains an invalid surrogate code in its file name, then the UTF-8b
   decoder will recognize this as invalid, and decode it byte-for-byte,
   into three surrogate codes. Hence, the file names that are different
   on disk are also different in memory. No ambiguity.

C. File on disk with the invalid surrogate code, accessed via the str
   interface, no decoding happens, matches in memory the file on disk
   with the byte that translates to the same surrogate, accessed via
   the bytes interface. Ambiguity.
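
For readers who want to try cases A and B themselves, here is a rough sketch. It assumes that the "surrogateescape" error handler in current Python 3 is a fair stand-in for the UTF-8b scheme discussed above. The PUA handler, its name "pua_escape_demo", and the base code point 0xF700 are hypothetical and exist only to illustrate case A; nothing in the thread specifies them.

```python
import codecs

# Hypothetical PUA scheme for case A: map undecodable byte 0xNN to U+F7NN.
# The base value is an assumption made for this illustration only.
PUA_BASE = 0xF700


def _pua_escape(err):
    """Replace each undecodable byte with a private-use character."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    return "".join(chr(PUA_BASE + b) for b in bad), err.end


codecs.register_error("pua_escape_demo", _pua_escape)


def decode_name(raw: bytes) -> str:
    """Case B: each undecodable byte 0xNN becomes the lone surrogate U+DCNN."""
    return raw.decode("utf-8", "surrogateescape")


def encode_name(name: str) -> bytes:
    """Inverse of decode_name: lone surrogates turn back into raw bytes."""
    return name.encode("utf-8", "surrogateescape")


# Case A: a valid PUA character and an undecodable byte collide.
pua_1 = "\uf7ff".encode("utf-8").decode("utf-8", "pua_escape_demo")
pua_2 = b"\xff".decode("utf-8", "pua_escape_demo")
print(ascii(pua_1), ascii(pua_2), pua_1 == pua_2)

# Case B: an undecodable byte versus the ill-formed encoding of U+DCFF.
name_1 = decode_name(b"\xff")
name_2 = decode_name(b"\xed\xb3\xbf")
print(ascii(name_1), ascii(name_2), name_1 == name_2)

# Both names encode back to the bytes they came from.
print(encode_name(name_1), encode_name(name_2))
```

This only exercises the decode and encode paths on byte strings. It does not settle case C, which concerns names that reach the str interface without passing through the decoder at all.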


--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
