Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

Glenn Linderman Thu, 30 Apr 2009 01:05:35 -0700

On approximately 4/29/2009 7:50 PM, came the following characters from the keyboard of Aahz:

On Thu, Apr 30, 2009, Cameron Simpson wrote:

The lengthy discussion mostly revolves around:


  - Glenn points out that strings that came _not_ from listdir, and that are
    _not_ well-formed unicode (== "have bare surrogates in them") but that
    were intended for use as filenames will conflict with the PEP's scheme -
    programs must know that these strings came from outside and must be
    translated into the PEP's funny-encoding before use in the os.*
    functions. Previous to the PEP they would get used directly and
    encode differently after the PEP, thus producing different POSIX
    filenames. Breakage.

  - Glenn would like the encoding to use Unicode scalar values only,
    using a rare-in-filenames character.
    That would avoid the issue with "outside" strings that contain
    surrogates. To my mind it just moves the punning from rare illegal
    strings to merely uncommon but legal characters.

  - Some parties think it would be better to not return strings from
    os.listdir but a subclass of string (or at least a duck-type of
    string) that knows where it came from and is also handily
    recognisable as not-really-a-string for purposes of deciding
    whether it is PEP-funny-encoded by direct inspection.


Assuming people agree that this is an accurate summary, it should be
incorporated into the PEP.

I'll agree that once the other misconceptions were explained away, the remaining issues are those Cameron summarized. Thanks for the summary!
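
The first issue in the summary can be illustrated with a small sketch. It assumes the PEP's error handler under the name it shipped with in Python 3.1, "surrogateescape", and uses "surrogatepass" as a stand-in for an encoder that simply passes bare surrogates through (an assumption about the pre-PEP behaviour, not a quote from the PEP). The filename is hypothetical.

```python
# A name containing a bare surrogate that did NOT come from os.listdir().
# (Hypothetical example string.)
name = "file\udc80"

# Stand-in for "before the PEP" (assumption): the surrogate is encoded
# directly as a 3-byte UTF-8-style sequence.
before = name.encode("utf-8", "surrogatepass")

# PEP 383 handler: the surrogate U+DC80 is treated as an escaped byte 0x80.
after = name.encode("utf-8", "surrogateescape")

print(before)  # b'file\xed\xb2\x80'
print(after)   # b'file\x80'
```

The same string yields two different POSIX filenames, which is the breakage described in the first point.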

Point two could be modified because I've changed my opinion; I like the invariant Cameron first (I think) explicitly stated about the PEP as it stands, and that I just reworded in another message, that the strings that are altered by the PEP in either direction are in the subset of strings that contain fake (from a strict Unicode viewpoint) characters. I still think an encoding that uses mostly real characters that have assigned glyphs would be better than the encoding in the PEP; but would now suggest that an escape character be a fake character.

I'll note here that while the PEP encoding causes illegal bytes to be translated to one fake character, the 3-byte sequence that looks like the range of fake characters would also be translated to a sequence of 3 fake characters. This is 512 combinations that must be translated, and understood by the user (or at least by the programmer). The "escape sequence" approach requires changing only 257 combinations, and each altered combination would result in exactly 2 characters. Hence, this seems simpler to understand, and to manually encode and decode for debugging purposes.
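
The one-character versus three-character behaviour of the PEP encoding can be checked with a short sketch. It assumes a UTF-8 filesystem encoding and the handler name "surrogateescape" from Python 3.1 and later; the byte strings are illustrative only.

```python
# One illegal byte becomes exactly one fake (lone surrogate) character.
single = b"\x80".decode("utf-8", "surrogateescape")
print([hex(ord(c)) for c in single])   # ['0xdc80']

# The 3-byte sequence that "looks like" U+DC80 in UTF-8 is not valid
# strict UTF-8, so each of its bytes is escaped separately.
raw = b"\xed\xb2\x80"
decoded = raw.decode("utf-8", "surrogateescape")
print([hex(ord(c)) for c in decoded])  # ['0xdced', '0xdcb2', '0xdc80']

# Encoding with the same handler restores the original bytes.
roundtrip = decoded.encode("utf-8", "surrogateescape")
print(roundtrip == raw)                # True
```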


--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
