On approximately 4/29/2009 7:50 PM, came the following characters from the keyboard of Aahz:
On Thu, Apr 30, 2009, Cameron Simpson wrote:
The lengthy discussion mostly revolves around:

  - Glenn points out that strings that came _not_ from listdir, and that are
    _not_ well-formed unicode (== "have bare surrogates in them") but that
    were intended for use as filenames will conflict with the PEP's scheme -
    programs must know that these strings came from outside and must be
    translated into the PEP's funny-encoding before use in the os.*
    functions. Previous to the PEP they would get used directly and
    encode differently after the PEP, thus producing different POSIX
    filenames. Breakage.

  - Glenn would like the encoding to use Unicode scalar values only,
    using a rare-in-filenames character.
    That would avoid the issue with "outside" strings that contain
    surrogates. To my mind it just moves the punning from rare illegal
    strings to merely uncommon but legal characters.

  - Some parties think it would be better to not return strings from
    os.listdir but a subclass of string (or at least a duck-type of
    string) that knows where it came from and is also handily
    recognisable as not-really-a-string for purposes of deciding
    whether it is PEP-funny-encoded by direct inspection.

Assuming people agree that this is an accurate summary, it should be
incorporated into the PEP.

I agree that, once the other misconceptions were explained away, the remaining issues are those Cameron summarized. Thanks for the summary!

Point two could be modified, because I've changed my opinion. I like the invariant that Cameron first (I think) stated explicitly about the PEP as it stands, and that I just reworded in another message: the strings altered by the PEP, in either direction, all lie in the subset of strings that contain fake (from a strict Unicode viewpoint) characters. I still think an encoding that uses mostly real characters with assigned glyphs would be better than the encoding in the PEP, but I would now suggest that the escape character be a fake character.
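To make that invariant concrete, here is a small sketch. It assumes a UTF-8 locale and uses the "surrogateescape" error handler as it exists in current Python 3; that handler name is not from this thread, and the helper to_filename_bytes is a hypothetical name for illustration only.

```python
# Sketch only: "surrogateescape" is the handler name in current Python 3,
# assumed here to stand in for the PEP's scheme under a UTF-8 locale.

def to_filename_bytes(name: str) -> bytes:
    """Hypothetical helper: encode a str as the os.* functions would."""
    return name.encode("utf-8", "surrogateescape")

well_formed = "caf\u00e9.txt"   # ordinary, well-formed Unicode
with_fake = "caf\udce9.txt"     # contains a lone surrogate ("fake" character)

# A well-formed string is untouched by the scheme: same bytes as strict UTF-8.
same_as_strict = to_filename_bytes(well_formed) == well_formed.encode("utf-8")

# A string containing a fake character is altered: the surrogate becomes
# the single raw byte 0xE9 rather than any UTF-8 sequence.
altered = to_filename_bytes(with_fake)

# The reverse direction restores the fake character.
restored = altered.decode("utf-8", "surrogateescape")
```

Only with_fake is treated differently from plain UTF-8, which is also where a string that did not come from listdir but happens to contain a bare surrogate would change meaning.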

I'll note here that while the PEP encoding translates each illegal byte to one fake character, a 3-byte sequence that looks like an encoded fake character would also be translated, in that case to a sequence of 3 fake characters. That makes 512 combinations that must be translated, and understood by the user (or at least by the programmer). The "escape sequence" approach requires changing only 257 combinations, and each altered combination would result in exactly 2 characters. Hence it seems simpler to understand, and to encode and decode manually for debugging purposes.
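The one-versus-three behaviour can be checked with a short sketch. It again assumes a UTF-8 locale and the "surrogateescape" handler from current Python 3 (a name that postdates this message); decode_name is a hypothetical helper, and the sketch covers only the PEP's scheme, not the escape-sequence alternative.

```python
# Sketch only: assumes UTF-8 and Python 3's "surrogateescape" handler.

def decode_name(raw: bytes) -> str:
    """Hypothetical helper: decode filename bytes as the PEP's scheme would."""
    return raw.decode("utf-8", "surrogateescape")

# A single illegal byte maps to a single fake character.
one_bad_byte = decode_name(b"\xff")

# These three bytes are what UTF-8 would produce for U+DCFF if surrogates
# were encodable. They are not valid UTF-8, so each byte is escaped
# separately and the result is three fake characters.
looks_like_fake = decode_name(b"\xed\xb3\xbf")

# Both cases round-trip back to the original bytes.
round_trip_ok = (
    one_bad_byte.encode("utf-8", "surrogateescape") == b"\xff"
    and looks_like_fake.encode("utf-8", "surrogateescape") == b"\xed\xb3\xbf"
)
```

Here len(one_bad_byte) is 1 and len(looks_like_fake) is 3, which is the asymmetry a reader has to keep in mind when decoding such names by hand.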

--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev