Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

Glenn Linderman Mon, 27 Apr 2009 00:14:33 -0700

On approximately 4/25/2009 5:22 AM, came the following characters from the keyboard of Martin v. Löwis:

The problem with this, and other preceding schemes that have been
discussed here, is that there is no means of ascertaining whether a
particular file name str was obtained from a str API, or was funny-
decoded from a bytes API... and thus, there is no means of reliably
ascertaining whether a particular filename str should be passed to a
str API, or funny-encoded back to bytes.


Why is it necessary that you are able to make this distinction?

It is necessary that programs (not me) can make the distinction, so that they know whether or not to do the funny-encoding. If a name is funny-decoded when it is obtained from a directory listing, it needs to be funny-encoded in order to open the file.
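To make the round trip concrete, here is a minimal sketch. It assumes the "surrogateescape" error handler from the PEP as it was eventually accepted (Python 3.1 and later), which is not the wording under discussion in this message, and the file name is made up for illustration.

```python
# Sketch only: "surrogateescape" is the handler name from the accepted
# PEP 383 (Python 3.1+); the file name below is a made-up example.
raw_name = b"caf\xe9.txt"  # Latin-1 bytes, not valid UTF-8

# "Funny-decode": what a directory listing would hand back as str.
listed = raw_name.decode("utf-8", "surrogateescape")

# "Funny-encode": required to recover the original bytes for open().
reopened = listed.encode("utf-8", "surrogateescape")

# A plain str-style encode of the same name does not work.
naive_failed = False
try:
    listed.encode("utf-8")
except UnicodeEncodeError:
    naive_failed = True

print(repr(listed), reopened == raw_name, naive_failed)
```

The point of the sketch is only that the str obtained from the listing cannot be handed to an ordinary encoder; the program has to know it was funny-decoded.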

Picking a character (I don't find U+F01xx in the
Unicode standard, so I don't know what it is)


It's a private use area. It will never carry an official character
assignment.

I know that U+F0000 - U+FFFFF is a private use area. I don't find a definition of U+F01xx to know what the notation means. Are you picking a particular character within the private use area, or a particular range, or what?

As I realized in the email-sig, in talking about decoding corrupted
headers, there is only one way to guarantee this... to encode _all_
character sequences, from _all_ interfaces.  Basically it requires
reserving an escape character (I'll use ? in these examples -- yes, an
ASCII question mark -- happens to be illegal in Windows filenames so
all the better on that platform, but the specific character doesn't
matter... avoiding / \ and . is probably good, though).


I think you'll have to write an alternative PEP if you want to see
something like this implemented throughout Python.

I'm certainly not experienced enough in Python development processes or internals to attempt such, as yet. But somewhere in 25 years of programming, I picked up the knowledge that if you want to have a 1-to-1 reversible mapping, you have to avoid data puns, mappings of two different data values into a single data value. Your PEP, as first written, didn't seem to do that... since there are two interfaces from which to obtain data values, one performing a mapping from bytes to "funny invalid" Unicode, and the other performing no mapping, but accepting any sort of Unicode, possibly including "funny invalid" Unicode, the possibility of data puns seems to exist. I may be misunderstanding something about the use cases that prevent these two sources of "funny invalid" Unicode from ever coexisting, but if so, perhaps you could point it out, or clarify the PEP. I'll try to reread it again... could you post a URL to the most up-to-date version of the PEP, since I haven't seen such appear here, and the version I found via a Google search seems to be the original?
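A data pun of this kind can be sketched as follows. The mapping used here (an undecodable byte b becomes the private-use code point U+F0100 + b) is an assumption chosen to match the U+F01xx notation quoted above, not the PEP's actual text, and the names PUA_BASE, pua_escape_demo and funny_decode are hypothetical.

```python
import codecs

# Assumed mapping for illustration: undecodable byte b -> U+F0100 + b.
PUA_BASE = 0xF0100

def _pua_escape(err):
    # Decode-error handler: replace each offending byte with a PUA character.
    if isinstance(err, UnicodeDecodeError):
        bad = err.object[err.start:err.end]
        return "".join(chr(PUA_BASE + b) for b in bad), err.end
    raise err

codecs.register_error("pua_escape_demo", _pua_escape)

def funny_decode(raw: bytes) -> str:
    return raw.decode("utf-8", "pua_escape_demo")

# Source 1: a file name containing a single undecodable byte.
name_a = b"\x80"
# Source 2: a file name that is the valid UTF-8 encoding of U+F0180.
name_b = chr(PUA_BASE + 0x80).encode("utf-8")

# Two different byte strings collapse into one str, so the str alone
# cannot say which bytes to write back.
print(name_a != name_b, funny_decode(name_a) == funny_decode(name_b))
```

Under this assumed mapping both names decode to the same one-character str, which is exactly the ambiguity described above: the mapping is only reversible if the second kind of input can never occur or is itself escaped.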



--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking