Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

MRAB Sat, 25 Apr 2009 10:28:43 -0700

Martin v. Löwis wrote:

I see two main user-oriented use cases for the resulting Unicode
strings this PEP will produce on all systems: displaying a list of
filenames for the user to select from (an open file dialog), and
allowing a user to edit or supply a filename (a save dialog or a
rename control).


There are more, in particular the case "user passes a file name
on the command line", and "web server passes URL in environment
variable".

It's clear what this PEP provides for the former. On well-behaved
systems where a simpler filesystemencoding approach would work, the
results are identical; the user can select filenames that are what he
expects to see on both Unix and Windows. On less well-behaved systems,
some characters may appear as junk in the middle of the name (or would
they be invisible?)


Depends on the rendering. Try "print u'\udc00'" in your terminal to see
what happens; for me, it renders the glyph for "replacement character".
In GUI applications, you often see white boxes (rectangles).

What I don't find clear is what the risks are for the latter. On the
less well behaved system, a user may well attempt to use this python
application to fix filenames. Can we estimate a likelihood that edits
to the names would result in a Unicode string that can no longer be
encoded with the python-escape? Will a new name fully provided by a
user on his keyboard (ignoring copy and paste) almost always safely
encode?


That very much depends on the system setup, and your impression is
right that the PEP doesn't address it - it only deals with cases
where you get random unsupported bytes; getting random unsupported
characters from the user is not considered.

If the user has the locale setup in way that matches his keyboard,
it should work all fine - and will already, even without the PEP.
If the user enters a character that doesn't directly map to a
good file name, you get an exception, and have to tell the user
to pick a different filename.

Notice that it may fail at several layers:
- it may be that characters entered are not supported in what
  Python choses as the file system encoding.
- it may be that the characters are not supported by the file
  system, e.g. leading spaces in Win32.
- it may be that the file cannot be renamed because the target
  name already exists.
In all these cases, the application has to ask the user to
reconsider; for at least the last case, it should be prepared
to do that, anyway (there is also the case where renaming fails
because of lack of permissions; in that case, picking a different
file name won't help).

This has made me think about what happens going the other way, ie when auser-supplied Unicode string needs to be converted to UTF-8b. Thatshould also be reversible.


Therefore:

When encoding using UTF-8b, codepoints in the range U+DC80..U+DCFF
should map to bytes 0x80..0xFF; all other codepoints, including the
remaining half surrogates, should be encoded normally.

When decoding using UTF-8b, undecodable bytes in the range 0x80..0xFF
should map to U+DC80..U+DCFF; all other bytes, including the encodings
for the remaining half surrogates, should be decoded normally.

This will ensure that even when the user has provided a string
containing half surrogates it can be encoded to bytes and then decoded
back to the original string.
_______________________________________________
Python-Dev mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

Reply via email to