On Sep 29, 2008, at 11:11 PM, Stephen J. Turnbull wrote:
> > Except...that one over there. That's the whole point of UTF-8b:
> > correctly encoded names get decoded correctly and readably, and the
> > other cases get decoded into something unique that cannot possibly
> > conflict.
>
> Sure. But there are lots of other operations besides encoding and
> decoding that we do with filenames. How do you display a filename?
> How about concatenating them to make paths? What do you do when you
> want to mix a filename with other, well-formed strings? If you keep
> the filenames internally in UTF-8b, you're going to need what amounts
> to a whole string API for dealing with them, aren't you? If you're
> not doing that, how is UTF-8b represented?

No, you keep the filenames internally in a PyUnicode object. All that
stuff *works* in Python today, with a UTF-8b decoded string.
Displaying a filename is encoding it into some other encoding. Like
this:
>>> '\x90\x90'.decode('utf-8b')
u'\udc90\udc90'
>>> u'\udc90\udc90'.encode('utf-8')
'\xed\xb2\x90\xed\xb2\x90'
So, that seems to work okay. Maybe I should try to display that in a
web browser. Shows up as 2 "unknown character" glyphs. Perfect.
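
For anyone without a 'utf-8b' codec at hand, here is a rough sketch of
the same round trip. It assumes that Python 3's "surrogateescape" error
handler behaves like utf-8b for this purpose; that handler is a
stand-in, not the codec used in the session above.

```python
# Assumption: Python 3's "surrogateescape" error handler stands in for
# the "utf-8b" codec used in the session above.
raw = b'\x90\x90'  # two bytes that are not valid UTF-8

# Each undecodable byte 0xNN is mapped to the lone surrogate U+DCNN.
decoded = raw.decode('utf-8', 'surrogateescape')

# Encoding with the same handler gives back the original bytes.
roundtrip = decoded.encode('utf-8', 'surrogateescape')

# For display, pick an error policy; 'replace' turns each escaped
# byte into '?'.
displayable = decoded.encode('utf-8', 'replace')

# Correctly encoded names decode exactly as they would with plain UTF-8.
plain = 'caf\u00e9'.encode('utf-8').decode('utf-8', 'surrogateescape')
```

Note that Python 3 refuses to encode lone surrogates with a plain
.encode('utf-8'), so the display step has to name an error handler.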
If you want to mix a filename with other strings, you append them
together, or use os.path, same as always. You don't need any new
string API.
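
A small sketch of that, again assuming "surrogateescape" as a stand-in
for utf-8b. The directory names are made up for illustration.

```python
import os.path

# Assumption: "surrogateescape" stands in for the utf-8b codec.
bad_name = b'\x90\x90.txt'.decode('utf-8', 'surrogateescape')

# Hypothetical directory names, for illustration only.
path = os.path.join('home', 'james', bad_name)

# Ordinary string operations work on the result.
label = 'file: ' + path

# Encoding with the same handler recovers the exact bytes for the OS.
on_disk = path.encode('utf-8', 'surrogateescape')
```

The undecodable bytes ride along through join and concatenation and come
back out unchanged when the path is encoded again.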
Since things seem to work in everything I've tried, I'd really like the
opponents of utf-8b to say precisely what fails.
And again: if utf-8b isn't acceptable, because it does break things in
some unknown-to-me way, I really can't imagine anything working but
just going back to byte-string access as the only API. It's really not
okay for the "obvious" APIs to be totally broken by unexpected input.
Think os.getcwd(), sys.argv, os.environ. You can't just ignore bad
files and call it done.
James
_______________________________________________
Python-3000 mailing list
Python-3000@python.org
http://mail.python.org/mailman/listinfo/python-3000