On Mon, Sep 29, 2008 at 11:06 AM, Guido van Rossum <[EMAIL PROTECTED]> wrote:
> On Mon, Sep 29, 2008 at 9:45 AM, Georg Brandl <[EMAIL PROTECTED]> wrote:
>
>> This approach (changing all path-handling functions to accept either bytes
>> or string, but not both) is doomed in my eyes. First, there are lots of them,
>> second, they are not only in os.path but in many modules and also in user
>> code, and third, I see no clean way of implementing them in the specified
>> way.
>> (Just try to do it with os.path.join as an example; I couldn't find the
>> good way to write it, only the bad and the ugly...)
>
> It doesn't have to be supported for all operations -- just enough to
> be able to access all the system calls, and do the most basic pathname
> manipulations (split and join -- almost everything else can be built
> out of those).
>
>> If I had to choose, I'd still argue for the modified UTF-8 as filesystem
>> encoding (if it were UTF-8 otherwise), despite possible surprises when a
>> such-encoded filename escapes from Python.
>
> I'm having a hard time finding info about UTF-8b. Does anyone have a
> decent link?

http://mail.nl.linux.org/linux-utf8/2000-07/msg00040.html

Scroll down to item D, near the bottom. It turns malformed bytes into
lone (therefore malformed) surrogates.

> I noticed that OSX has a different approach yet. I believe it insists
> on valid UTF-8 filenames. It may even require some normalization but I
> don't know if the kernel enforces this. I tried to create a file named
> b'\xff' and it came out as %ff. Then "rm %ff" worked. So I think it
> may be replacing all bad UTF8 sequences with their % encoding.

I suspect Linux will eventually take this route as well. If ext3 had
an option for UTF-8 validation I know I'd want it on. That'd move the
error to the program creating bogus file names, rather than those
trying to read, display, and manage them.

> The "set filesystem encoding to be Latin-1" approach has a certain
> charm as well, but clearly would be a mistake on OSX, and probably on
> other systems too (whenever the user doesn't think in Latin-1).

Aye, it's a better hack than UTF-8b, but adding byte functions is even
better.

-- 
Adam Olsen, aka Rhamphoryncus
_______________________________________________
Python-3000 mailing list
Python-3000@python.org
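
For reference, the behaviour described for item D (each malformed byte becomes a lone surrogate) can be sketched in Python. This is only an illustration. It uses the surrogateescape error handler, which was added to Python after this thread (PEP 383, Python 3.1) and is assumed here to be a close stand-in for UTF-8b, not necessarily the exact scheme in the linked message. The sample file name is made up.

```python
# Sketch only: surrogateescape (PEP 383) is assumed to approximate UTF-8b.
raw = b"caf\xff.txt"  # hypothetical file name; 0xFF never occurs in valid UTF-8

# Decoding: each undecodable byte 0xNN is mapped to the lone surrogate U+DCNN.
name = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler turns the surrogate back into the original byte.
roundtrip = name.encode("utf-8", errors="surrogateescape")

# A strict encoder rejects the lone surrogate. This is the kind of surprise
# that can occur when such a name escapes from Python.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

print(ascii(name), roundtrip == raw, strict_ok)
```

The round trip is lossless, but the decoded string is not valid Unicode, so anything outside Python that receives it through a strict encoder will fail.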