On Fri, Oct 3, 2008 at 3:05 PM, Glenn Linderman <[EMAIL PROTECTED]> wrote:
> On approximately 10/3/2008 12:53 PM, came the following characters from the
> keyboard of James Y Knight:
>>
>> On Oct 3, 2008, at 3:24 PM, Glenn Linderman wrote:
>>>
>>> In order to work, the actual name must be preserved, or if translated,
>>> must be a reversible, 1-to-1 translation. A lot of discussion here has
>>> talked about reversible translations, but haven't noted the requirement that
>>> it be 1-to-1... and if the translation produces something that looks like it
>>> could be a file name, then the reverse translation is unlikely to be 1-to-1!
>>> Somewhere, you need to add a flag that indicates whether or not a reverse
>>> translation needs to be done, independently of the content of the translated
>>> name.
>>
>> That's not true. Both the U+0000 and UTF-8b proposals are 1-to-1
>> transforms.
>>
>> James
>
> My understanding of the Posix file names is that any byte values are valid
> except "/" and null. Is this a correct understanding?
>
> The UTF-8b proposal seems to translate from a non-UTF-8 byte stream to a
> Unicode character stream. Call the original byte stream FOO. The
> transformation then produces FOOTR, a set of Unicode code points. Now FOOTR
> has a representation in UTF-8, which is a byte stream, call that byte stream
> FOOTRUTF8. How, by looking at FOOTR, do you know whether it represents the
> file name FOO or FOOTRUTF8 ? And remember that the user might provide a
> Unicode character stream identical to FOOTR: should it be translated to FOO
> or FOOTRUTF8 when creating a new file according to the user-supplied name?

UTF-8b produces an *invalid* Unicode sequence, by way of lone surrogates.
Any attempt to encode or decode them using a validating UTF-8 (or
UTF-16/UTF-32) codec would fail, which is why they can be used
unambiguously. In other words, FOOTR is not Unicode (despite a
resemblance), so there is no valid FOOTRUTF8 to confuse it with, and it's
easy to be 1-to-1.

> So the U+0000 transform may be 1-to-1 since it introduces null characters
> into the translated "file name", which are effectively producing names that
> are invalid according to the Posix file name standard ... but if it
> introduces null characters into the translated "file name", then there is
> file name parsing software that it will be incompatible with, which may be
> as problematic as not translating the file names in the first place... deep
> analysis would have to be used to determine which problem is larger, or more
> significant. I've certainly been "guilty" of writing software that assumes
> that there are no null characters in a file name. I've even been "guilty"
> of writing software that assumes there are no space characters in a file
> name, although I've tried to break that habit in recent years...

Yup, the U+0000 form is Unicode, but it still can't be used with many
external APIs, as it's a transformation of the real file name. The only
real advantage is that you can store it in certain external formats, but
wouldn't you know it, XML isn't one of them[1]. Can you think of any
common formats where it would work?

[1] http://www.w3.org/International/questions/qa-controls

-- 
Adam Olsen, aka Rhamphoryncus
_______________________________________________
Python-3000 mailing list
Python-3000@python.org
http://mail.python.org/mailman/listinfo/python-3000
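
For anyone who wants to poke at the 1-to-1 argument, here is a minimal
sketch. It assumes that Python's "surrogateescape" error handler (PEP 383,
which postdates this thread) is a fair stand-in for the UTF-8b proposal. The
helper names and the sample file name are made up for illustration and are
not part of any proposal discussed here. It also checks the XML point from
the footnote by asking the standard library's parser whether it accepts a
character reference to U+0000.

```python
import xml.etree.ElementTree as ET


def utf8b_decode(raw: bytes) -> str:
    """Bytes -> str; each undecodable byte 0xNN becomes lone surrogate U+DCNN."""
    return raw.decode("utf-8", errors="surrogateescape")


def utf8b_encode(name: str) -> bytes:
    """str -> bytes; lone surrogates U+DC80..U+DCFF turn back into raw bytes."""
    return name.encode("utf-8", errors="surrogateescape")


def strict_utf8_accepts(name: str) -> bool:
    """True if a validating UTF-8 encoder accepts the string."""
    try:
        name.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


def xml_accepts_char_ref(code_point: int) -> bool:
    """True if the XML parser accepts a character reference to code_point."""
    try:
        ET.fromstring(f"<name>&#{code_point};</name>")
    except ET.ParseError:
        return False
    return True


# FOO: a made-up Latin-1 file name whose bytes are not valid UTF-8.
foo = b"caf\xe9.txt"
footr = utf8b_decode(foo)

# FOOTR carries the bad byte as a lone surrogate and maps back to FOO.
assert footr == "caf\udce9.txt"
assert utf8b_encode(footr) == foo

# A validating codec refuses FOOTR, so no "FOOTRUTF8" byte stream exists.
assert not strict_utf8_accepts(footr)

# Bytes that merely look like an encoded surrogate are themselves invalid
# UTF-8, so they decode to a different string and still round-trip.
pseudo = b"caf\xed\xb3\xa9.txt"
assert utf8b_decode(pseudo) != footr
assert utf8b_encode(utf8b_decode(pseudo)) == pseudo

# U+0000 cannot be written in XML, even as a character reference.
assert not xml_accepts_char_ref(0)
assert xml_accepts_char_ref(0x41)
```

The design point is that the escape characters live outside what a strict
codec will produce or accept, so no flag is needed to tell a translated name
from an ordinary one. The U+0000 approach keeps the result inside Unicode
instead, which is exactly why formats such as XML can still refuse it.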