2008/10/3 Glenn Linderman <[EMAIL PROTECTED]>:

> My understanding of the Posix file names is that any byte values are valid
> except "/" and null. Is this a correct understanding?

Yes (well, the names "." and ".." are reserved, and there might be length
restrictions).

> The UTF-8b proposal seems to translate from a non-UTF-8 byte stream to a
> Unicode character stream. Call the original byte stream FOO. The
> transformation then produces FOOTR, a set of Unicode code points. Now FOOTR
> has a representation in UTF-8, which is a byte stream, call that byte stream
> FOOTRUTF8. How, by looking at FOOTR, do you know whether it represents the
> file name FOO or FOOTRUTF8 ?

In the unpaired surrogate scheme: there is no FOOTRUTF8, because UTF-8 can
encode only Unicode scalar values (which exclude surrogates). Python strings
can contain surrogates (in 4-byte builds) or unpaired surrogates, which are
malformed UTF-16 (in 2-byte builds). In the filename context they can't be
represented in UTF-8, so they must mean escaped bytes.

In the U+0000 scheme: FOOTRUTF8 contains a 0 byte, which no POSIX filename
can contain, so the filename must mean FOO.

> but if it
> introduces null characters into the translated "file name", then there is
> file name parsing software that it will be incompatible with, which may be
> as problematic as not translating the file names in the first place...

What do you mean by "not translating"? If a piece of software validates
filenames while they are represented by Unicode strings, then they must have
been somehow translated from byte strings (on POSIX) or
UTF-16-assumed-but-not-guaranteed strings (on Windows).

-- 
Marcin Kowalczyk
[EMAIL PROTECTED]
http://qrnik.knm.org.pl/~qrczak/
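
A minimal sketch of the unpaired surrogate argument above, for illustration
only. It assumes a Python 3 version that provides the "surrogateescape" error
handler (added later by PEP 383, so it postdates this message and is not
necessarily the exact UTF-8b proposal under discussion). The sample bytes and
the variable names foo, footr and has_footrutf8 are made up to mirror the
FOO / FOOTR / FOOTRUTF8 naming in the question.

```python
# FOO: a POSIX-legal filename that is not valid UTF-8
# (the byte 0xFF never occurs in well-formed UTF-8).
foo = b"caf\xff"

# FOOTR: decode, escaping each undecodable byte as a lone low surrogate
# in the range U+DC80..U+DCFF.
footr = foo.decode("utf-8", errors="surrogateescape")

# Strict UTF-8 encodes only Unicode scalar values, so FOOTR has no
# UTF-8 form: there is no FOOTRUTF8 to confuse with FOO.
try:
    footr.encode("utf-8")
    has_footrutf8 = True
except UnicodeEncodeError:
    has_footrutf8 = False

# The lone surrogate can therefore only mean "escaped byte", and the
# original filename is recovered exactly.
recovered = footr.encode("utf-8", errors="surrogateescape")
```

Because has_footrutf8 ends up False and recovered equals foo, a string
containing such a surrogate maps back to exactly one byte string.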