On Fri, Oct 3, 2008 at 5:02 PM, Glenn Linderman <[EMAIL PROTECTED]> wrote:
> On approximately 10/3/2008 2:36 PM, came the following characters from the
> keyboard of Adam Olsen:
>>
>> UTF-8b produces an *invalid* unicode sequence, via lone surrogates. Any
>> attempt to encode or decode using a validating UTF-8 (or
>> UTF-16/UTF-32) codec would reject them, which is why they can
>> unambiguously be used.
>>
>> In other words, it's not unicode (despite a resemblance), so it's easy
>> to be 1-to-1.
>
> Sort of. There is no numerical reason they cannot be represented in a
> UTF-8-like numeric encoding scheme. It is only rules and regulations that
> prevent it. So FOOTRUTF8 can exist, just not legally. If the expectation
> is that an illegal UTF-16 code can be used, to permit the UTF-8b translation
> scheme to work at all, then it seems reasonable to expect that an illegal
> translation of it to UTF-8 might happen also, which means that the
> transformation isn't 1-to-1!

No, UTF-8b can't be translated to UTF-8. It's illegal.

> I think someone demonstrated the use of unpaired surrogates in the Windows
> filename context the other day. Whether that is a bug or not, it is the
> current state of affairs, someone might read a name from Windows and want to
> create it on Posix... what happens? If we implement UTF-8b, I know what
> would happen. But what would happen if we don't, today, on a Posix Python
> 3? Would it use FOOTRUTF8 or would it generate an error? I don't suppose
> it matters a lot, it is stupidity to use such names whether or not the
> prevention of it is enforced.

If Python worked properly? The illegal unicode object would get an
encoding error when you tried to translate it to UTF-8 to send it over to
the Posix box. You'd have to alter all the software that touches it to use
your looks-like-but-isn't-quite-unicode, rather than using the real
unicode.

That's why I favour validating the Windows API too, and making the raw
API be the raw UTF-16 (rather than letting it get encoded into a
single-byte encoding). The rawness is what bytes need, not ASCII
similarity.

> But if someone on Posix is creating non-Python software that uses illegal
> lone surrogates, illegally UTF-8 coding them to create the file, and then
> giving them to a Python program to manipulate the content, things could get
> confused, if UTF-8b translations happen under the Python covers... the
> Python program would attempt to open a different file than the non-Python
> software created.

No, they can't illegally use UTF-8. It's not UTF-8, period. It's just
garbage.
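
To illustrate the point about validating codecs, here is a minimal Python
sketch. It assumes a later Python 3 release in which the UTF-8b idea is
available as the 'surrogateescape' error handler and in which
'surrogatepass' exists for deliberately producing the illegal bytes.
Neither handler is part of the discussion above, and the helper names and
the sample file name are hypothetical.

```python
# Sketch only: assumes a Python 3 release that provides the
# 'surrogateescape' and 'surrogatepass' error handlers.

def strict_utf8_encodes(text):
    """Return True if a validating UTF-8 encoder accepts the string."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


def strict_utf8_decodes(data):
    """Return True if a validating UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# A hypothetical Posix file name that is not valid UTF-8 (Latin-1 e-acute).
raw = b"caf\xe9.txt"

# UTF-8b style decoding: the undecodable byte 0xE9 becomes the lone
# surrogate U+DCE9, so the result is not a legal unicode string.
name = raw.decode("utf-8", "surrogateescape")

# The mapping is 1-to-1: the original bytes come back unchanged.
round_tripped = name.encode("utf-8", "surrogateescape")

# A validating encoder refuses the lone surrogate.
encodes_strictly = strict_utf8_encodes(name)

# Forcing the surrogate through a UTF-8-like scheme yields bytes that a
# validating decoder refuses as well, so they are not UTF-8.
illegal = "\udce9".encode("utf-8", "surrogatepass")
decodes_strictly = strict_utf8_decodes(illegal)
```

In this sketch the strict codec rejects the lone surrogate in both
directions, which is what keeps the escaped form from being confused with
real UTF-8 data.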

> Seems like attempts to manipulate and transform names are doomed to failure;
> the approach of having a bytes level interface seems to be the correct one,
> glad that seems to be the approach that Victor is implementing and Guido is
> favoring, although it is a pity that it can't be fully encapsulated into an
> object in time for 3.0, leaving us with multiple APIs for file access, and a
> potential future translation to an encapsulated object approach.

The bytes object covers 90% of the raw usage. The other 10% is a lossy
decoding to unicode. I much prefer that to be explicit, so an attribute
may do, say b.decode('UTF-8', 'replace')? Or do we need a subtype of
bytes, just to reduce that to 5-8 characters?

-- 
Adam Olsen, aka Rhamphoryncus
_______________________________________________
Python-3000 mailing list
Python-3000@python.org
http://mail.python.org/mailman/listinfo/python-3000