>>>>> "Greg" == Greg Ewing <[EMAIL PROTECTED]> writes:
Greg> Stephen J. Turnbull wrote:

>> Base64 is a (family of) wire protocol(s).  It's not clear to me
>> that it makes sense to say that the alphabets used by "baseNN"
>> encodings are composed of characters,

Greg> Take a look at [this that the other]

Those references use "character" in an ambiguous and ill-defined way.
Trying to impose Python unicode object semantics on such vague
"characters" is a bad idea, IMO.

Greg> Which seems to make it perfectly clear that the result of
Greg> the encoding is to be considered as characters, which are
Greg> not necessarily going to be encoded using ascii.

Please define "character," and explain how its semantics map to
Python's unicode objects.

Greg> So base64 on its own is *not* a wire protocol. Only after
Greg> encoding the characters do you have a wire protocol.

No, base64 isn't a wire protocol.  Rather, it's a schema for a family
of wire protocols, whose alphabets are heuristically chosen on the
assumption that code units which happen to correspond to alphanumeric
code points in a commonly used coded character set are more likely to
pass through a communication channel without corruption.

Note that I have _precisely_ defined what I mean.  You still have the
problem that you haven't defined "character," and that is a real
problem; see below.

>> I don't see any case for "correctness" here, only for
>> convenience,

Greg> I'm thinking of convenience, too.  Keep in mind that in Py3k,
Greg> 'unicode' will be called 'str' (or something equally neutral
Greg> like 'text') and you will rarely have to deal explicitly
Greg> with unicode codings, this being done mostly for you by the
Greg> I/O objects.  So most of the time, using base64 will be just
Greg> as convenient as it is today: base64_encode(my_bytes) and
Greg> write the result out somewhere.

Convenient, yes, but incorrect.  Once you mix those bytes into the
Python string type, they become subject to all the usual operations
on characters, and there's no way for Python to tell you that you
didn't want to do that.

Greg> Whereas if the result is text, the right thing happens
Greg> automatically whatever the ultimate encoding turns out to
Greg> be. You can take the text from your base64 encoding, combine
Greg> it with other text from any other source to form a complete
Greg> mail message or xml document or whatever, and write it out
Greg> through a file object that's using any unicode encoding at
Greg> all, and the result will be correct.

Only if you apply no transformations that harm the base64 encoding.
This is why I say base64 is _not_ based on characters, at least not
as characters are used in Python strings: it does not survive the
usual transformations on characters that might be applied globally
to a mail composition buffer, for example.

In other words, you don't escape from the programmer having to know
what he's doing.  EIBTI, and the setup I advocate forces the
programmer to decide explicitly where to convert base64 objects to a
textual representation.  That reminds him that he'd better not touch
that text.

Greg> The reason I say it's *correct* is that if you go straight
Greg> from bytes to bytes, you're *assuming* the eventual encoding
Greg> is going to be an ascii superset.  The programmer is going
Greg> to have to know about this assumption and understand all its
Greg> consequences and decide whether it's right, and if not, do
Greg> something to change it.

I'm not assuming any such thing, except when analyzing
implementation efficiency.
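For concreteness, here is a minimal sketch of the failure mode I have
in mind when I say "transformations that harm the base64 encoding".
It is written in the notation of a Python that already has a bytes
type, against the b64encode/b64decode names in the current base64
module, purely for illustration; the proper Py3k spelling is exactly
what's in question here:

    import base64

    payload = b"\x13\x37\xbe\xef"   # arbitrary binary data

    # The explicit step from base64 output to text is the one I want
    # the programmer to be forced to write:
    text = base64.b64encode(payload).decode("ascii")

    # Once it is text, every ordinary character operation applies,
    # and an innocent-looking transformation silently destroys the
    # encoded data:
    mangled = text.lower()

    assert base64.b64decode(text) == payload
    assert base64.b64decode(mangled) != payload  # decodes, but to the
                                                 # wrong bytes

A bytes->bytes base64 at least keeps the result out of reach of
.lower() and friends until the programmer explicitly decides to make
text of it.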
And the programmer needs to know about the semantics of text that is
really a base64-encoded object, and that they are different from
string semantics.  This is something programmers are used to dealing
with in the case of Python 2.x str and C char[]; the whole point of
the unicode type is to let the programmer abstract away from that
when dealing with human-readable text.  Why confuse the issue?

>> And in the classroom, you're just going to confuse students by
>> telling them that UTF-8 --[Unicode codec]--> Python string is
>> decoding but UTF-8 --[base64 codec]--> Python string is
>> encoding, when MAL is telling them that --> Python string is
>> always decoding.

Greg> Which is why I think that only *unicode* codings should be
Greg> available through the .encode and .decode interface.  Or
Greg> alternatively there should be something more explicit like
Greg> .unicode_encode and .unicode_decode that is thus restricted.

Greg> Also, if most unicode coding is done in the I/O objects,
Greg> there will be far less need for programmers to do explicit
Greg> unicode coding in the first place, so likely it will become
Greg> more of an advanced topic, rather than something you need to
Greg> come to grips with on day one of using unicode, like it is
Greg> now.

So then you bring it right back in with base64: now they need to
know about bytes<->unicode codecs after all.

Of course it all comes down to a matter of judgment.  I do find your
position attractive, but I just don't think it will work for naive
users the way you think it will.  It's also possible to state the
rationale for my approach precisely, which I have not been able to
do for the "base64 uses characters" approach, and nobody else has
demonstrated such a statement yet.

On the other hand, I don't think either approach imposes
substantially more burden on the advanced programmer, nor does
either proposal involve a specific restriction on usage (aka
"dumbing down the language").

-- 
School of Systems and Information Engineering
University of Tsukuba
http://turnbull.sk.tsukuba.ac.jp
Tennodai 1-1-1, Tsukuba 305-8573 JAPAN

Ask not how you can "do" free software business;
ask what your business can "do for" free software.