>>>>> "Greg" == Greg Ewing <[EMAIL PROTECTED]> writes:
    Greg> But the base64 string itself *does* have text semantics.

What do you mean by that?  The strings of abstract "characters"
defined by RFC 3548 cannot be concatenated in general, they may only
be split at 4-character intervals, they can't be reliably searched as
text for a given octet or substring of the underlying binary object,
and deletion or insertion of octets can't be done without decoding and
re-encoding the whole string.  And of course humans can make neither
head nor tail of them in most cases.

The only useful semantics that they have is "you can apply the base64
decoder" to them.  In other words, by far the most important effect of
endowing that string with "text semantics" is to force programmers to
remember not to use them.

Do you really mean to call that "text semantics"?

    Greg> To me this is no different than using a string of decimal
    Greg> digit characters to represent an integer, or a string of
    Greg> hexadecimal digit characters to represent a bit
    Greg> pattern. Would you say that those are not text, either?

"No different"?  OK, I'll take you at your word.<wink>

T2YgY291cnNlIEkgd291bGQgY29uc2lkZXIgdGhvc2UgdGV4dC4gIFRoZXkncmUgaHVtYW4t
cmVhZGFibGUu

    Greg> What about XML? What would you consider the proper data type
    Greg> for an XML document to be inside a Python program -- bytes
    Greg> or text?

Neither.  If I must choose one of those ... well, "I know I have a
choice of programming languages, and I won't be using Python for this
task."  Fortunately, there's ElementTree.

What you presumably meant was "what would you consider the proper type
for (P)CDATA?"  And my answer is "text" for text, and "bytes" for
binary data (e.g., image or audio).  Let ElementTree handle the wire
format: if an Element's text attribute has type "bytes", convert to
base64 and then to the appropriate coded character set for the
channel.  I don't wanna know about the content transfer encoding, and
I should have no need to.
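To make that division of labor concrete, here is a minimal sketch of
what I have in mind.  ElementTree does not do this today: the helpers
element_for(), payload_of(), and to_wire(), and the "encoding"
attribute used as a marker, are all hypothetical names invented for
illustration.  The only point is that base64 shows up in exactly one
place, at the serialization boundary, and the application sees str
and bytes on both sides.

```python
import base64
import xml.etree.ElementTree as ET


def element_for(tag, payload):
    """Build an Element from a str (text) or bytes (binary) payload.

    Hypothetical helper, not part of ElementTree.  Bytes are
    base64-encoded here and flagged with an "encoding" attribute
    (an assumed convention, not a standard one).
    """
    elem = ET.Element(tag)
    if isinstance(payload, bytes):
        elem.set("encoding", "base64")
        elem.text = base64.b64encode(payload).decode("ascii")
    else:
        elem.text = payload
    return elem


def payload_of(elem):
    """Inverse of element_for(): return bytes or str as appropriate."""
    if elem.get("encoding") == "base64":
        return base64.b64decode(elem.text)
    return elem.text


def to_wire(root, charset):
    """Serialize the tree to octets in the channel's character set."""
    return ET.tostring(root, encoding=charset)


root = ET.Element("message")
root.append(element_for("note", "plain text"))
root.append(element_for("blob", b"\x00\xff\x10binary"))

# The same objects survive two different channel encodings; the caller
# never touches base64 or the charset of the base64 "characters".
for charset in ("us-ascii", "utf-16"):
    parsed = ET.fromstring(to_wire(root, charset))
    assert payload_of(parsed.find("note")) == "plain text"
    assert payload_of(parsed.find("blob")) == b"\x00\xff\x10binary"
```

Note that the caller picks the channel's charset only in to_wire(),
and never has to remember that the base64 step happened to produce
ASCII along the way.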
    Greg> You seem to want to reserve the term "text" for data that
    Greg> doesn't ever have to be understood even a little bit by a
    Greg> computer program, but that seems far too restrictive to me,
    Greg> and a long way from established usage.

What I want to reserve "text" for is data streams that nonprogrammer
humans might want to manipulate with pencil, paper, scissors, and
paste, or programmers with re and text[n:m] = text2.  I have no
objection to computers using it, too, and even asking us humans to
respect some restrictions on the use of [:]= and +.  But to tell us to
give up those operations entirely makes it into non-text IMO.

    Greg> [The] assumption [that the channel is ASCII-compatible] could
    Greg> be very wrong. What happens if it turns out they really need
    Greg> to be encoded as UTF-16, or as EBCDIC? All hell breaks
    Greg> loose, as far as I can see, unless the programmer has kept
    Greg> very firmly in mind that there is an implicit ASCII encoding
    Greg> involved.

    Greg> It's exactly to avoid the need for those kinds of mental
    Greg> gymnastics

Agreed, such bookkeeping would be annoying.  But there's no _need_ for
it any way you look at it: just leave binary objects as-is until
you're ready to put them on the wire.[1]  Attach a binary-to-wire codec
to this end of the wire, and inject your data there.  This puts the
responsibility where it belongs: with the author of the wire driver.

That's the point, which you already mentioned: nobody but authors of
wire drivers[2] and introspective code will need to _explicitly_ call
.encode('base64').

    Greg> that Py3k will have a unified, encoding-agnostic data type
    Greg> for all character strings.

Yeah, but if base64 produces character strings, Unicode becomes a
unified, encoding-agnostic data type for all data.  Just base64
everything, and now we don't need a bytes type, right?
Note that this is precisely what Emacs/MULE does (with a variable
width non-Unicode internal encoding and "base256" instead of base64),
so as demented as it may sound, it's all too historically plausible.
And it can be implemented, by accident, at the application program
level.  Why expose our users to increased risk of such trouble?


Footnotes:
[1]  Of course you may want to manipulate the binary data, even as
text.  But who's going to use the base64 format for that purpose?

[2]  I mean to include those who are writing the git.object_id(),
PGP_key.fingerprint(), and ElementTree.write() methods.

-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
   Ask not how you can "do" free software business;
              ask what your business can "do for" free software.
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev