Stephen J. Turnbull wrote:

>     Greg> I'd be perfectly happy with ascii characters, but in Py3k,
>     Greg> the most natural place to keep ascii characters will be in
>     Greg> character strings, not byte arrays.
>
> Natural != practical.

That seems to be another thing we disagree about -- to me it seems both natural *and* practical.

The whole business of stuffing binary data down a text channel is a practicality-beats-purity kind of thing. You wouldn't do it if you had a real binary channel available, but if you don't, it's better than nothing.

> The base64 string is a representation of an object
> that doesn't have text semantics.

But the base64 string itself *does* have text semantics. That's the whole point of base64 -- to represent a non-text object *using* text.

To me this is no different than using a string of decimal digit characters to represent an integer, or a string of hexadecimal digit characters to represent a bit pattern. Would you say that those are not text, either?

What about XML? What would you consider the proper data type for an XML document to be inside a Python program -- bytes or text? I'm genuinely interested in your answer to that, because I'm trying to understand where you draw the line between text and non-text.

You seem to want to reserve the term "text" for data that doesn't ever have to be understood even a little bit by a computer program, but that seems far too restrictive to me, and a long way from established usage.

> Nor do base64 strings have text semantics: they can't even
> be concatenated as text ... So if you
> wish to concatenate the underlying objects, the base64 strings must be
> decoded, concatenated, and re-encoded in the general case.

You can't add two integers by concatenating their base-10 character representation, either, but I wouldn't take that as an argument against putting decimal numbers into text files.

Also, even if we follow your suggestion and store our base64-encoded data in byte arrays, we *still* wouldn't be able to concatenate the original data just by concatenating those byte arrays. So this argument makes no sense either way.

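To make the concatenation point concrete, here is a minimal sketch. It assumes the `base64` module as it exists in current Python 3, where `b64encode` returns bytes, so the hypothetical helper `b64_text` decodes the result to a character string explicitly. The sample values are made up for illustration.

```python
import base64


def b64_text(data: bytes) -> str:
    """Hypothetical helper: base64-encode bytes and return the result as text."""
    # b64encode returns bytes in Python 3; its output is pure ASCII,
    # so decoding to str is lossless.
    return base64.b64encode(data).decode("ascii")


a, b = b"hello", b"world"

# Joining the encoded forms is not the encoded form of the joined data...
joined_encodings = b64_text(a) + b64_text(b)
encoding_of_joined = b64_text(a + b)

# ...and keeping the encoded forms in byte arrays does not change that.
joined_bytes = base64.b64encode(a) + base64.b64encode(b)
bytes_of_joined = base64.b64encode(a + b)

# Same situation with decimal digits: concatenation is not addition.
digits = str(12) + str(34)

print(joined_encodings == encoding_of_joined)
print(joined_bytes == bytes_of_joined)
print(int(digits) == 12 + 34)
```

With these inputs all three comparisons come out false, so recovering the concatenated data means decoding first whether the base64 form is held as text or as bytes.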
> IMO it's not worth preserving the very superficial
> coincidence of "character representation"

I disagree entirely that it's superficial. On the contrary, it seems to me to be the very essence of what base64 is all about.

If there's any "coincidence of representation" it's in the idea of storing the result as ASCII bit patterns in a byte array, on the assumption that that's probably how they're going to end up being represented down the line.

That assumption could be very wrong. What happens if it turns out they really need to be encoded as UTF-16, or as EBCDIC? All hell breaks loose, as far as I can see, unless the programmer has kept very firmly in mind that there is an implicit ASCII encoding involved.

It's exactly to avoid the need for those kinds of mental gymnastics that Py3k will have a unified, encoding-agnostic data type for all character strings.

> I think that fact that favoring the coincidence of representation
> leads you to also deprecate the very natural use of the codec API to
> implement and understand base64 is indicative of a deep problem with
> the idea of implementing base64 as bytes->unicode.

Not sure I'm following you. I don't object to implementing base64 as a codec, only to exposing it via the same interface as the "real" unicode codecs like utf8, etc. I thought we were in agreement about that.

If you're thinking that the mere fact its input type is bytes and its output type is characters is going to lead to its mistakenly appearing via that interface, that would be a bug or design flaw in the mechanism that controls which codecs appear via that interface. It needs to be controlled by something more than just the input and output types.

Greg
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev