Stephen J. Turnbull wrote:

> Please define "character," and explain how its semantics map to
> Python's unicode objects.

One of the 65 abstract entities referred to in the RFC and represented in that RFC by certain visual glyphs. There is a subset of the Unicode code points that are conventionally associated with very similar glyphs, so that there is an obvious one-to-one mapping between these entities and those Unicode code points. These entities therefore have a natural and obvious representation using Python unicode strings.

> No, base64 isn't a wire protocol. Rather, it's a schema for a family
> of wire protocols, whose alphabets are heuristically chosen on the
> assumption that code units which happen to correspond to alpha-numeric
> code points in a commonly-used coded character set are more likely to
> pass through a communication channel without corruption.

Yes, and it's up to the programmer to choose those code units (i.e. pick an encoding for the characters) that will, in fact, pass through the channel he is using without corruption. I don't see how any of this is inconsistent with what I've said.

> Only if you do no transformations that will harm the base64-encoding.
> ... It doesn't allow any of the
> usual transformations on characters that might be applied globally to
> a mail composition buffer, for example.

I don't understand that. Obviously if you rot13 your mail message or turn it into pig latin or something, it's going to mess up any base64 it might contain. But that would be a silly thing to do to a message containing base64.

Given any piece of text, there are things it makes sense to do with it and things it doesn't, depending entirely on the use to which the text will eventually be put. I don't see how base64 is any different in this regard.

> So then you bring it right back in with base64. Now they need to know
> about bytes<->unicode codecs.

No, they need to know about the characteristics of the channel over which they're sending the data.
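
As a rough illustration of the distinction between base64 characters and the code units chosen for a channel, here is a minimal sketch. It assumes the `base64` module as it exists in Python 3, where `b64encode` returns bytes that can be decoded as ASCII to obtain the characters as a `str`. The names `payload`, `text` and `wire_forms` and the three encodings are picked purely for illustration.

```python
import base64

# Arbitrary binary data to be carried over a text channel (illustrative).
payload = bytes([0, 255, 16, 32, 127])

# The base64 "characters", held as a unicode string.
text = base64.b64encode(payload).decode("ascii")

# The same characters become different code units depending on the
# encoding the programmer picks for the channel.
wire_forms = {
    encoding: text.encode(encoding)
    for encoding in ("ascii", "utf-16-le", "cp037")
}

# Whatever the encoding, decoding the wire form with the same encoding
# gives back the same characters, and hence the same payload.
for encoding, wire in wire_forms.items():
    recovered = base64.b64decode(wire.decode(encoding))
    assert recovered == payload
```

The three wire forms differ byte for byte, yet each one round-trips to the same text, which is why the choice of encoding depends on the channel and not on base64 itself.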

Base64 is designed for situations in which you have a *text* channel that you know is capable of transmitting at least a certain subset of characters, where "character" means whatever is used as input to that channel.

In Py3k, text will be represented by unicode strings. So a Py3k text channel should take unicode as its input, not bytes.

I think we've got a bit sidetracked by talking about MIME. I wasn't actually thinking about MIME, but just a plain text message into which some base64 data was being inserted. That's the way we used to do things in the old days with uuencode etc., before MIME was invented.

Here, the "channel" is NOT the socket or whatever that the ultimate transmission takes place over -- it's the interface to your mail sending software that takes a piece of plain text and sends it off as a mail message somehow.

In Py3k, if a channel doesn't take unicode as input, then it's not a text channel, and it's not appropriate to be using base64 with it directly. It might be appropriate to use base64 followed by some encoding, but the programmer needs to be aware of that and choose the encoding wisely. It's not possible to shield him from having to know about encodings in that situation, even if the encoding is just ascii. Trying to do so will just lead to more confusion, in my opinion.

Greg