On Sat, 18 Feb 2006 23:33:15 +0100, Thomas Wouters <[EMAIL PROTECTED]> wrote:
>On Sat, Feb 18, 2006 at 01:21:18PM +0100, M.-A. Lemburg wrote:
> [...]
>> > - The return value for the non-unicode encodings depends on the value of
>> > the encoding argument.
>
>> Not really: you'll always get a basestring instance.

But actually basestring is a weird graft of semantic apples and empty bags,
IMO. unicode is essentially an abstract character vector type, and str is an
abstract binary octet vector type, having nothing to do with characters
except by inferential association with an encoding.

>Which is not a particularly useful distinction, since in any real world
>application, you have to be careful not to mix unicode with (non-ascii)
>bytestrings. The only way to reliably deal with unicode is to have it
>well-contained (when migrating an application from using bytestrings to
>using unicode) or to use unicode everywhere, decoding/encoding at
>entrypoints. Containment is hard to achieve.
>
>> Still, I believe that this is an educational problem. There are
>> a couple of gotchas users will have to be aware of (and this is
>> unrelated to the methods in question):
>>
>> * "encoding" always refers to transforming original data into
>>   a derived form

ISTM encoding separates type information from the source, sets it aside as
the identity of the encoding, and renders the data in a composite of more
primitive types, octets being the most primitive short of bits.

>>
>> * "decoding" always refers to transforming a derived form of
>>   data back into its original form

Decoding a composite of primitives requires additional, separate information
(namely, identification of the encoding) to create the higher composite type.

>>
>> * for Unicode codecs the original form is Unicode, the derived
>>   form is, in most cases, a string

You mean a str instance, right? One where the original type as character
vector is gone. That's not a string in the sense of a character string.
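The original-vs-derived asymmetry above can be made concrete. This is a
Python 3 sketch (the post discusses Python 2, where 2.x unicode maps to 3.x
str and 2.x str maps to 3.x bytes):

```python
# str plays the role of the abstract character vector (2.x unicode),
# bytes the abstract octet vector (2.x str).
text = "Löwis"                   # original form: characters
octets = text.encode("utf-8")    # derived form: octets; the codec identity
                                 # now lives *outside* the data
assert octets == b"L\xc3\xb6wis"

# Decoding requires that external identity to be supplied again.
assert octets.decode("utf-8") == text
```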
>>
>> As a result, if you want to use a Unicode codec such as utf-8,
>> you encode Unicode into a utf-8 string and decode a utf-8 string
>> into Unicode.

s/string/str instance/

>>
>> Encoding a string is only possible if the string itself is
>> original data, e.g. some data that is supposed to be transformed
>> into a base64 encoded form.

Note what base64 really is for: its essence is to create a _character_
sequence which can succeed in being encoded as ascii. The concept of base64
going str->str is really a mental shortcut for

    s_str.decode('base64').encode('ascii')

where 3 octets are decoded as code for 4 characters, modulo padding logic.

>>
>> Decoding Unicode is only possible if the Unicode string itself
>> represents a derived form, e.g. a sequence of hex literals.

Again, it's an abbreviation, e.g.

    print u'4cf6776973'.encode('hex_chars_to_octets').decode('latin-1')

should print

    Löwis

>
>Most of these gotchas would not have been gotchas had encode/decode only
>been usable for unicode encodings.
>
>> > That is why I disagree with the hypergeneralization of the encode/decode
>> > methods
>[..]
>> That's because you only look at one specific task.
>
>> Codecs also unify the various interfaces to common encodings
>> such as base64, uu or zip which are not Unicode related.

I think the trouble is that these view the transformations as octets->octets,
whereas IMO decoding should always result in a container type that knows what
it is semantically, without association with external use-this-codec
information. IOW,

    octets.decode('zip') -> archive
    archive.encode('bzip') -> octets

You could even subclass octets to make an archive type that knows it's an
octet vector representing a decoded zip, so it can have an encode method that
could (specifying 'zip' again) encode itself back to the original zip, or an
alternate method to encode itself as something else, which you couldn't do
from plain octets without specifying both transformations at once.
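The base64 and hex claims above can be checked with the Python 3 standard
library; the base64 module stands in for the 2.x str->str 'base64' codec,
and bytes.fromhex for the hypothetical 'hex_chars_to_octets' codec:

```python
import base64

# base64's essence: turn arbitrary octets into a *character* sequence
# that survives being encoded as ascii.
payload = b"\x00\xff\x10"
b64_text = base64.b64encode(payload).decode("ascii")  # characters, explicitly
assert b64_text == "AP8Q"
assert base64.b64decode(b64_text) == payload

# The hex-literals example: hex characters -> octets, then view those
# octets as latin-1 characters.
assert bytes.fromhex("4cf6776973").decode("latin-1") == "Löwis"
```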
(Hence the .recode idea, but I don't think that is as pure.) The constructor
for the container type could also be used, like

    Archive(octets, 'zip')

analogous to

    unicode('abc', 'ascii')

IOW,

    octets + decoding info -> container type instance
    container type instance + encoding info -> octets

>
>No, I think you misunderstand. I object to the hypergeneralization of the
>*encode/decode methods*, not the codec system. I would have been fine with
>another set of methods for non-unicode transformations. Although I would
>have been even more fine if they got their encoding not as a string, but
>as, say, a module object, or something imported from a module.
>
>Not that I think any of this matters; we have what we have and I'll have to
>live with it ;)

Probably. BTW, you may notice I'm saying octet instead of bytes. I have
another post on that, arguing that the basic binary information type should
be octet, since binary files are made of octets that have no intrinsic
numerical or character significance. See the other post if interested ;-)

Regards,
Bengt Richter
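The Archive(octets, 'zip') constructor idea might be sketched like this in
Python 3. Everything here is hypothetical illustration, not an existing API:
Archive and its codec table are invented for the sketch, and zlib stands in
for a real zip codec:

```python
import zlib

class Archive(bytes):
    """Hypothetical container type from the post: octets that remember
    which encoding they were decoded from (zlib stands in for 'zip')."""
    _codecs = {"zip": (zlib.decompress, zlib.compress)}

    def __new__(cls, octets, encoding):
        decode, _ = cls._codecs[encoding]            # octets + decoding info
        self = super().__new__(cls, decode(octets))  # -> container instance
        self.encoding = encoding
        return self

    def encode(self, encoding=None):
        # Knowing its own identity, the archive can re-encode itself back
        # to the original form without being told the codec again.
        _, encode = self._codecs[encoding or self.encoding]
        return encode(bytes(self))

raw = zlib.compress(b"hello world")
arc = Archive(raw, "zip")   # octets + decoding info -> container type
assert bytes(arc) == b"hello world"
assert zlib.decompress(arc.encode()) == b"hello world"
```

The point of the sketch is the asymmetry the post argues for: the codec name
is needed once, at construction, after which the instance carries its own
semantic identity.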
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com