On Sat, 18 Feb 2006 23:33:15 +0100, Thomas Wouters <[EMAIL PROTECTED]> wrote:

>On Sat, Feb 18, 2006 at 01:21:18PM +0100, M.-A. Lemburg wrote:
>
[...]
>> >  - The return value for the non-unicode encodings depends on the value of
>> >    the encoding argument.
>
>> Not really: you'll always get a basestring instance.
>
But actually basestring is a weird graft of semantic apples and empty bags, IMO.
unicode is essentially an abstract character vector type, and str is an
abstract binary octet vector type that has nothing to do with characters
except by inferential association with an encoding.
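In Python 3 terms (where str took over unicode's role as the character vector
and bytes took over str's role as the octet vector), that distinction is
enforced rather than inferred; a minimal sketch:

```python
# Python 3 makes the two abstract types explicit: str is the character
# vector, bytes is the octet vector, and they never mix implicitly.
text = "Löwis"                    # abstract character vector
octets = text.encode("utf-8")     # abstract octet vector
print(type(text).__name__)        # str
print(type(octets).__name__)      # bytes
# The octets mean characters only via the encoding's identity:
print(octets.decode("utf-8"))     # Löwis
```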

>Which is not a particularly useful distinction, since in any real world
>application, you have to be careful not to mix unicode with (non-ascii)
>bytestrings. The only way to reliably deal with unicode is to have it
>well-contained (when migrating an application from using bytestrings to
>using unicode) or to use unicode everywhere, decoding/encoding at
>entrypoints. Containment is hard to achieve.
>
>> Still, I believe that this is an educational problem. There are
>> a couple of gotchas users will have to be aware of (and this is
>> unrelated to the methods in question):
>> 
>> * "encoding" always refers to transforming original data into
>>   a derived form
ISTM encoding separates type information from the source and sets it aside
as the identity of the encoding, and renders the data in a composite of
more primitive types, octets being the most primitive short of bits.

>> 
>> * "decoding" always refers to transforming a derived form of
>>   data back into its original form
Decoding of a composite of primitives requires additional separate information
(namely identification of the encoding) to create a higher composite type.
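That dependence on separate external information can be seen by decoding the
same octets under two different encoding identities (a Python 3 sketch):

```python
# The same octet vector yields different character vectors depending on
# which encoding identity accompanies it.
octets = b"L\xf6wis"
print(octets.decode("latin-1"))   # Löwis
print(octets.decode("cp1251"))    # a different character in place of 0xf6
```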
>> 
>> * for Unicode codecs the original form is Unicode, the derived
>>   form is, in most cases, a string
You mean a str instance, right? One where the original character-vector type
information is gone. That's not a string in the sense of a character string.
>> 
>> As a result, if you want to use a Unicode codec such as utf-8,
>> you encode Unicode into a utf-8 string and decode a utf-8 string
>> into Unicode.
s/string/str instance/
>> 
>> Encoding a string is only possible if the string itself is
>> original data, e.g. some data that is supposed to be transformed
>> into a base64 encoded form.
Note what base64 really is for: its essence is to create a _character_ sequence
which can succeed in being encoded as ascii. The concept of base64 going
str->str is really a mental shortcut for s_str.decode('base64').encode('ascii'),
where 3 octets are decoded as code for 4 characters, modulo padding logic.
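The claim that base64 yields characters guaranteed to survive ascii encoding
is easy to check with the stdlib base64 module (Python 3 spelling; a sketch):

```python
import base64

octets = b"\x00\xff\x10"            # arbitrary binary octets
b64 = base64.b64encode(octets)      # 4 base64 characters per 3 octets
text = b64.decode("ascii")          # always succeeds: base64's whole point
print(text)                         # AP8Q
assert base64.b64decode(text) == octets   # round-trips losslessly
```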

>> 
>> Decoding Unicode is only possible if the Unicode string itself
>> represents a derived form, e.g. a sequence of hex literals.
Again, it's an abbreviation, e.g.
    print u'4cf6776973'.encode('hex_chars_to_octets').decode('latin-1')
should print Löwis ('hex_chars_to_octets' being a hypothetical codec name here).
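That abbreviation can be spelled runnably in Python 3, with bytes.fromhex
playing the role of the hypothetical hex_chars_to_octets codec:

```python
# hex characters -> octets -> characters, step by step
octets = bytes.fromhex("4cf6776973")   # b'L\xf6wis'
print(octets.decode("latin-1"))        # Löwis
```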

>
>Most of these gotchas would not have been gotchas had encode/decode only
>been usable for unicode encodings.
>
>> > That is why I disagree with the hypergeneralization of the encode/decode
>> > methods
>[..]
>> That's because you only look at one specific task.
>
>> Codecs also unify the various interfaces to common encodings
>> such as base64, uu or zip which are not Unicode related.
I think the trouble is that these view the transformations as octets->octets,
whereas IMO decoding should always result in a container type that knows what
it is semantically, without association with external use-this-codec
information. IOW,

octets.decode('zip') -> archive
archive.encode('bzip') -> octets

You could even subclass octet to make an archive type that knows it's an octet
vector representing a decoded zip, so it can have an encode method that could
(specifying 'zip' again) encode itself back to the original zip, or an
alternate method to encode itself as something else, which you couldn't do
from plain octets without specifying both transformations at once (hence the
.recode idea, but I don't think that is as pure). The constructor for the
container type could also be used, like Archive(octets, 'zip'), analogous to
unicode('abc', 'ascii').

IOW 
    octets + decoding info -> container type instance
    container type instance + encoding info -> octets
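A minimal Python 3 sketch of that scheme, with zlib standing in for 'zip'/'bzip'
and the names Archive, decode_from, and encode_to entirely hypothetical:

```python
import zlib

class Archive(bytes):
    """Hypothetical container: octets that remember they are a decoded
    payload, so they can re-encode themselves without both
    transformations having to be specified at once."""

    @classmethod
    def decode_from(cls, octets, codec):
        # octets + decoding info -> container type instance
        if codec == "zlib":
            return cls(zlib.decompress(octets))
        raise LookupError("unknown codec: %s" % codec)

    def encode_to(self, codec):
        # container type instance + encoding info -> octets
        if codec == "zlib":
            return zlib.compress(bytes(self))
        raise LookupError("unknown codec: %s" % codec)

wire = zlib.compress(b"hello world")
archive = Archive.decode_from(wire, "zlib")
print(bytes(archive))                       # b'hello world'
assert zlib.decompress(archive.encode_to("zlib")) == b"hello world"
```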
>
>No, I think you misunderstand. I object to the hypergeneralization of the
>*encode/decode methods*, not the codec system. I would have been fine with
>another set of methods for non-unicode transformations. Although I would
>have been even more fine if they got their encoding not as a string, but as,
>say, a module object, or something imported from a module.
>
>Not that I think any of this matters; we have what we have and I'll have to
>live with it ;)
Probably.
BTW, you may notice I'm saying octet instead of bytes. I have another post on
that, arguing that the basic binary information type should be octet, since
binary files are made of octets that have no intrinsic numerical or character
significance. See the other post if interested ;-)

Regards,
Bengt Richter

_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev