On Fri, 17 Feb 2006 20:33:16 -0800, Josiah Carlson <[EMAIL PROTECTED]> wrote:

>
>Greg Ewing <[EMAIL PROTECTED]> wrote:
>> 
>> Stephen J. Turnbull wrote:
>> >>>>>>"Guido" == Guido van Rossum <[EMAIL PROTECTED]> writes:
>> 
>> >     Guido> - b = bytes(t, enc); t = text(b, enc)
>> > 
>> > +1  The coding conversion operation has always felt like a constructor
>> > to me, and in this particular usage that's exactly what it is.  I
>> > prefer the nomenclature to reflect that.
>> 
>> This also has the advantage that it completely
>> avoids using the verbs "encode" and "decode"
>> and the attendant confusion about which direction
>> they go in.
>> 
>> e.g.
>> 
>>    s = text(b, "base64")
>> 
>> makes it obvious that you're going from the
>> binary side to the text side of the base64
>> conversion.
>
>But you aren't always getting *unicode* text from the decoding of bytes,
>and you may be encoding bytes *to* bytes:
>
>    b2 = bytes(b, "base64")
>    b3 = bytes(b2, "base64")
>
>Which direction are we going again?
Well, base64 is probably not your best example, because it necessarily
involves characters ;-)

If you are using "base64" you are looking at characters in your input to
produce your bytes output. The only way you can see characters in bytes input
is to decode them. So you are hiding your assumption about b's encoding.

You can make useful rules of inference from type(b), but with bytes you really
don't know. "base64" has to interpret b bytes as characters, because that's what
it needs to recognize base64 characters, to produce the output bytes.

The characters in b could be encoded in plain ascii or in utf16le; you have
to know which. So for utf16le it should be

     b2 = bytes(text(b, 'utf16le'), "base64")

Just because you assume an implicit

     b2 = bytes(text(b, 'ascii'), "base64")

doesn't make it so in general. Even if you build that assumption in,
it's not really true that you are going "bytes *to* bytes" without characters
involved when you do bytes(b, "base64"). You have just left undocumented an
API restriction (assert <bytes input is an ascii encoding of base64
characters>) and an implementation optimization ;-)
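
To make the hidden assumption concrete, here is a minimal sketch using the
standard base64 module of Python 3 rather than the proposed bytes(...) and
text(...) constructors. The helper name bytes_from_base64 is hypothetical, and
validate=True is chosen so that bytes which are not ascii-encoded base64
characters are rejected instead of being silently skipped.

```python
import base64
import binascii


def bytes_from_base64(raw: bytes, charset: str) -> bytes:
    """Hypothetical helper: spell out the decode step before base64-decoding."""
    chars = raw.decode(charset)  # bytes -> characters, with the encoding named
    return base64.b64decode(chars, validate=True)  # characters -> bytes


payload = b"\x00\x01binary\xff"
b64_chars = base64.b64encode(payload).decode("ascii")

# The same base64 characters stored in two different byte encodings.
as_ascii = b64_chars.encode("ascii")
as_utf16le = b64_chars.encode("utf-16-le")

# Naming the encoding recovers the payload either way.
assert bytes_from_base64(as_ascii, "ascii") == payload
assert bytes_from_base64(as_utf16le, "utf-16-le") == payload

# Passing bytes directly relies on the undocumented "input is ascii" restriction.
try:
    base64.b64decode(as_utf16le, validate=True)
    hidden_assumption_held = True
except binascii.Error:
    hidden_assumption_held = False
```

The direct call only works when b happens to be ascii; the utf16le input fails
until the character decoding is written out.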

<rant>
This is the trouble with str.encode and unicode.decode. They both hide implicit
decodes and encodes respectively. They should be banned IMO. Let people spell
it out and maybe understand what they are doing.
</rant>
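
A rough emulation of those hidden steps, written in Python 3 syntax where
bytes and str stand in for Python 2's str and unicode. The function names are
hypothetical, and the sketch assumes the default encoding is ascii and that a
character codec (not a bytes-to-bytes one such as base64) is being used.

```python
def py2_str_encode(raw: bytes, encoding: str) -> bytes:
    """What str.encode(encoding) effectively does: implicit decode, then encode."""
    return raw.decode("ascii").encode(encoding)


def py2_unicode_decode(chars: str, encoding: str) -> str:
    """What unicode.decode(encoding) effectively does: implicit encode, then decode."""
    return chars.encode("ascii").decode(encoding)


# Pure ascii data passes through the hidden step unnoticed.
utf16 = py2_str_encode(b"abc", "utf-16-le")

# Non-ascii data fails in the step the caller never wrote.
try:
    py2_str_encode(b"\xe9", "utf-8")
    encode_error = None
except UnicodeDecodeError as exc:  # a *decode* error from an encode call
    encode_error = type(exc).__name__

try:
    py2_unicode_decode("\xe9", "utf-8")
    decode_error = None
except UnicodeEncodeError as exc:  # an *encode* error from a decode call
    decode_error = type(exc).__name__
```

The error raised names the opposite operation from the one requested, which is
where the confusion comes from.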

OTOH, a bytes-to-bytes codec might be decompressing tgz into tar. For
conceptual consistency, one might define a 'bytes' encoding that conceptually
turns bytes into unicode byte characters and vice versa. Then "gunzip" can
decode bytes, producing unicode characters which are then encoded back to
bytes from the unicode ;-) The 'bytes' encoding would numerically be just like
latin-1 except on the unicode side it would have wrapped-bytes internal
representation.

    b_tar = bytes(text(b_tgz, 'gunzip'), 'bytes')

Of course, text(b_tgz, 'gunzip') would produce unicode text with a special
internal representation that just wraps bytes, though they are true unicode.
The 'bytes' codec encode would of course just unwrap the internal bytes
representation, but it would conceptually be an encoding into bytes.
bytes(t, 'latin-1') would produce the same output from the wrapped-bytes
unicode.
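
As a sketch of that round trip in Python 3 terms: no 'bytes' or 'gunzip' codec
is assumed to exist, so latin-1 stands in for the 'bytes' encoding (they are
numerically the same, as noted above) and gzip.decompress stands in for the
'gunzip' decode. Both function names are hypothetical, and there is no special
wrapped-bytes representation here, just an ordinary str.

```python
import gzip


def text_gunzip(b_gz: bytes) -> str:
    """Stand-in for text(b_gz, 'gunzip'): decompress, then view bytes as characters."""
    return gzip.decompress(b_gz).decode("latin-1")


def bytes_unwrap(t: str) -> bytes:
    """Stand-in for bytes(t, 'bytes'): map each character U+0000..U+00FF to one byte."""
    return t.encode("latin-1")


original = bytes(range(256))      # every possible byte value
b_gz = gzip.compress(original)

t = text_gunzip(b_gz)             # 256 characters, one per byte
restored = bytes_unwrap(t)        # back to the uncompressed bytes
```

Every byte value survives because latin-1 maps the 256 byte values one to one
onto the first 256 code points.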

Sometimes conceptual purity can clarify things and sometimes it's just
another confusing description.

Regards,
Bengt Richter
