On Fri, 17 Feb 2006 20:33:16 -0800, Josiah Carlson <[EMAIL PROTECTED]> wrote:
>
>Greg Ewing <[EMAIL PROTECTED]> wrote:
>>
>> Stephen J. Turnbull wrote:
>> >>>>>>"Guido" == Guido van Rossum <[EMAIL PROTECTED]> writes:
>>
>> > Guido> - b = bytes(t, enc); t = text(b, enc)
>> >
>> > +1 The coding conversion operation has always felt like a constructor
>> > to me, and in this particular usage that's exactly what it is. I
>> > prefer the nomenclature to reflect that.
>>
>> This also has the advantage that it completely
>> avoids using the verbs "encode" and "decode"
>> and the attendant confusion about which direction
>> they go in.
>>
>> e.g.
>>
>> s = text(b, "base64")
>>
>> makes it obvious that you're going from the
>> binary side to the text side of the base64
>> conversion.
>
>But you aren't always getting *unicode* text from the decoding of bytes,
>and you may be encoding bytes *to* bytes:
>
> b2 = bytes(b, "base64")
> b3 = bytes(b2, "base64")
>
>Which direction are we going again?
Well, base64 is probably not your best example, because it necessarily involves
characters ;-)

If you are using "base64" you are looking at characters in your input to
produce your bytes output. The only way you can see characters in bytes input
is to decode them, so you are hiding an assumption about b's encoding. You can
make useful rules of inference from type(b), but with bytes you really don't
know. "base64" has to interpret the bytes of b as characters, because that's
how it recognizes the base64 characters it needs in order to produce the
output bytes. The characters in b could be encoded in plain ASCII, or in
UTF-16-LE; you have to know which.
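To make that concrete with what we have today (plain 2.x Python, not the
proposed API): the same eight base64 characters are very different bytes
depending on the character encoding, which is exactly the knowledge a
bytes-in base64 decoder is silently assuming:

    chars = u'SGVsbG8='                     # eight base64 characters
    print repr(chars.encode('ascii'))       # 'SGVsbG8='  -- 8 bytes
    print repr(chars.encode('utf-16-le'))   # 'S\x00G\x00V\x00s\x00...'  -- 16 bytes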
So for UTF-16-LE it should be

    b2 = bytes(text(b, 'utf16le'), "base64")

and just because you assume an implicit

    b2 = bytes(text(b, 'ascii'), "base64")

doesn't make it so in general. Even if you build that assumption in, it's not
really true that you are going "bytes *to* bytes" without characters involved
when you do bytes(b, "base64"). You have just left an API restriction
undocumented (assert <bytes input is an ascii encoding of base64 characters>)
along with an implementation optimization ;-)
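If you did want a bytes-to-bytes base64 helper today, that restriction is at
least worth spelling out. A rough sketch in current 2.x Python, using the
existing 'base64' codec (the helper name is just illustrative):

    def base64_bytes_to_bytes(b):
        # the otherwise-undocumented API restriction, made explicit
        assert max(ord(c) for c in b) < 128, \
            "input must be ASCII-encoded base64 characters"
        return b.decode('base64')

    print repr(base64_bytes_to_bytes('SGVsbG8='))   # 'Hello'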
<rant>
This is the trouble with str.encode and unicode.decode: they hide an implicit
decode or encode, respectively. They should be banned IMO. Let people spell it
out and maybe understand what they are doing.
</rant>
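The kind of thing I mean, with today's str.encode (2.x): calling .encode() on
a byte string does a hidden ASCII decode first, so you get a *decode* error
from a line that never asked for a decode:

    >>> s = '\xe9'                # a latin-1 encoded byte string
    >>> s.encode('utf-8')         # implicit ascii decode happens first
    Traceback (most recent call last):
      ...
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 0: ordinal not in range(128)
    >>> s.decode('latin-1').encode('utf-8')   # spelled out, it works
    '\xc3\xa9'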
OTOH, a bytes-to-bytes codec might be decompressing tgz into tar. For
conceptual consistency, one might define a 'bytes' encoding that conceptually
turns bytes into unicode byte characters and vice versa. Then "gunzip" could
decode bytes, producing unicode characters which are then encoded back to
bytes from the unicode ;-) The 'bytes' encoding would numerically be just like
latin-1, except that on the unicode side it would have a wrapped-bytes
internal representation.

    b_tar = bytes(text(b_tgz, 'gunzip'), 'bytes')

Of course, text(b_tgz, 'gunzip') would produce unicode text with a special
internal representation that just wraps bytes, though they would be true
unicode. The 'bytes' codec's encode would then just unwrap the internal bytes
representation, but it would conceptually be an encoding into bytes.
bytes(t, 'latin-1') would produce the same output from the wrapped-bytes
unicode.
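The numeric identity with latin-1 is easy to demonstrate with today's codecs;
a small sketch in 2.x Python, with ordinary unicode objects standing in for
the wrapped-bytes representation:

    raw = ''.join(chr(i) for i in range(256))   # every possible byte value, as a str
    as_text = raw.decode('latin-1')             # plays the role of text(raw, 'bytes')
    back = as_text.encode('latin-1')            # plays the role of bytes(as_text, 'bytes')
    assert back == raw                          # the round trip is lossless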
Sometimes conceptual purity can clarify things, and sometimes it's just
another confusing description.
Regards,
Bengt Richter