Josiah Carlson wrote:
> Ron Adam <[EMAIL PROTECTED]> wrote:
> Except that ambiguates it even further.
>
> Is encodings.tounicode() encoding, or decoding? According to everything
> you have said so far, it would be decoding. But if I am decoding binary
> data, why should it be spending any time as a unicode string? What do I
> mean?
Encoding and decoding are relative concepts. It's all encoding from one
thing to another. Whether it's "decoding" or "encoding" depends on the
relationship of the current encoding to a standard encoding.
The confusion introduced by "decode" arises when the 'default_encoding'
changes, will change, or is unknown.
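To make the "relative" point concrete, here is a small sketch in present-day Python 3 (not the Python 2 API under discussion). It uses only the standard base64 module and str.encode/bytes.decode; the variable names are just for illustration. Both steps are transformations between representations, and which direction gets called "encode" depends on which form is treated as the standard one.

```python
import base64

text = "hello"

# Text is the reference form here, so going to UTF-8 bytes is "encoding".
raw = text.encode("utf-8")

# Raw bytes are the reference form here, so going to base64 is "encoding" too.
b64 = base64.b64encode(raw)

# Each step has an inverse, and the inverse is what gets the "decode" label.
assert base64.b64decode(b64) == raw
assert raw.decode("utf-8") == text
```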
> x = f.read() #x contains base-64 encoded binary data
> y = encodings.to_unicode(x, 'base64')
>
> y now contains BINARY DATA, except that it is a unicode string
No, that wasn't what I was describing. You get a Unicode string object
as the result, not a bytes object with binary data. See the toy example
at the bottom.
> z = encodings.to_str(y, 'latin-1')
>
> Later you define a str_to_str function, which I (or someone else) would
> use like:
>
> z = str_to_str(x, 'base64', 'latin-1')
>
> But the trick is that I don't want some unicode string encoded in
> latin-1, I want my binary data unencoded. They may happen to be the
> same in this particular example, but that doesn't mean that it makes any
> sense to the user.
If you want bytes then you would use the bytes() type to get bytes
directly. Not encode or decode.
binary_unicode = bytes(unicode_string)
The exact byte order and representation would need to be decided by the
Python developers in this case. The internal representation,
'unicode-internal', is UCS-2, I believe.
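The byte-order question is easy to see with today's Python 3 codecs. This is only a sketch, not a statement about what bytes(unicode_string) would do: the same one-character string gives different bytes depending on the byte order chosen.

```python
s = "A"  # U+0041

little = s.encode("utf-16-le")  # low byte first
big = s.encode("utf-16-be")     # high byte first

print(little)  # b'A\x00'
print(big)     # b'\x00A'

# Same text, different byte sequences, so a conversion to bytes has to
# pick (or be told) one representation.
assert little != big
assert little.decode("utf-16-le") == big.decode("utf-16-be") == s
```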
>> It's no more ambiguous than any math
>> operation where you can do it one way with one operations and get your
>> original value back with the same operation by using an inverse value.
>>
>> n2=n+1; n3=n+(-1); n==n3
>> n2=n*2; n3=n*(.5); n==n3
>
> Ahh, so you are saying 'to_base64' and 'from_base64'. There is one
> major reason why I don't like that kind of a system: I can't just say
> encoding='base64' and use str.encode(encoding) and str.decode(encoding),
> I necessarily have to use, str.recode('to_'+encoding) and
> str.recode('from_'+encoding) . Seems a bit awkward.
Yes, but the encodings API could abstract out the 'to_base64' and
'from_base64' so you can just say 'base64' and have it work either way.
Maybe a toy "incomplete" example might help.
# in module bytes.py or someplace else.

class bytes(list):
    """
    bytes methods defined here
    """

# in module encodings.py

# using a dict of tuples, but other solutions would
# work just as well.
unicode_codecs = {
    'base64': ('from_base64', 'to_base64'),
    }

def tounicode(obj, from_codec):
    b = bytes(obj)
    b = b.recode(unicode_codecs[from_codec][0])
    return unicode(b)

def tostr(obj, to_codec):
    b = bytes(obj)
    b = b.recode(unicode_codecs[to_codec][1])
    return str(b)

# in your application
import encodings
... a bunch of code ...
u = encodings.tounicode(s, 'base64')
# or if going the other way
s = encodings.tostr(u, 'base64')
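
For anyone who wants to run something, here is one way the lookup idea above could be sketched in present-day Python 3. It is an illustration under assumptions only: the standard base64 module stands in for the hypothetical recode() method, the names CODECS, to_text and to_encoded are made up, and UTF-8 is assumed for the text form.

```python
import base64

# One codec name maps to a (from_codec, to_codec) pair of functions,
# mirroring the ('from_base64', 'to_base64') tuple in the toy example.
CODECS = {
    "base64": (base64.b64decode, base64.b64encode),
}


def to_text(data, codec):
    """Undo `codec` on bytes, then interpret the result as UTF-8 text."""
    from_codec = CODECS[codec][0]
    return from_codec(data).decode("utf-8")


def to_encoded(text, codec):
    """Turn text into UTF-8 bytes, then apply `codec` to those bytes."""
    to_codec = CODECS[codec][1]
    return to_codec(text.encode("utf-8"))


s = to_encoded("hello", "base64")
u = to_text(s, "base64")
assert u == "hello"
```

The caller only ever names 'base64'; the direction is picked by which function is called, not by a 'to_' or 'from_' prefix on the codec name.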
Does this help? Is the relationship between the bytes object and the
encodings API clearer here? If not, maybe we should discuss it further
offline.
Cheers,
Ronald Adam
_______________________________________________
Python-Dev mailing list
http://mail.python.org/mailman/listinfo/python-dev