Thomas Wouters wrote:
> On Sat, Feb 18, 2006 at 12:06:37PM +0100, M.-A. Lemburg wrote:
> 
>> I've already explained why we have .encode() and .decode()
>> methods on strings and Unicode many times. I've also
>> explained the misunderstanding that can codecs only do
>> Unicode-string conversions. And I've explained that
>> the .encode() and .decode() method *do* check the return
>> types of the codecs and only allow strings or Unicode
>> on return (no lists, instances, tuples or anything else).
>>
>> You seem to ignore this fact.
> 
> Actually, I think the problem is that while we all agree the
> bytestring/unicode methods are a useful way to convert from bytestring to
> unicode and back again, we disagree on their *general* usefulness. Sure, the
> codecs mechanism is powerful, and even more so because they can determine
> their own returntype. But it still smells and feels like a Perl attitude,
> for the reasons already explained numerous times, as well:

It's by no means a Perl attitude.

The main reason is symmetry and the fact that strings and Unicode
should be as similar as possible in order to simplify the task of
moving from one to the other.

>  - The return value for the non-unicode encodings depends on the value of
>    the encoding argument.

Not really: you'll always get a basestring instance.

>  - The general case, by and large, especially in non-powerusers, is to
>    encode unicode to bytestrings and to decode bytestrings to unicode. And
>    that is a hard enough task for many of the non-powerusers. Being able to
>    use the encode/decode methods for other tasks isn't helping them.

Agreed.

Still, I believe that this is an educational problem. There are
a couple of gotchas users will have to be aware of (and this is
unrelated to the methods in question):

* "encoding" always refers to transforming original data into
  a derived form

* "decoding" always refers to transforming a derived form of
  data back into its original form

* for Unicode codecs the original form is Unicode, the derived
  form is, in most cases, a string

As a result, if you want to use a Unicode codec such as utf-8,
you encode Unicode into a utf-8 string and decode a utf-8 string
into Unicode.

Encoding a string is only possible if the string itself is
original data, e.g. some data that is supposed to be transformed
into a base64 encoded form.

Decoding Unicode is only possible if the Unicode string itself
represents a derived form, e.g. a sequence of hex literals.

> That is why I disagree with the hypergeneralization of the encode/decode
> methods, regardless of the fact that it is a natural expansion of the
> implementation of codecs. Sure, it looks 'right' and 'natural' when you look
> at the implementation. It sure doesn't look natural, to me and to many
> others, when you look at the task of encoding and decoding
> bytestrings/unicode.

That's because you only look at one specific task.

Codecs also unify the various interfaces to common encodings
such as base64, uu or zip which are not Unicode related.

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Feb 18 2006)
>>> Python/Zope Consulting and Support ...        http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ...             http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...        http://python.egenix.com/
________________________________________________________________________

::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Reply via email to