P.J. Eby writes:

 > In Kagoshima, you'd pass in an ebytes with your encoding to a
 > stdlib API, and *get back an ebytes with the right encoding*,
 > rather than an (incorrect and useless) unicode object which has
 > lost data you need.

How does the stdlib do that, short of guessing which Japanese
encoding is in use?  And even if this ebytes uses Shift JIS, what
makes that the "right" encoding for anything?

On the other hand, I know when *I* need some encoding, and when I
figure it out I will store it in an appropriate place in my program.
The problem is that for some programs it is not unlikely that I will
see all of Shift JIS, EUC-JP, ISO-2022-JP, UTF-8, and UTF-16, and on a
very bad day, RFC 2047, GB 2312, and Big5, too, used to encode
Japanese.  It's not totally unlikely for a browser to send URLs to a
server expecting UTF-8 to recover a message/rfc822 object containing
ISO-2022-JP in the mail header and EUC-JP in the body.

So I need to know which encoding was used by the server that sent the
reply, but the ebytes can't tell me that if it fishes a URL in EUC-JP
out of the message body.  I need to convert that URL to UTF-8, or most
servers will 404.
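To make the URL case concrete, here is a minimal sketch; the EUC-JP
bytes are synthesized for the example, where a real program would have
fished them out of the message body:

```python
from urllib.parse import quote

# Pretend these EUC-JP bytes came out of a message body.
# (Synthesized here from the word 検索, "search", for the example.)
raw = '検索'.encode('euc-jp')

text = raw.decode('euc-jp')   # back to an encoding-free str
path = quote(text)            # quote() percent-encodes as UTF-8 by default
print(path)                   # %E6%A4%9C%E7%B4%A2
```

Percent-encoding the EUC-JP bytes directly, without the decode step,
is exactly what produces the 404s.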

 > But this is not the case at all, for use cases where "no, really, you 
 > *have to* work with bytes-encoded text streams".  The mere release of 
 > Python 3.x will not cause all the world's applications, libraries, 
 > and protocols to suddenly work with unicode, where they did not before.

Sure.  That's what .encode() and .decode() are for.  The problem is
what to do when you don't know what to put in the parentheses, and I
can't think of a use case offhand where ebytes(stuff,'garbage')
does better than PEP 383-enabled str for:

 > Being explicit about the encoding of the bytes you're flinging
 > around is actually an *increase* in specificity, explicitness,
 > robustness, and error-checking ability over the status quo for
 > either 2.x *or* 3.x...  *and* it improves these qualities for
 > essentially *all* string-handling code, without requiring that code
 > to be rewritten to do so.

A well-spoken piece.  But, you see, most of those encodings are *only*
interesting so that you can transcode characters to the encoding of
interest.  What's the e.o.i.?  That is easily found in the context or
has an obvious default, if you're lucky, or otherwise a hard problem
that ebytes does nothing to help solve as far as I can see.
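For reference, the PEP 383 round trip I have in mind looks like this
(the bytes are chosen arbitrarily for the sketch):

```python
# Bytes of unknown encoding survive a decode/encode round trip intact,
# thanks to the 'surrogateescape' error handler from PEP 383.
mystery = b'\xb8\xa1 plus some ascii'
s = mystery.decode('utf-8', 'surrogateescape')   # undecodable bytes -> lone surrogates
assert s.encode('utf-8', 'surrogateescape') == mystery
```

No encoding label is needed to carry the bytes around losslessly; the
label only matters at the point where you genuinely transcode.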

Cf. Robert Collins' post
<aanlktinq_d_vahbw5ikuyy9qgjqoffy4xczc0dyzt...@mail.gmail.com>, where
he makes it quite explicit that a bytes interface is all about punting
in the face of missing encoding information.

 > >and (2) you really want this under control of higher level objects
 > >that have access to some knowledge of the environment, rather than
 > >the lowest level.
 > 
 > This proposal actually has such a higher-level object: an 
 > ebytes.

I don't see how that can be true.  An ebytes is a very low-level
object that has no idea whether its encoding is interesting (e.g., the
one that an RFC or a server specifies) or merely a technical detail
that matters only until the ebytes is decoded, after which it can be
thrown away.

I just don't see, in the case where there is a real encoding in the
ebytes, what harm is done by decoding the ebytes to str.  If context
indicates that the encoding is an interesting one (e.g., it should be
the default encoding for output), then you want to save it in an
appropriate place that preserves not just the encoding itself, but the
context that gives it its importance.
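As a rough sketch of what "an appropriate place" might look like (the
Reply class and its fields are purely illustrative, not a proposal for
any API):

```python
from dataclasses import dataclass

@dataclass
class Reply:
    text: str          # decoded, encoding-free text
    encoding: str      # what the wire bytes were encoded in
    declared_by: str   # context: who told us (header, RFC default, guess)

def read_reply(payload: bytes, encoding: str, declared_by: str) -> Reply:
    # Decode at the boundary; keep the encoding and its provenance
    # alongside the str instead of welding them to the bytes.
    return Reply(payload.decode(encoding), encoding, declared_by)

r = read_reply('本文'.encode('iso-2022-jp'), 'iso-2022-jp',
               'Content-Type charset parameter')
print(r.text, r.encoding)   # 本文 iso-2022-jp
```

The point is that the *context* (here, declared_by) lives with the
encoding, which an ebytes by itself cannot carry.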

_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev