Re: [Python-3000] email libraries: use byte or unicode strings?

Stephen J. Turnbull Thu, 06 Nov 2008 03:56:42 -0800

Glenn Linderman writes:

 > There is no reference to the word emacs or types in any of the messages 
 > you've posted in this thread, maybe you are referring to another thread 
 > somewhere?  Sorry, I'm new to this party, but I have read the whole 
 > thread... unless my mail reader has missed part of it.

I'm sorry, you are right; the relevant message was never sent.  Here
it is; I've looked it over briefly and it seems intelligible, but from
your point of view it may seem out of context now.

Glenn Linderman writes:

 > This is where you use the Latin-1 conversion.  Don't throw an error
 > when it doesn't conform, but don't go to heroic efforts to provide
 > bytes alternatives... just convert the bytes to Unicode, and the
 > way the mail RFCs are written, and the types of encodings used, it
 > is mostly readable.  And if it isn't encoded, it is even more
 > readable.

This is what XEmacs/Mule does.  It's a PITA for everybody (except the
Mule implementers, whose life is dramatically simplified by punting
this way).  For one thing, what's readable to a human being may be
death to a subprogram that expects valid MIME.  GNU Emacs is even
worse; it does provide both a bytes-like type and a unicode-like type,
but then it turns around and provides a way to "cast" unicodes to
bytes and vice-versa, thus exposing implementation in an unclean (and
often buggy) way.
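
For concreteness, here is a minimal sketch of the Latin-1 punt in
Python 3 terms.  The sample bytes are an assumption chosen for
illustration (UTF-8 text arriving without a usable charset label);
nothing here is taken from the email module itself.

```python
# Bytes that are really UTF-8 ("é" is 0xC3 0xA9), but we pretend the
# charset label is missing or wrong.
raw = "café".encode("utf-8")

# Latin-1 maps every byte 0x00-0xFF to the code point with the same
# value, so this decode cannot fail, whatever the bytes really are.
punted = raw.decode("latin-1")

# The result is "mostly readable" to a human, but it is not the text
# the sender wrote: each non-ASCII character has become two.
print(repr(punted))

# The punt is reversible only while the string is kept to itself.
assert punted.encode("latin-1") == raw
```

The decode never raises, which is the attraction; the cost is that the
result is an ordinary str that nothing distinguishes from properly
decoded text.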

 > And so how much is it a problem?  What are the effects of the problem?

In Emacs, the problem is that strings that are punted get concatenated
with strings that are properly decoded, and when reencoding is
attempted, you get garbage or a coding error.  Since Mule discarded
the type (punt vs. decode) information, the app loses.  There's no way
to recover.  The apps most at risk are things like MUAs (which Emacs
does well), web browsers (which it doesn't), and even AUCTeX (a mode
for handling LaTeX documents; TeX is not Unicode-aware, so its error
messages are frequently truncated in the middle of a UTF-8
character).  These apps go to great lengths to keep track of what is
valid and what is not.  They don't always succeed.  I think Emacs
should be doing this for them, somehow (and I'm an XEmacs implementer,
not an MUA implementer!).

The situation in Python will be strongly analogous, I believe.
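
A rough sketch of how the analogous failure could look in Python 3,
assuming one string was punted through Latin-1 and another was decoded
properly.  The sample strings are made up for illustration.

```python
raw = "café".encode("utf-8")
punted = raw.decode("latin-1")      # punted: bytes smuggled into a str
decoded = "日本語"                    # properly decoded text

# Concatenation works, and nothing records which part was punted.
mixed = punted + " / " + decoded

# Undoing the punt now fails: the decoded part is not Latin-1.
try:
    mixed.encode("latin-1")
    latin1_failed = False
except UnicodeEncodeError:
    latin1_failed = True

# Encoding as UTF-8 succeeds, but the punted part is encoded a second
# time, so the original bytes no longer appear in the output.
reencoded = mixed.encode("utf-8")
assert latin1_failed
assert raw not in reencoded
```

So the two outcomes are the same as in Emacs: a coding error with one
codec, silent garbage with the other.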

 > I'm not suggesting making it worse than what it already is, in
 > bytes form; just to translate the bytes to Unicode codepoints so
 > that they can be returned on a Unicode interface.

Which *does* make it worse, unless you enforce a type difference so
that punted strings can't be mixed with decoded strings without
effort.  That type difference may as well be bytes vs. Unicode as some
subclass of Unicode vs. Unicode.
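
To make "enforce a type difference" concrete, here is a minimal sketch
of the subclass option.  PuntedStr and to_bytes are hypothetical names
invented for this example, not a proposed API, and the class only
guards the + operator; str.join, formatting and the like would need
the same treatment.

```python
class PuntedStr(str):
    """Hypothetical marker type for text produced by a Latin-1 punt."""

    def __add__(self, other):
        # Raise rather than return NotImplemented: returning
        # NotImplemented would fall back to plain str concatenation.
        if not isinstance(other, PuntedStr):
            raise TypeError("cannot mix punted and decoded text")
        return PuntedStr(str.__add__(self, other))

    def __radd__(self, other):
        # Reached for "decoded" + punted, since str has no __radd__.
        raise TypeError("cannot mix punted and decoded text")

    def to_bytes(self):
        # The explicit "effort": recover the original bytes.
        return self.encode("latin-1")


punted = PuntedStr(b"caf\xc3\xa9".decode("latin-1"))
both_punted = punted + PuntedStr("!")    # allowed, stays PuntedStr

try:
    punted + " / decoded"
    mixing_blocked = False
except TypeError:
    mixing_blocked = True
```

Python 3 already gives this behaviour for free with bytes vs. str:
b"x" + "y" raises TypeError, which is one reason the plain bytes type
may be the simpler choice.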

"Why would you mix strings?"  Well, for one example there are multiple
address headers which get collected into an addressee list for the
purpose of constructing a reply.  If one of the headers is broken and
another
is not, you get mixed mode.  The same thing can happen for
multilingual message bodies: they get split into a multipart with
different charsets for different parts, and if one is broken but
another is not, you get mixed mode.
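
The address-header case can be sketched as follows.  The header values,
the addresses, and the collect_addressees helper are all hypothetical;
the point is only that the application has to keep the broken values
apart itself if the library does not.

```python
# Hypothetical raw headers: (name, raw bytes, declared charset).
headers = [
    ("To", b"Ren\xc3\xa9e <renee@example.org>", "utf-8"),   # valid UTF-8
    ("Cc", b"Jos\xe9 <jose@example.org>", "utf-8"),         # really Latin-1
]


def collect_addressees(headers):
    """Split headers into decoded text and undecodable raw bytes."""
    good, broken = [], []
    for _name, raw, charset in headers:
        try:
            good.append(raw.decode(charset))
        except UnicodeDecodeError:
            broken.append(raw)      # keep as bytes, do not punt
    return good, broken


good, broken = collect_addressees(headers)
print(good)
print(broken)
```

With a Latin-1 punt instead of the except branch, both values would
land in the same list of str and the reply would be built from mixed
input.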

 > So they'll use the Unicode API for text, and the bytes APIs for binary 
 > attachments, because that is what is natural.

Well, as I see it there won't be bytes APIs for text.  The APIs will
return Unicode text if they succeed, and raise an error if not.  If
the error is caught, the offending object will be available as bytes.
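
A minimal sketch of that behaviour, assuming a hypothetical helper
named header_text; it is not part of the email module.  It relies only
on the standard behaviour of bytes.decode, which is strict by default,
and on the object attribute of UnicodeDecodeError, which holds the
bytes that failed to decode.

```python
def header_text(raw, charset):
    """Hypothetical API: return str, or raise UnicodeDecodeError."""
    return raw.decode(charset)      # errors="strict" is the default


def describe(raw, charset):
    """Show what a caller sees on success and on failure."""
    try:
        return ("text", header_text(raw, charset))
    except UnicodeDecodeError as exc:
        # The offending object is available on the exception as bytes.
        return ("bytes", exc.object)


ok = describe(b"caf\xc3\xa9", "utf-8")      # valid UTF-8
bad = describe(b"caf\xe9", "utf-8")         # Latin-1 mislabeled as UTF-8
print(ok)
print(bad)
```

The caller always knows which kind of object it holds, because the two
cases arrive as different types by different routes.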

 > If improperly encoded messages are received, and appropriate 
 > transliterations are made so that the bytes get converted (default code 
 > page) or passed through (Latin-1 transformation), then the data may be 
 > somewhat garbled for characters in the non-ASCII subset.  But that is 
 > not different than the handling done by any 8-bit email client, nor, I 
 > suspect (a little uncertainty here) different than the handling done by 
 > Python < 3.0 mail libraries.

Which is exactly how we got to this point.  Experience with GNU
Mailman and other such applications indicates that the implementation
in the existing Python email module needs work, and Barry Warsaw and
others who have tried to work on it say that it's not that easy, and
that the API may need to change to accommodate needed changes in the
implementation.

