Re: [Python-3000] email libraries: use byte or unicode strings?

Glenn Linderman Wed, 05 Nov 2008 21:34:24 -0800

On approximately 11/5/2008 4:24 PM, came the following characters from the keyboard of Andrew McNamara:

But I'm not at all clear on what you mean by a round-trip through the email module. Let me see... if you are creating an email, you (1) should encode it properly (2) a round-trip is mostly meaningless, unless you send it to yourself. So you probably mean email that is received, and that you want to send on. In this case, there is already a composed/encoded form of the email in hand; it could simply be sent as is without decoding or re-encoding. That would be quite a clean round-trip!

Imagine a mail proxy of some sort (SMTP or a list manager like Mailman) - you want to be able to parse a message, maybe make some minor changes
(such as adding a "Received:" header, or stripping out illegal MIME types)
and then emit something that differs from the original in only the ways
that you specified.

Sure. Add header, delete header APIs would suffice for this. The APIs could accept Unicode, but do bytes manipulations.
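
A minimal sketch of what such add-header and delete-header calls might look like, assuming CRLF line endings and a well-formed header block. The names `add_header` and `delete_header` are hypothetical and not part of the email module; they accept str arguments but only ever touch the message as bytes, so everything that is not explicitly changed comes out byte-for-byte identical.

```python
def _split(raw: bytes):
    """Split a raw message into (header block, blank line + body)."""
    idx = raw.find(b"\r\n\r\n")
    if idx < 0:
        return raw, b""          # headers only, no body
    return raw[:idx + 2], raw[idx + 2:]


def add_header(raw: bytes, name: str, value: str) -> bytes:
    """Prepend a header line. Assumption: name and value are pure ASCII."""
    return f"{name}: {value}\r\n".encode("ascii") + raw


def delete_header(raw: bytes, name: str) -> bytes:
    """Drop every header called `name`, including folded continuation lines."""
    head, rest = _split(raw)
    prefix = name.encode("ascii").lower() + b":"
    kept, skipping = [], False
    for line in head.splitlines(keepends=True):
        if line[:1] in (b" ", b"\t"):
            # Continuation line: it shares the fate of the header it belongs to.
            if not skipping:
                kept.append(line)
            continue
        skipping = line.lower().startswith(prefix)
        if not skipping:
            kept.append(line)
    return b"".join(kept) + rest


raw = (b"From: a@example.com\r\n"
       b"X-Junk: one\r\n two\r\n"
       b"Subject: caf\xe9\r\n"       # improperly encoded, left untouched
       b"\r\n"
       b"body \xff\r\n")
out = delete_header(add_header(raw, "Received", "from relay.example by mx.example"), "X-Junk")
```

The badly encoded Subject header and the body are never decoded, so the output differs from the input only in the ways that were asked for.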

Another example - imagine what a mail transport agent does with bounces:
it wraps them in a MIME wrapper, but otherwise changes the structure
as little as possible (because that would make later analysis of the
bounce problematic).

So they usually truncate the size too, to 10K or less. Enough to get all the headers. Some only send headers back. So it is no problem. A "retrieve headers in binary from message" API, followed by "add this chunk of binary as a MIME part" to the new bounce message under construction. The first could be replaced by "retrieve message as bytes" and "substr", as an alternative. So yes, some bytes APIs are necessary for binary MIME parts and the whole message (as I mentioned before), and there may be a few other special cases. But mostly, just Unicode.
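
The two bounce steps could be sketched roughly as follows. This uses `email.message.EmailMessage` from the email package in current Python 3, which postdates this thread; `header_bytes`, `make_bounce`, the addresses, and the choice of `application/octet-stream` for the attached part are all assumptions made for illustration.

```python
from email.message import EmailMessage

MAX_BOUNCE_BYTES = 10 * 1024   # the "10K or less" figure mentioned above


def header_bytes(raw: bytes) -> bytes:
    """'Retrieve headers in binary': everything before the first blank line."""
    cuts = [i for i in (raw.find(b"\r\n\r\n"), raw.find(b"\n\n")) if i >= 0]
    return raw[:min(cuts)] if cuts else raw


def make_bounce(raw: bytes, sender: str, recipient: str) -> EmailMessage:
    """Build a new message and 'add this chunk of binary as a MIME part'."""
    bounce = EmailMessage()
    bounce["From"] = sender
    bounce["To"] = recipient
    bounce["Subject"] = "Undeliverable mail"
    bounce.set_content("Your message could not be delivered.")   # text: str API
    bounce.add_attachment(header_bytes(raw)[:MAX_BOUNCE_BYTES],  # binary: bytes API
                          maintype="application", subtype="octet-stream")
    return bounce


raw = b"From: a@example.com\r\nSubject: caf\xe9\r\n\r\nbody text\r\n"
bounce = make_bounce(raw, "mailer-daemon@mx.example", "a@example.com")
attached = next(bounce.iter_attachments()).get_content()
```

The original headers are carried as an opaque chunk of bytes, so the bounce wrapper never has to interpret them.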

Notice that I said _nothing_ about the underlying processing in my comments, only the API. I fully agree that some, perhaps most, of the underlying processing has to be aware of bytes, and use and manipulate bytes.

The bytes API has to be accessible - there are many contexts in which
you need to work at this level.

Maybe. I named a couple, you've named another, maybe there are a few more. The only reason not to have a full bytes API is just the effort to support it... if that can reasonably be avoided, why not? But I doubt there are a lot of cases that _must_ be handled as bytes, and so if we can identify the ones that indeed, must be, and supply them, the rest can be Unicode.

Indeed, the headers must be ASCII, and once encoded, the header body is also.


Except when they're not. It's not uncommon in mail handling to get a
valid message that doesn't conform to the specs (not just spam). You can
either throw your hands up in the air and declare it irredeemably broken,
or do your best to extract meaning from it. Invariably, it's the CEO's
best mate who sent the malformed message, so you process it or find a
new job.

This is where you use the Latin-1 conversion. Don't throw an error when it doesn't conform, but don't go to heroic efforts to provide bytes alternatives... just convert the bytes to Unicode, and the way the mail RFCs are written, and the types of encodings used, it is mostly readable. And if it isn't encoded, it is even more readable.

And so it is quite possible to misinterpret the improperly encoded headers as 8-bit octets that correspond to Unicode codepoints (the so-called "Latin-1" conversion). For spam, that is certainly good enough. And roundtripping it says that if APIs are not used to change it, you use the original binary for that header.

Certainly, this is one approach, and users of the email module in the py3k
standard lib are essentially doing this now.
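
For reference, the "Latin-1" conversion discussed here is lossless in both directions, which is what makes the round-trip safe. A small sketch, with hypothetical helper names `to_text` and `to_bytes`:

```python
def to_text(raw: bytes) -> str:
    """Map byte N to Unicode code point N. Never raises, whatever the input."""
    return raw.decode("latin-1")


def to_bytes(text: str) -> bytes:
    """Inverse mapping. Valid as long as the text was not edited to contain
    code points above U+00FF."""
    return text.encode("latin-1")


every_byte = bytes(range(256))          # stand-in for an arbitrarily broken header
as_text = to_text(every_byte)
round_tripped = to_bytes(as_text)
```

Because every byte value has exactly one code point, a header that is read through the Unicode interface and not modified can be written back out as the original bytes.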

And so how much is it a problem? What are the effects of the problem? Does providing a bytes API solve the problem, or simply punt it to the user? If it simply punts it to the user, are there significant benefits to the coder-user of obtaining the data as bytes, vs. obtaining it as bytes transliterated by the Latin-1 conversion to Unicode? If there are significant benefits to the coder-user, what are they?

One solution is to provide two sets of classes - the underlying
bytes-based one, and another unicode-based one, built on top of the
bytes classes, that implements the same API, but that may fail due to
encoding errors.
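
A rough sketch of that two-layer arrangement, reduced to a single header line. `BytesHeader` and `TextHeader` are hypothetical names, and the default `ascii` charset is an assumption; the point is only that both classes expose the same methods and that the Unicode one can fail where the bytes one cannot.

```python
class BytesHeader:
    """Bytes-level view: works on anything, however badly it was encoded."""

    def __init__(self, raw: bytes):
        self.raw = raw

    def name(self) -> bytes:
        return self.raw.partition(b":")[0].strip()

    def value(self) -> bytes:
        return self.raw.partition(b":")[2].strip()


class TextHeader:
    """Same API returning str, implemented on top of BytesHeader."""

    def __init__(self, raw: bytes, charset: str = "ascii"):
        self._inner = BytesHeader(raw)
        self._charset = charset

    def name(self) -> str:
        return self._inner.name().decode(self._charset)

    def value(self) -> str:
        # Raises UnicodeDecodeError if the sender did not encode properly.
        return self._inner.value().decode(self._charset)


good = TextHeader(b"Subject: hello")
bad_raw = b"Subject: caf\xe9"            # 8-bit data in a header
```

With this design a caller that hits the error on `TextHeader(bad_raw).value()` can still fall back to `BytesHeader(bad_raw).value()`.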

I think you meant "decoding" errors, there?


Well, yes and no. I meant that the encoding was done incorrectly.

Sure. The encoding wasn't done correctly, or wasn't done at all. But that causes problems for the decoder, on the receiving side.

I guess I'm not terribly concerned about the readability of improperly encoded email messages, whether they are spam or ham.

You may not be, but other users of the module are.

Sure, but if it isn't properly encoded, then either it is an ASCII superset, in which case the ASCII parts will be readable (at least), and so with a little human cleverness, the non-ASCII parts can be intuited. I'm not suggesting making it worse than what it already is, in bytes form; just to translate the bytes to Unicode codepoints so that they can be returned on a Unicode interface. If you return them in bytes, what would you do besides that? If you would guess at an encoding, and do a different decode, that can be done on the Unicode transliteration just as easily as it can on the bytes form.
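
To illustrate the last point: a guess at the real encoding can be applied after the fact to the Latin-1 transliteration, because the original bytes are fully recoverable from it. `redecode` is a hypothetical helper, and UTF-8 is just the guess used in this example.

```python
def redecode(transliterated: str, guessed_charset: str) -> str:
    """Undo the Latin-1 transliteration, then decode with the guessed charset."""
    return transliterated.encode("latin-1").decode(guessed_charset)


# A sender put raw UTF-8 into a header without declaring it.
sent = "na\u00efve caf\u00e9".encode("utf-8")

# What the Unicode interface hands back: readable ASCII, garbled non-ASCII.
received = sent.decode("latin-1")

# The application guesses UTF-8 and repairs the text.
repaired = redecode(received, "utf-8")
```

The same two steps work for any guessed charset, so nothing is lost by receiving the data as str rather than bytes.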

For ham, the correspondent should be informed that there are problems with their software, so that they can upgrade or reconfigure it.

How do you determine the correspondent if you can't parse their e-mail? 8-)

Email addresses are pretty standardized in format. Especially the Errors header and the From header. So I think the correspondent's email address will be reasonably interpretable even if their name is not, and the body of their message is not.
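
As a small illustration, assuming the display name was sent as undeclared UTF-8 and then transliterated via Latin-1: the standard library's `email.utils.parseaddr` still separates out the address, even though the name comes back garbled. The sample name and address are made up.

```python
from email.utils import parseaddr

# Display name sent as raw UTF-8 with no RFC 2047 encoding (improper).
raw_from = "Ren\u00e9 M\u00fcller <rene@example.com>".encode("utf-8")

# Latin-1 transliteration: the name is garbled, the ASCII address is intact.
as_text = raw_from.decode("latin-1")

name, address = parseaddr(as_text)
```

The address part is plain ASCII, so it survives any ASCII-superset mix-up and can be used to write back to the sender.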

I'm not saying all is wonderful if they didn't properly encode their message, but I think you are exaggerating the problem... you can write back to the email address, even if you can't read the message.

(Not that I don't understand those encodings, but it is something that
certainly can and should be mostly hidden under the covers.)


You're talking about a utopian state that Unicode strives but fails to achieve.

Messages that are properly encoded can certainly achieve the Utopian state under the covers.

Messages that are not properly encoded can be assumed to be Latin-1, and converted to Unicode. They may not be perfectly readable in that state, but face it, non-Unicode email clients did exactly that, but used Latin-1 bytes directly (or some other encoding). And if you think it would be helpful to have the default conversion to Unicode use some code page other than Latin-1, such as the currently configured code page, that is a fine alternative... and again, is much what happens today when people communicate without doing the proper encoding. Two people that use the same code page can communicate in that code page, but communicating with people that use other code pages is problematical.
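
One way the "currently configured code page" alternative might be sketched, with Latin-1 as the fallback that cannot fail. `decode_unlabelled` is a hypothetical name, and treating `locale.getpreferredencoding(False)` as "the configured code page" is an assumption.

```python
import locale


def decode_unlabelled(raw: bytes, code_page=None) -> str:
    """Decode text that arrived with no usable charset information."""
    if code_page is None:
        # Assumption: the locale's preferred encoding stands in for the
        # user's configured code page.
        code_page = locale.getpreferredencoding(False)
    try:
        return raw.decode(code_page)
    except (UnicodeDecodeError, LookupError):
        # Bytes invalid in that code page, or unknown code page name:
        # fall back to the pass-through Latin-1 conversion.
        return raw.decode("latin-1")


same_page = decode_unlabelled("caf\u00e9".encode("cp1252"), "cp1252")
fallback = decode_unlabelled(b"caf\xe9", "ascii")
```

Two parties on the same code page get readable text; anyone else gets the same kind of garbling an 8-bit client would have shown, and never an exception.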

So no, Unicode doesn't solve the problems with buggy software, but it can be used without making the problem worse, so using it generally makes for a more convenient API.

Think about the coder of the Python-based email client. Given the alternatives to use the Unicode API or the bytes API, how are they going to choose to use one or the other? Code the application twice, once with each API? No way! Too much work!

So they'll use the Unicode API for text, and the bytes APIs for binary attachments, because that is what is natural.
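
That split can be illustrated with `email.message.EmailMessage` from the email package in current Python 3 (an API that postdates this thread), where text parts come back as str and binary attachments as bytes. The message contents here are made up.

```python
from email.message import EmailMessage

msg = EmailMessage()
msg["Subject"] = "Report"
msg.set_content("R\u00e9sum\u00e9 attached.")            # text goes in as str
msg.add_attachment(b"\x89PNG\r\n\x1a\n",                 # binary goes in as bytes
                   maintype="image", subtype="png", filename="logo.png")

# Reading back: the same method returns str for the text part
# and bytes for the binary part.
body = msg.get_body(preferencelist=("plain",)).get_content()
blob = next(msg.iter_attachments()).get_content()
```

The application code never chooses between two parallel APIs; the content type of each part decides which kind of object it gets.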

If improperly encoded messages are received, and appropriate transliterations are made so that the bytes get converted (default code page) or passed through (Latin-1 transformation), then the data may be somewhat garbled for characters in the non-ASCII subset. But that is not different than the handling done by any 8-bit email client, nor, I suspect (a little uncertainty here) different than the handling done by Python < 3.0 mail libraries.

So that is not Utopian; Utopia can only be reached by following standards. But I don't see it as terrible; it is no worse than what happens today when the standards are not followed.



--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking