On approximately 11/5/2008 4:24 PM, came the following characters from the keyboard of Andrew McNamara:
But I'm not at all clear on what you mean by a round-trip through the email module. Let me see... if you are creating an email, then (1) you should encode it properly, and (2) a round-trip is mostly meaningless, unless you send it to yourself. So you probably mean email that is received, and that you want to send on. In this case, there is already a composed/encoded form of the email in hand; it could simply be sent as is without decoding or re-encoding. That would be quite a clean round-trip!

Imagine a mail proxy of some sort (SMTP or a list manager like Mailman) - you want to be able to parse a message, maybe make some minor changes
(such as adding a "Received:" header, or stripping out illegal MIME types)
and then emit something that differs from the original in only the ways
that you specified.


Sure. Add header, delete header APIs would suffice for this. The APIs could accept Unicode, but do bytes manipulations.
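A minimal sketch of what such add/delete calls might look like, assuming the message is held as raw bytes with CRLF line endings. The names `add_header` and `delete_header` are hypothetical, not part of the email module; the point is only that the caller passes str while every untouched byte of the original survives.

```python
def add_header(raw: bytes, name: str, value: str) -> bytes:
    """Prepend one header line; the original bytes are left untouched."""
    # Assumption: name and value are plain ASCII. A real API would apply
    # RFC 2047 encoding to a non-ASCII value instead of raising here.
    line = f"{name}: {value}\r\n".encode("ascii")
    return line + raw


def delete_header(raw: bytes, name: str) -> bytes:
    """Drop every header called `name`, including folded continuation lines."""
    head, sep, body = raw.partition(b"\r\n\r\n")
    prefix = name.encode("ascii").lower() + b":"
    kept = []
    skipping = False
    for line in head.split(b"\r\n"):
        if line[:1] in (b" ", b"\t"):
            # Continuation line: it shares the fate of the header above it.
            if skipping:
                continue
        else:
            skipping = line.lower().startswith(prefix)
            if skipping:
                continue
        kept.append(line)
    return b"\r\n".join(kept) + sep + body
```

Headers that are not named in a call, including ones containing stray 8-bit bytes, come out byte-for-byte identical.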


Another example - imagine what a mail transport agent does with bounces:
it wraps them in a MIME wrapper, but otherwise changes the structure
as little as possible (because that would make later analysis of the
bounce problematic).


So they usually truncate the size too, to 10K or less. Enough to get all the headers. Some only send headers back. So it is no problem. A "retrieve headers in binary from message" API, followed by "add this chunk of binary as a MIME part" to the new bounce message under construction. The first could be replaced by "retrieve message as bytes" and "substr", as an alternative. So yes, some bytes APIs are necessary for binary MIME parts and the whole message (as I mentioned before), and there may be a few other special cases. But mostly, just Unicode.
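As a rough illustration of those two calls, here is a sketch using the MIME classes in today's standard library. The helper names `header_bytes` and `build_bounce` are made up for this example, and attaching the headers as application/octet-stream is an assumption chosen so the bytes survive exactly; a real MTA would more likely use a dedicated report type.

```python
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def header_bytes(raw: bytes, limit: int = 10 * 1024) -> bytes:
    """Return the header block of a raw message, truncated to `limit` bytes."""
    head, _, _ = raw.partition(b"\r\n\r\n")
    return head[:limit]


def build_bounce(raw: bytes, explanation: str) -> MIMEMultipart:
    """Wrap the original headers, untouched, inside a new bounce message."""
    bounce = MIMEMultipart("report")
    bounce["Subject"] = "Undelivered Mail Returned to Sender"
    bounce.attach(MIMEText(explanation))
    # The chunk of binary goes in as an opaque part (base64 by default),
    # so nothing in it is decoded or re-encoded as text.
    bounce.attach(MIMEApplication(header_bytes(raw), "octet-stream"))
    return bounce
```

Only the explanation text goes through a str interface here; the original headers are handled purely as bytes.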



Notice that I said _nothing_ about the underlying processing in my comments, only the API. I fully agree that some, perhaps most, of the underlying processing has to be aware of bytes, and use and manipulate bytes.

The bytes API has to be accessible - there are many contexts in which
you need to work at this level.


Maybe. I named a couple, you've named another, maybe there are a few more. The only reason not to have a full bytes API is just the effort to support it... if that can reasonably be avoided, why not? But I doubt there are a lot of cases that _must_ be handled as bytes, and so if we can identify the ones that indeed, must be, and supply them, the rest can be Unicode.


Indeed, the headers must be ASCII, and once encoded, the header body is also.

Except when they're not. It's not uncommon in mail handling to get a
valid message that doesn't conform to the specs (not just spam). You can
either throw your hands up in the air and declare it irredeemably broken,
or do your best to extract meaning from it. Invariably, it's the CEO's
best mate who sent the malformed message, so you process it or find a
new job.


This is where you use the Latin-1 conversion. Don't throw an error when it doesn't conform, but don't go to heroic efforts to provide bytes alternatives... just convert the bytes to Unicode, and the way the mail RFCs are written, and the types of encodings used, it is mostly readable. And if it isn't encoded, it is even more readable.


And so it is quite possible to misinterpret the improperly encoded headers as 8-bit octets that correspond to Unicode codepoints (the so-called "Latin-1" conversion). For spam, that is certainly good enough. And roundtripping it says that if APIs are not used to change it, you use the original binary for that header.
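The property being relied on is that Latin-1 maps each byte value 0-255 to the Unicode codepoint with the same number, so the conversion never fails and can always be undone. A small sketch (the function names are illustrative only):

```python
def bytes_to_text(raw: bytes) -> str:
    """Map each octet to the codepoint of the same value; never raises."""
    return raw.decode("latin-1")


def text_to_bytes(text: str) -> bytes:
    """Inverse mapping; recovers the original octets exactly."""
    return text.encode("latin-1")


# Every possible byte value survives the round trip.
every_byte = bytes(range(256))
assert text_to_bytes(bytes_to_text(every_byte)) == every_byte

# An improperly encoded header: raw UTF-8 where only ASCII is allowed.
header = "Subject: café".encode("utf-8")
as_text = bytes_to_text(header)         # garbled after "caf", ASCII intact
assert as_text.startswith("Subject: caf")
assert text_to_bytes(as_text) == header  # original binary is recoverable
```

So a header that no API call has changed can be written back out with its original binary.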

Certainly, this is one approach, and users of the email module in the py3k
standard lib are essentially doing this now.


And so how much is it a problem? What are the effects of the problem? Does providing a bytes API solve the problem, or simply punt it to the user? If it simply punts it to the user, are there significant benefits to the coder-user of obtaining the data as bytes, vs. obtaining it as bytes transliterated by the Latin-1 conversion to Unicode? If there are significant benefits to the coder-user, what are they?


One solution is to provide two sets of classes - the underlying
bytes-based one, and another unicode-based one, built on top of the
bytes classes, that implements the same API, but that may fail due to
encoding errors.
I think you meant "decoding" errors, there?

Well, yes and no. I meant that the encoding was done incorrectly.


Sure. The encoding wasn't done correctly, or wasn't done at all. But that causes problems for the decoder, on the receiving side.


I guess I'm not terribly concerned about the readability of improperly encoded email messages, whether they are spam or ham.

You may not be, but other users of the module are.


Sure, but if it isn't properly encoded, then it is most likely in some ASCII superset, in which case the ASCII parts will be readable (at least), and so with a little human cleverness, the non-ASCII parts can be intuited. I'm not suggesting making it worse than what it already is, in bytes form; just to translate the bytes to Unicode codepoints so that they can be returned on a Unicode interface. If you return them in bytes, what would you do besides that? If you would guess at an encoding, and do a different decode, that can be done on the Unicode transliteration just as easily as it can on the bytes form.
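To make that last claim concrete, here is a sketch of guessing at an encoding starting from the Unicode transliteration rather than from the bytes. The helper name `reinterpret` is hypothetical.

```python
def reinterpret(text: str, guess: str) -> str:
    """Re-decode a Latin-1 transliteration under a guessed encoding.

    Raises UnicodeDecodeError if the guess does not fit the data,
    exactly as decoding the original bytes would have.
    """
    return text.encode("latin-1").decode(guess)


# What a Unicode interface would hand back for unencoded UTF-8 bytes.
transliterated = "café".encode("utf-8").decode("latin-1")

# The guess works the same as it would on the bytes form.
assert reinterpret(transliterated, "utf-8") == "café"
```

The extra `encode("latin-1")` step is the only difference from working on the bytes directly.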


For ham, the correspondent should be informed that there are problems with their software, so that they can upgrade or reconfigure it.

How do you determine the correspondent if you can't parse their e-mail? 8-)


Email addresses are pretty standardized in format. Especially the Errors header and the From header. So I think the correspondent's email address will be reasonably interpretable even if their name is not, and the body of their message is not.

I'm not saying all is wonderful if they didn't properly encode their message, but I think you are exaggerating the problem... you can write back to the email address, even if you can't read the message.
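For example, a From header whose display name was sent as raw 8-bit data still yields a usable address after the Latin-1 conversion. This sketch uses `email.utils.parseaddr` from the standard library; the sample header and address are invented.

```python
from email.utils import parseaddr

# A non-conforming header: the display name is raw UTF-8, not RFC 2047 encoded.
raw_header = "From: Jörg Müller <joerg@example.com>".encode("utf-8")

# Latin-1 conversion: the name is garbled, the ASCII address is untouched.
text = raw_header.decode("latin-1")
value = text.partition(":")[2].strip()

name, addr = parseaddr(value)
assert addr == "joerg@example.com"  # enough to write back to the sender
```

The name comes back garbled, but the address is plain ASCII and parses normally.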


(Not that I don't understand those encodings, but it is something that
certainly can and should be mostly hidden under the covers.)

You're talking about a utopian state that Unicode strives but fails to achieve.


Messages that are properly encoded can certainly achieve the Utopian state under the covers.

Messages that are not properly encoded can be assumed to be Latin-1, and converted to Unicode. They may not be perfectly readable in that state, but face it, non-Unicode email clients did exactly that, but used Latin-1 bytes directly (or some other encoding). And if you think it would be helpful to have the default conversion to Unicode use some code page other than Latin-1, such as the currently configured code page, that is a fine alternative... and again, is much what happens today when people communicate without doing the proper encoding. Two people that use the same code page can communicate in that code page, but communicating with people that use other code pages is problematical.

So no, Unicode doesn't solve the problems with buggy software, but it can be used without making the problem worse, so using it generally makes for a more convenient API.

Think about the coder of the Python-based email client. Given the alternatives to use the Unicode API or the bytes API, how are they going to choose to use one or the other? Code the application twice, once with each API? No way! Too much work!

So they'll use the Unicode API for text, and the bytes APIs for binary attachments, because that is what is natural.

If improperly encoded messages are received, and appropriate transliterations are made so that the bytes get converted (default code page) or passed through (Latin-1 transformation), then the data may be somewhat garbled for characters in the non-ASCII subset. But that is not different than the handling done by any 8-bit email client, nor, I suspect (a little uncertainty here) different than the handling done by Python < 3.0 mail libraries.

So that is not Utopian; Utopia can only be reached by following standards. But I don't see it as terrible; it is no worse than what happens today when the standards are not followed.


--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking