On approximately 11/5/2008 2:59 PM, came the following characters from the keyboard of Andrew McNamara:
I would find

        message[b'Subject'] = b'Hello'

to be totally gross.

While RFC email is all ASCII, except where 8bit transfer is legal, there are internal encodings provided that permit the expression of Unicode in nearly any component of the email, except for header identifiers. But there are never Unicode characters in the transfer, as they always get encoded (there can be UTF-8 byte sequences, of course, if 8bit transfer is legal; if it is not, then even UTF-8 byte sequences must be further encoded).

Whatever the level of the email interface, there should be no interface that cannot be expressed in terms of Unicode, plus an encoding to use for the associated data. Even 8-bit binary can be translated into a sequence of Unicode codepoints with the same numeric values, for example.
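
To make the last point concrete, here is a minimal sketch in Python 3 of the byte-to-codepoint mapping, assuming the codec name "latin-1" (ISO 8859-1) is what is meant by "the same numeric value". The variable names are only illustrative.

```python
# Every byte value 0..255 maps to the Unicode code point with the same number.
raw = bytes(range(256))
text = raw.decode("latin-1")

# The code points match the byte values one for one...
same_values = [ord(ch) for ch in text] == list(raw)
# ...and encoding back recovers the original octets exactly.
round_trip_ok = text.encode("latin-1") == raw

print(same_values, round_trip_ok)
```

Because the mapping is one-to-one in both directions, nothing is lost by carrying arbitrary octets around as a str this way.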

One significant problem is that the email module is intended to be
able to work with malformed e-mail without mangling it too badly. The
malformed e-mail should also make a round-trip through the email module
without being further mangled.


This is an interesting perspective... "stuff em" does come to mind :)

But I'm not at all clear on what you mean by a round-trip through the email module. Let me see... if you are creating an email, (1) you should encode it properly, and (2) a round-trip is mostly meaningless, unless you send it to yourself. So you probably mean email that is received, and that you want to send on. In this case, there is already a composed/encoded form of the email in hand; it could simply be sent as is without decoding or re-encoding. That would be quite a clean round-trip!


I think this requires the underlying processing to be all based on bytes,

Notice that I said _nothing_ about the underlying processing in my comments, only the API. I fully agree that some, perhaps most, of the underlying processing has to be aware of bytes, and use and manipulate bytes.


but doesn't preclude layers on top that parse the charset hints. The
rules about encoding are strict, but not always followed. For instance,
the headers *must* be ASCII (the header body can, however, be encoded -
see rfc2047).


Indeed, the headers must be ASCII, and once encoded, the header body is also.
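
As an illustration only, here is roughly what that looks like with the email.header module as it exists in the current Python 3 standard library (not necessarily the API under discussion for 3.0). The subject text is a made-up example.

```python
from email.header import Header, decode_header

subject = "Grüße aus Köln"  # hypothetical header body containing non-ASCII text

# RFC 2047 encoding turns the header body into an ASCII-only "encoded word".
encoded = Header(subject, charset="utf-8").encode()
is_ascii = all(ord(ch) < 128 for ch in encoded)

# decode_header() yields (value, charset) pairs; encoded words come back as bytes.
decoded = "".join(
    part.decode(charset or "ascii") if isinstance(part, bytes) else part
    for part, charset in decode_header(encoded)
)

print(encoded)
print(is_ascii, decoded == subject)
```

The caller deals in Unicode at both ends; the ASCII wire form exists only in between.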


Spammers often ignore this, and you might be inclined to
say "stuff em'", but this would make the SpamBayes authors rather unhappy.


And so it is quite possible to misinterpret the improperly encoded headers as 8-bit octets that correspond to Unicode codepoints (the so-called "Latin-1" conversion). For spam, that is certainly good enough. And round-tripping means that if the APIs are not used to change a header, you use the original binary for that header.


One solution is to provide two sets of classes - the underlying
bytes-based one, and another unicode-based one, built on top of the
bytes classes, that implements the same API, but that may fail due to
encoding errors.


I think you meant "decoding" errors, there?

I guess I'm not terribly concerned about the readability of improperly encoded email messages, whether they are spam or ham. For the purposes of SpamBayes (which I assume is similar to SpamAssassin, only written in Python), it doesn't matter if the data is readable, only that it is recognizably similar. So a consistent mis-transliteration is as good as a correct decoding.

For ham, the correspondent should be informed that there are problems with their software, so that they can upgrade or reconfigure it. And a mis-transliteration is likely the best that can be provided in that case anyway... unless the mail API provides for ignoring the incoming (incorrect or missing) encoding directives and using one provided by the API, and the client can select a few until they stumble on one that produces a readable result. But if the mis-transliteration is done using the Latin-1 conversion to Unicode, the client, if it chooses to want to do that sort of heuristic analysis, can reencode to Latin-1, and then decode using some other encoding(s), independently of the mail APIs providing such a facility.
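
A rough sketch of that client-side heuristic, assuming the mail API handed back a Latin-1 mis-transliteration. The function name redecode and the candidate list are hypothetical, not part of any mail API.

```python
def redecode(mistransliterated, candidates):
    """Return (encoding, text) for the first candidate that decodes cleanly."""
    # Re-encoding as Latin-1 recovers the original octets exactly.
    raw = mistransliterated.encode("latin-1")
    for encoding in candidates:
        try:
            return encoding, raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Nothing fit; keep the consistent mis-transliteration.
    return "latin-1", mistransliterated


# Simulate a header whose UTF-8 octets were read as Latin-1.
original = "Привет".encode("utf-8")
as_latin1 = original.decode("latin-1")

encoding, text = redecode(as_latin1, ["ascii", "utf-8"])
print(encoding, text)
```

Note that "decodes cleanly" is weaker than "readable": most single-byte encodings accept nearly any octet sequence, so a real client would still need a person, or some further heuristic, to judge the result.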

I do hope to learn and use the Python mail APIs, and I was hoping to do that in Python 3.0 (and am sorry, but not surprised, to hear that this is an area of problems at present), and I was hoping that the interfaces that would be presented by Python 3.0 mail APIs would be in terms of Unicode, for the convenience of being abstracted away from the plethora of encodings that are defined at the mail transport layer. (Not that I don't understand those encodings, but it is something that certainly can and should be mostly hidden under the covers.)

--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
_______________________________________________
Python-3000 mailing list
Python-3000@python.org
http://mail.python.org/mailman/listinfo/python-3000