Re: [Python-3000] email libraries: use byte or unicode strings?

Glenn Linderman Wed, 05 Nov 2008 15:40:05 -0800

On approximately 11/5/2008 2:59 PM, came the following characters from the keyboard of Andrew McNamara:

I would find

        message[b'Subject'] = b'Hello'
to be totally gross.
While RFC Email is all ASCII, except where 8bit transfer is legal, there are internal encodings provided that permit the expression of Unicode in nearly any component of the email, except for header identifiers. But there are never Unicode characters in the transfer, as they always get encoded (there can be UTF-8 byte sequences, of course, if 8bit transfer is legal; if it is not, then even UTF-8 byte sequences must be further encoded).
Depending on the level of email interface, there should be no interface that cannot be expressed in terms of Unicode, plus an encoding to use for the associated data. Even 8-bit binary can be translated into a sequence of Unicode codepoints with the same numeric value, for example.
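To illustrate the last point, here is a minimal sketch in Python 3 of mapping arbitrary octets to Unicode codepoints with the same numeric value and back again. The helper names bytes_to_text and text_to_bytes are hypothetical, chosen only for this example; the mapping itself is what the standard "latin-1" codec does.

```python
def bytes_to_text(raw: bytes) -> str:
    # latin-1 maps each octet 0x00-0xFF to the codepoint with the same value,
    # so this never fails, whatever the bytes are.
    return raw.decode("latin-1")


def text_to_bytes(text: str) -> bytes:
    # The inverse mapping; it raises UnicodeEncodeError if the text contains
    # a codepoint above U+00FF, i.e. something that was never a single octet.
    return text.encode("latin-1")


all_octets = bytes(range(256))
as_text = bytes_to_text(all_octets)

# Each character carries the numeric value of the octet it came from.
assert [ord(ch) for ch in as_text] == list(range(256))
# Converting back reproduces the original bytes exactly.
assert text_to_bytes(as_text) == all_octets
```

Because the mapping is lossless in both directions, an API expressed in str can still carry binary data it does not understand.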
One significant problem is that the email module is intended to be
able to work with malformed e-mail without mangling it too badly. The
malformed e-mail should also make a round-trip through the email module
without being further mangled.



This is an interesting perspective... "stuff em" does come to mind :)

But I'm not at all clear on what you mean by a round-trip through the email module. Let me see... if you are creating an email, (1) you should encode it properly, and (2) a round-trip is mostly meaningless, unless you send it to yourself. So you probably mean email that is received, and that you want to send on. In this case, there is already a composed/encoded form of the email in hand; it could simply be sent as is without decoding or re-encoding. That would be quite a clean round-trip!

I think this requires the underlying processing to be all based on bytes,

Notice that I said _nothing_ about the underlying processing in my comments, only the API. I fully agree that some, perhaps most, of the underlying processing has to be aware of bytes, and use and manipulate bytes.

but doesn't preclude layers on top that parse the charset hints. The
rules about encoding are strict, but not always followed. For instance,
the headers *must* be ASCII (the header body can, however, be encoded -

see rfc2047).

Indeed, the headers must be ASCII, and once encoded, the header body is also.
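As a rough illustration of an RFC 2047 encoded header body being pure ASCII, the sketch below uses the email.header module from the Python 3 standard library. The sample subject text and the two helper function names are made up for this example.

```python
from email.header import Header, decode_header, make_header


def encode_header_body(text: str) -> str:
    # Produce the RFC 2047 "encoded-word" form of a header body.
    return Header(text, charset="utf-8").encode()


def decode_header_body(encoded: str) -> str:
    # Parse the encoded-words and reassemble them into a Unicode string.
    return str(make_header(decode_header(encoded)))


subject = "Grüße aus Köln"  # sample text, not from the thread
wire_form = encode_header_body(subject)

# What goes over the wire contains only ASCII characters...
assert all(ord(ch) < 128 for ch in wire_form)
# ...yet the Unicode text is fully recoverable from it.
assert decode_header_body(wire_form) == subject
```

The caller only ever sees str on both sides; the charset and transfer encoding stay under the covers.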

Spammers often ignore this, and you might be inclined to
say "stuff em'", but this would make the SpamBayes authors rather unhappy.

And so it is quite possible to misinterpret the improperly encoded headers as 8-bit octets that correspond to Unicode codepoints (the so-called "Latin-1" conversion). For spam, that is certainly good enough. And round-tripping then just means that if APIs are not used to change a header, you use the original binary for that header.
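One way this could look, as a sketch only: the class below is hypothetical and is not part of the email package. It keeps the original bytes of a header body, exposes them as str via the Latin-1 conversion, and re-encodes (here with RFC 2047 and UTF-8, an assumption for the example) only if the value was changed through the API.

```python
from email.header import Header


class RoundTripHeaderBody:
    """Hypothetical wrapper around one header body; illustration only."""

    def __init__(self, raw: bytes):
        self._raw = raw        # original octets, possibly malformed
        self._changed = None   # new text, set only via the API

    @property
    def text(self) -> str:
        if self._changed is not None:
            return self._changed
        # Latin-1 conversion: never fails, consistent for identical input.
        return self._raw.decode("latin-1")

    @text.setter
    def text(self, value: str) -> None:
        self._changed = value

    def as_bytes(self) -> bytes:
        if self._changed is None:
            # Untouched: emit exactly what was received.
            return self._raw
        # Changed: emit a properly encoded, ASCII-only header body.
        return Header(self._changed, charset="utf-8").encode().encode("ascii")


body = RoundTripHeaderBody(b"Caf\xe9 sp\xe9cial")   # raw 8-bit, not RFC 2047
assert body.text == "Café spécial"
assert body.as_bytes() == b"Caf\xe9 sp\xe9cial"     # unchanged on the way out
```

A filter that only reads body.text never alters the message, while a client that assigns to body.text gets a correctly encoded header on output.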

One solution is to provide two sets of classes - the underlying
bytes-based one, and another unicode-based one, built on top of the
bytes classes, that implements the same API, but that may fail due to
encoding errors.



I think you meant "decoding" errors, there?

I guess I'm not terribly concerned about the readability of improperly encoded email messages, whether they are spam or ham. For the purposes of SpamBayes (which I assume is similar to SpamAssassin, only written in Python), it doesn't matter if the data is readable, only that it is recognizably similar. So a consistent mis-transliteration is as good as a correct decoding.

For ham, the correspondent should be informed that there are problems with their software, so that they can upgrade or reconfigure it. And a mis-transliteration is likely the best that can be provided in that case anyway... unless the mail API provides for ignoring the incoming (incorrect or missing) encoding directives and using one provided through the API, so that the client can try a few until they stumble on one that produces a readable result. But if the mis-transliteration is done using the Latin-1 conversion to Unicode, the client, if it wants to do that sort of heuristic analysis, can re-encode to Latin-1, and then decode using some other encoding(s), independently of the mail APIs providing such a facility.
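The re-encode-then-decode heuristic can be sketched as follows. The function name redecode, the candidate list, and the sample text are all assumptions made for this example; only str.encode and bytes.decode from Python 3 are used.

```python
def redecode(mistransliterated: str, candidates):
    """Try other encodings on text that was produced by the Latin-1 conversion."""
    # Step 1: undo the Latin-1 conversion to get the original octets back.
    raw = mistransliterated.encode("latin-1")
    results = {}
    for name in candidates:
        try:
            # Step 2: decode those octets with a candidate encoding.
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            # The octets are not valid in this encoding; skip it.
            pass
    return results


# Simulate a message whose bytes were KOI8-R but carried no usable charset
# label, so the mail layer fell back to the Latin-1 conversion.
received = "Привет".encode("koi8-r").decode("latin-1")

guesses = redecode(received, ["utf-8", "koi8-r", "cp1251"])
assert guesses["koi8-r"] == "Привет"   # the readable result
assert "utf-8" not in guesses          # the bytes are not valid UTF-8
```

Note that cp1251 also decodes without error but yields unreadable text, so choosing among the surviving guesses is still left to the client or the user.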

I do hope to learn and use the Python mail APIs, and I was hoping to do that in Python 3.0 (and am sorry, but not surprised, to hear that this is an area of problems at present), and I was hoping that the interfaces that would be presented by Python 3.0 mail APIs would be in terms of Unicode, for the convenience of being abstracted away from the plethora of encodings that are defined at the mail transport layer. (Not that I don't understand those encodings, but it is something that certainly can and should be mostly hidden under the covers.)


--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking