Re: [Python-3000] email libraries: use byte or unicode strings?

Barry Warsaw Thu, 06 Nov 2008 09:02:22 -0800

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Nov 5, 2008, at 6:39 PM, Glenn Linderman wrote:

This is an interesting perspective... "stuff em" does come to mind :)
But I'm not at all clear on what you mean by a round-trip through the email module. Let me see... if you are creating an email, you (1) should encode it properly (2) a round-trip is mostly meaningless, unless you send it to yourself. So you probably mean email that is received, and that you want to send on. In this case, there is already a composed/encoded form of the email in hand; it could simply be sent as is without decoding or re-encoding. That would be quite a clean round-trip!

There are two ways to create an email DOM. One is out of whole cloth (i.e. creating Message objects and their subclasses, then attaching them into a tree). Note that it is a "generator" whose job it is to take the DOM and produce an RFC-compliant flat textual representation.
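As a rough illustration of the "whole cloth" path, here is a minimal sketch using the email package as it ships in current Python 3, which may differ in detail from the design being discussed in this thread. The addresses and subject are made up for the example.

```python
from io import StringIO
from email.generator import Generator
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# Build the tree by hand: a multipart root with one text/plain child.
root = MIMEMultipart()
root["From"] = "sender@example.com"        # illustrative address
root["To"] = "recipient@example.com"       # illustrative address
root["Subject"] = "Built from whole cloth"
root.attach(MIMEText("Hello, world.\n", "plain"))

# The generator walks the tree and writes the flat textual form.
buf = StringIO()
Generator(buf, mangle_from_=False).flatten(root)
flat_text = buf.getvalue()
```

The model objects only hold the structure; producing the flat representation is entirely the generator's job.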

The other way to get a DOM is to parse some flat textual representation. In this case, it is a core design requirement that the parser never throws an exception, and that there is a way to record and retrieve the defects in a message.
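A small sketch of that requirement, again using the email package from current Python 3 rather than anything specific to this thread: the input below claims to be multipart but never contains its boundary, and the parser records that as a defect instead of raising.

```python
import email
from email import errors

# A message that declares a multipart boundary but never uses it.
raw = (
    'Content-Type: multipart/mixed; boundary="XYZ"\n'
    "\n"
    "no boundary lines here\n"
)

# Parsing does not raise; problems are collected on the message object.
msg = email.message_from_string(raw)
defect_names = [type(d).__name__ for d in msg.defects]
missing_start = any(
    isinstance(d, errors.StartBoundaryNotFoundDefect) for d in msg.defects
)
```

Callers that care can inspect `msg.defects` afterwards; callers that do not can carry on with whatever the parser managed to recover.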

The core model objects of Message (and their MIME subclasses) and Header should treat everything internally as bytes. The edges are where you want to be able to accept varying types, but always convert to bytes internally. Edges of this system include the parser, the generator, and various setter and getter methods of Message and Header.
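To make the "bytes inside, flexible at the edges" idea concrete, here is a purely hypothetical sketch. `BytesHeader` is not part of the email package and is not a proposal for its API; it also skips RFC 2047 encoding entirely, so its output is not RFC-compliant for non-ASCII values.

```python
class BytesHeader:
    """Hypothetical header model: bytes internally, str or bytes at the edges."""

    def __init__(self, name, value, charset="utf-8"):
        self._charset = charset
        self._name = self._to_bytes(name, "ascii")
        self._value = self._to_bytes(value, charset)

    @staticmethod
    def _to_bytes(value, charset):
        # Edge conversion: accept either type, always store bytes.
        if isinstance(value, bytes):
            return value
        if isinstance(value, str):
            return value.encode(charset)
        raise TypeError("expected str or bytes, got %r" % type(value).__name__)

    def as_bytes(self):
        # Getter for callers that want the raw internal form.
        return self._name + b": " + self._value

    def as_text(self, errors="replace"):
        # Getter for callers that want text; undecodable bytes never raise.
        return self._value.decode(self._charset, errors)


from_text = BytesHeader("Subject", "h\u00e9llo")
from_bytes = BytesHeader("Subject", b"raw bytes")
```

Both constructors end up with the same internal representation, which is the point: the type juggling happens once, at the boundary.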

The current code has a strong desire to be idempotent, so that parser->DOM->generator output is exactly the same as input. Small changes to the DOM or content in between should have minimal effect. For example, if you delete a header and then add it back, the header will show up at the end of the RFC 2822 header list, but everything else about the message will be unchanged.
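The round-trip and the delete-then-re-add behaviour can be checked with a short sketch. This uses the email package in current Python 3 with its default settings, and a deliberately simple, well-formed message; it says nothing about how more complex or defective input behaves.

```python
import email

raw = (
    "From: a@example.com\n"
    "To: b@example.com\n"
    "Subject: round trip\n"
    "\n"
    "Body text.\n"
)

# parser -> DOM -> generator, with no changes in between.
msg = email.message_from_string(raw)
round_tripped = msg.as_string()
unchanged = round_tripped == raw

# Delete a header and add it back: it reappears at the end of the list.
sender = msg["From"]
del msg["From"]
msg["From"] = sender
header_order = msg.keys()
body_after = msg.get_payload()
```

After the edit the header order is To, Subject, From, while the body is left exactly as it was parsed.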

Currently idempotency is broken for defective messages. The generator is guaranteed to produce RFC-compliant output, repairing defects like missing boundaries and such.

I guess I'm not terribly concerned about the readability of improperly encoded email messages, whether they are spam or ham. For the purposes of SpamBayes (which I assume is similar to spamassassin, only written in Python), it doesn't matter if the data is readable, only that it is recognizably similar. So a consistent mis-transliteration is as good as a correct decoding.

The key thing is that parse should never ever raise an exception. We've learned the hard way that this is the most practical thing because at the level most parsing happens, you really cannot handle any errors.

For ham, the correspondent should be informed that there are problems with their software, so that they can upgrade or reconfigure it.

That's a practical impossibility in real-world applications, as is simply discarding malformed messages. Email sucks.


- -Barry


-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Darwin)

iQCVAwUBSRMjE3EjvBPtnXfVAQKMYAP/VbzETAnCegJavJ4zIB37hbWBWmp4yClY
RRzdTXQQY8VxFioxlVwHaxa7AHW/xADsFEkOsm0saWnld4pbu9m00T6KccAOp3eY
BbqXUixFRR6DmyiuLk+0F/cBlgnPH8y3XnlTXsEdXS2za5tW6YoyCsfTu9xGl0Qp
aC7ta6xcvNk=
=NgCu
-----END PGP SIGNATURE-----
