On Apr 9, 2009, at 8:07 AM, Steve Holden wrote:

The real problem I came across in storing email in a relational database
was the inability to store messages as Unicode. Some messages have a
body in one encoding and an attachment in another, so the only ways to
store the messages are either as a monolithic bytes string that gets
parsed when the individual components are required or as a sequence of
components in the database's preferred encoding (if you want to keep the
original encoding most relational databases won't be able to help unless
you store the components as bytes).

All in all, as you might expect from a system that's been growing up
since 1970 or so, it can be quite intractable.

There are really two ways to look at an email message. It's either an unstructured blob of bytes, or it's a structured tree of objects. Those objects have headers and payload. The payload can be of any type, though I think it generally breaks down into "strings" for text/* types and bytes for anything else (not counting multiparts).
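
To make the tree view concrete, here is a minimal sketch. It assumes the EmailMessage API found in present-day Python 3, which postdates this message; build_sample and payload_kinds are hypothetical names used only for illustration. It walks the tree and reports which leaf parts yield text and which yield bytes.

```python
from email.message import EmailMessage


def build_sample():
    """Build a two-part message: a text body plus a binary attachment."""
    msg = EmailMessage()
    msg["Subject"] = "sample"
    msg.set_content("caf\u00e9 body")  # becomes a text/plain part
    msg.add_attachment(
        b"\x00\x01\x02",
        maintype="application",
        subtype="octet-stream",
        filename="blob.bin",
    )
    return msg


def payload_kinds(msg):
    """Return (content type, Python type name) for every leaf part."""
    kinds = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers hold other parts, not a payload of their own
        content = part.get_content()
        kinds.append((part.get_content_type(), type(content).__name__))
    return kinds
```

Under these assumptions the text/plain part comes back as str and the application/octet-stream part as bytes, which matches the split described above.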

The email package isn't a perfect mapping to this, which is something I want to improve. That aside, I think storing a message in a database means storing some or all of the headers separately from the byte stream (or text?) of its payload. That's for non-multipart types. It would be more complicated to represent a message tree of course.
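
As a rough sketch of that storage layout for a non-multipart message, assuming SQLite via the standard sqlite3 module and a present-day Python 3 email package: the table names, column names, and the store_message function are all hypothetical, and a real schema would likely need more than this.

```python
import sqlite3
from email.message import EmailMessage


def store_message(conn, msg):
    """Store a non-multipart message: headers as text rows, payload as bytes."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages "
        "(id INTEGER PRIMARY KEY, payload BLOB)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS headers "
        "(message_id INTEGER, name TEXT, value TEXT)"
    )
    # decode=True undoes the content transfer encoding and returns bytes.
    payload = msg.get_payload(decode=True)
    cur = conn.execute("INSERT INTO messages (payload) VALUES (?)", (payload,))
    msg_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO headers VALUES (?, ?, ?)",
        [(msg_id, name, str(value)) for name, value in msg.items()],
    )
    return msg_id


conn = sqlite3.connect(":memory:")
msg = EmailMessage()
msg["Subject"] = "Gr\u00fc\u00dfe"
msg.set_content("h\u00e9llo")
msg_id = store_message(conn, msg)
```

Keeping the payload as a BLOB sidesteps the question of the database's preferred encoding; the charset stays recoverable from the stored Content-Type header.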

It does seem to make sense to think about headers as text header names and text header values. Of course, header values can contain almost anything and there's an encoding to bring it back to 7-bit ASCII, but again, you really have two views of a header value. Which you want really depends on your application.
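
The two views of a header value can be sketched with email.header from the standard library. The helper names encoded_view and text_view are hypothetical, and the sketch assumes Python 3 string semantics.

```python
from email.header import Header, decode_header, make_header


def encoded_view(text, charset="utf-8"):
    """Return the 7-bit ASCII (RFC 2047) form of a header value."""
    return Header(text, charset).encode()


def text_view(raw):
    """Return the human-readable text form of an encoded header value."""
    return str(make_header(decode_header(raw)))


raw = encoded_view("Gr\u00fc\u00dfe")  # pure ASCII, safe to put on the wire
text = text_view(raw)                  # the original text, decoded again
```

Which of the two an application stores depends on whether it cares about what the sender meant or about exactly what was sent.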

Maybe you just care about the text of both the header name and value. In that case, I think you want the values as unicodes, and probably the headers as unicodes containing only ASCII. So your table would be strings in both cases. OTOH, maybe your application cares about the raw underlying encoded data, in which case the header names are probably still strings of ASCII-ish unicodes and the values are bytes. It's this distinction (and I think the competing use cases) that makes a true Python 3.x API for email more complicated.

Thinking about this stuff makes me nostalgic for the sloppy happy days of Python 2.x.

-Barry

_______________________________________________
Email-SIG mailing list
Email-SIG@python.org