Tony Nelson wrote:
> (email-sig added)
>
> At 08:07 -0400 04/09/2009, Steve Holden wrote:
>> Barry Warsaw wrote:
> ...
>>> This is an interesting question, and something I'm struggling with for
>>> the email package for 3.x. It turns out to be pretty convenient to have
>>> both a bytes and a string API, both for input and output, but I think
>>> email really wants to be represented internally as bytes. Maybe. Or
>>> maybe just for content bodies and not headers, or maybe both. Anyway,
>>> aside from that decision, I haven't come up with an elegant way to allow
>>> /output/ in both bytes and strings (input is I think theoretically
>>> easier by sniffing the arguments).
>>>
>> The real problem I came across in storing email in a relational database
>> was the inability to store messages as Unicode. Some messages have a
>> body in one encoding and an attachment in another, so the only ways to
>> store the messages are either as a monolithic bytes string that gets
>> parsed when the individual components are required or as a sequence of
>> components in the database's preferred encoding (if you want to keep the
>> original encoding most relational databases won't be able to help unless
>> you store the components as bytes).
> ...
>
> I found it confusing myself, and did it wrong for a while. Now, I
> understand that messages come over the wire as bytes, either 7-bit US-ASCII
> or 8-bit whatever, and are parsed at the receiver. I think of the database
> as a wire to the future, and store the data as bytes (a BLOB), letting the
> future receiver parse them as it did the first time, when I cleaned the
> message. Data I care to query is extracted into fields (in UTF-8, what I
> usually use for char fields). I have no need to store messages as Unicode,
> and they aren't Unicode anyway. I have no need ever to flatten a message
> to Unicode, only to US-ASCII or, for messages (spam) that are corrupt, raw
> 8-bit data.
>
> If you need the data from the message, by all means extract it and store it
> in whatever form is useful to the purpose of the database. If you need the
> entire message, store it intact in the database, as the bytes it is. Email
> isn't Unicode any more than a JPEG or other image types (often payloads in
> a message) are Unicode.
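
For concreteness, the scheme described above (keep the wire bytes intact in
a BLOB, and pull out only the fields you want to query as text) might look
roughly like the sketch below in current Python 3. It is an illustration
under assumptions, not anyone's actual code from this thread: it uses
sqlite3 purely as a stand-in database, and the table layout and the
store_message/load_message helpers are hypothetical names invented for the
example.

```python
import sqlite3
from email import message_from_bytes
from email.header import decode_header, make_header


def store_message(conn, raw):
    """Store the raw wire bytes untouched, plus a decoded subject for queries."""
    msg = message_from_bytes(raw)
    # Decode any RFC 2047 encoded-words so the text column holds real text.
    subject = str(make_header(decode_header(msg.get("Subject", ""))))
    cur = conn.execute(
        "INSERT INTO messages (subject, raw) VALUES (?, ?)",
        (subject, raw),  # bytes are bound as a BLOB by sqlite3
    )
    return cur.lastrowid


def load_message(conn, msg_id):
    """Re-parse the stored bytes, just as the original receiver did."""
    row = conn.execute(
        "SELECT raw FROM messages WHERE id = ?", (msg_id,)
    ).fetchone()
    return message_from_bytes(row[0])


# Hypothetical schema: one queryable text field next to the intact message.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages (id INTEGER PRIMARY KEY, subject TEXT, raw BLOB)"
)
```

Only the subject column is ever treated as text here; the message itself
goes in and comes out as the same bytes, whatever mix of encodings its
parts happen to use.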

This is all great, and I did quite quickly realize that the best approach
was to store the mails in their network byte-stream format as bytes. That
approach was ruled out in my own case by PostgreSQL's execrable
BLOB-handling capabilities. I took a look at the escaping they required,
snorted with derision and gave it up as a bad job.

PostgreSQL strongly encourages you to store text as encoded columns.
Because an email has no single encoding, this turns out to be a most
inconvenient storage type for it. Sadly, BLOBs are such a pain in
PostgreSQL that it's easier to store the messages in external files and
just use the relational database to index those files to retrieve content,
so that's what I ended up doing.

regards
 Steve
-- 
Steve Holden        +1 571 484 6266   +1 800 494 3119
Holden Web LLC                 http://www.holdenweb.com/
Watch PyCon on video now!          http://pycon.blip.tv/