On Oct 6, 2009, at 10:18 AM, Stephen J. Turnbull wrote:

In the following I use Python 3 terminology: strings are Python
Unicode objects, and bytes are Python bytes objects.

Exactly.  8-bit strings are dead to us.

for the internal form makes it quick and easy to produce a complete
message when called for,

Only for certain kinds of messages, such as automated forwards and
signed MIME parts, and cron's messages.  For those, there are great
advantages to spewing things verbatim as you got them off the wire or
the disk.  But even there, as long as we use the natural embedding of
bytes in Unicode (i.e., interpret bytes as ISO 8859-1) it's easy and not
particularly inefficient to use strings.
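
To make the "natural embedding" concrete, here is a minimal sketch, assuming
Python 3 and its standard "latin-1" codec (an alias for ISO 8859-1).  The
sample message fragment is made up for illustration.

```python
# Under ISO 8859-1 (Latin-1), every byte value 0..255 maps to the code
# point with the same number, so bytes -> str -> bytes is lossless.
raw = bytes(range(256))
text = raw.decode("latin-1")
assert [ord(ch) for ch in text] == list(range(256))
assert text.encode("latin-1") == raw

# A wire-format fragment survives the round trip unchanged, whatever
# charset its payload was really written in (here, UTF-8 bytes).
wire = b"Subject: caf\xc3\xa9\r\n\r\nbody\r\n"
embedded = wire.decode("latin-1")      # str, one character per byte
restored = embedded.encode("latin-1")  # identical to the original bytes
```

Note that the embedded string is only a carrier for the bytes; it is not the
correctly decoded text, so it still needs a real charset conversion before
being shown to a user.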

For anything else, storing in wire format is going to require checking
format (of the stored data if the format is variable, and always of
the requesting API) on all attribute accesses, and conversions on
many, even most attribute accesses.

I think that's going to be the case either way. Some applications are going to want bytes, others strings, so there need to be APIs for both.
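
As an illustration of what "APIs for both" can look like, here is a small sketch using the entry points that later Python 3 releases of the email package provide (message_from_bytes, message_from_string, as_bytes, as_string). It is not a statement about the design under discussion here, and the addresses are placeholders.

```python
import email

# Hypothetical wire-format input, as an MTA or spool file might supply it.
raw = b"From: a@example.com\r\nSubject: hello\r\n\r\nbody text\r\n"

# The same message can be parsed from bytes or from a string.
msg_from_bytes = email.message_from_bytes(raw)
msg_from_str = email.message_from_string(raw.decode("ascii"))

# Header access returns strings in both cases.
subject = msg_from_bytes["Subject"]

# Serialization is likewise available in both forms.
as_text = msg_from_str.as_string()    # str
as_wire = msg_from_bytes.as_bytes()   # bytes
```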

So the question to me is "what are the primary use cases for the email
module, and how do they affect the choice of internal representation?"
I can't claim special expertise on "how", I'll leave that up to
Barry.  Here are some use cases I can think of.

1.  Debugging programs using the email module.  Maybe that's a +1 for
   internally storing textual data in string form.

2.  MUA #1: Composition.  Input will be strings and multimedia file
   names, output will be bytes.  Will attributes of message objects
   be manipulated?  Not in a conventional MUA, but an email-based MUA
   might find uses for that.

3.  MUA #2: Reading.  Input will often be bytes (spool files, IMAP
   data).  Could be strings, though, depending on the internal format
   of folders.  Output will be strings and multimedia objects.  Lots
   of string processing, especially generating folder directory
   displays from message headers.

4.  Mailing list processor.  Message input will be bytes.
   Configuration input, including heading and footer texts that may
   be added, is likely to be strings.  Header manipulation (adding
   topics, sequence numbers, RFC 2369 headers) is most conveniently
   done with strings.  Output will be bytes.

5.  Mailing list archiver.  Input will be bytes or message objects,
   output will be strings (typically HTML documents or XML
   fragments).

6.  Spam/virus detection.  Input may be bytes or message objects.
   Lots of internal string processing; in most cases the text/* parts
   need to be converted to strings before grepping; in some cases
   even images or executables may be reconstituted to look for
   malware signatures.  Output may be a flag or signal, or the
   message itself may be edited (typically to provide headers
   recording degree of spamminess, trace headers, maybe a body
   heading; in some cases, a new message may be generated with the
   suspected spam as a message/rfc822 MIME body part).
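
The mailing list processor case (bytes in, string manipulation in the
middle, bytes out) can be sketched roughly as follows.  This assumes the
policy-based API of later Python 3 releases; the function name, the list
address, and the subject-prefix format are hypothetical and only there for
illustration.

```python
import email
from email import policy


def process_for_list(raw: bytes, list_name: str, seq: int, footer: str) -> bytes:
    """Hypothetical list-processor step: bytes in, bytes out."""
    # Parse wire-format bytes into a message object.
    msg = email.message_from_bytes(raw, policy=policy.SMTP)

    # Header manipulation is done entirely with strings.
    subject = msg["Subject"] or ""
    del msg["Subject"]  # Subject may appear only once under this policy
    msg["Subject"] = f"[{list_name} {seq}] {subject}"
    # RFC 2369 header; the address is a made-up example.
    msg["List-Post"] = f"<mailto:{list_name}@example.org>"

    # Append the configured footer to simple text/plain messages only.
    if not msg.is_multipart() and msg.get_content_type() == "text/plain":
        body = msg.get_content()
        msg.set_content(body + "\n-- \n" + footer + "\n")

    # Serialize back to wire format.
    return msg.as_bytes()


out = process_for_list(
    b"From: a@example.com\r\nSubject: hi\r\n\r\nhello\r\n",
    "demo", 7, "footer text",
)
```

A real processor would also have to deal with multipart and non-text
bodies, which this sketch deliberately leaves untouched.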

I think this is a very good list. The key thing from an application's point of view is that sometimes messages are parsed and sometimes they are crafted. When parsed, the raw input can come from a completely unknown and untrusted source such as the puking mouth of an MTA. Other times it comes from a big blob of string in a doctest. When crafted, it's almost always a program building up a message tree from scratch, or possibly the manipulation of an existing message (e.g. MIME filter).
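
For the crafting side, a minimal sketch of a program building up a message tree from scratch might look like the following. It assumes the EmailMessage class from later Python 3 releases; the addresses, subject, and attachment are invented for the example.

```python
from email.message import EmailMessage

# Build a message from scratch: headers and text body are strings,
# the attachment payload is bytes.
msg = EmailMessage()
msg["From"] = "sender@example.com"
msg["To"] = "list@example.org"
msg["Subject"] = "Crafted from scratch"
msg.set_content("Plain text body.\n")
msg.add_attachment(
    b"\x00\x01\x02",
    maintype="application",
    subtype="octet-stream",
    filename="blob.bin",
)

# Adding the attachment turns the message into a multipart tree.
part_types = [part.get_content_type() for part in msg.walk()]

# The crafted tree is flattened to wire format only at the end.
wire = msg.as_bytes()
```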

-Barry
