Re: [Email-SIG] fixing the current email module

Glenn Linderman Tue, 06 Oct 2009 12:16:05 -0700

On approximately 10/6/2009 7:18 AM, came the following characters from the keyboard of Stephen J. Turnbull:

In the following I use Python 3 terminology: strings are Python
Unicode objects, and bytes are Python bytes objects.


Glenn Linderman writes:

> Email messages are bytes. Usually restricted to bytes in the range
> 32-127, but sometimes permitted to be 0-255 (8bit encoding).


This is irrelevant to our internal representation.  It is both trivial
and efficient to convert the wire format (bytes) to a string
internally (at least for email messages up to say 5MB).

Which internal representation makes the most sense depends on what we
are going to do with that internal representation.  At this point I'm
not sure that strings are better than bytes, but I'm quite sure that
I've seen no convincing argument that bytes are TOOWTDI.

Nor is it at all obvious to me that messages should be stored in wire format.

Yes, I interpreted, possibly misinterpreted, Barry's comment about storing things as bytes, as that he was figuring to store them in wire format.

 > Using any other format than email format, means knowing how to
 > translate that format to/from email format, and to/from API
 > format... this means coding two translation routines instead of
 > one.

That sounds reasonable, but it's a false economy.


And this was actually the point I was trying to make.

The formats you're
talking about here are the transfer encodings, and we need to be able
to decode all of them, and produce all of them.  Internally, they can
be represented by a single format, so you need internal-to-transfer
and transfer-to-internal for about six of them (7bit, 8bit, binary ==
Python bytes, BASE64, quoted-printable, Python string)

Not all formats apply to all MIME types, but I think you've enumerated the list.
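
As an illustration of the "internal-to-transfer and transfer-to-internal" pairs discussed above, here is a minimal sketch using only the standard base64 and quopri modules. The function names decode_transfer and encode_transfer are hypothetical, not part of the email package, and the sketch assumes the internal form is raw bytes.

```python
import base64
import quopri

# Identity encodings: the wire bytes already are the content bytes.
IDENTITY = ("7bit", "8bit", "binary")


def decode_transfer(payload: bytes, cte: str) -> bytes:
    """Hypothetical helper: wire-format body -> raw bytes, chosen by the
    Content-Transfer-Encoding value."""
    cte = cte.lower()
    if cte == "base64":
        return base64.b64decode(payload)
    if cte == "quoted-printable":
        return quopri.decodestring(payload)
    if cte in IDENTITY:
        return payload
    raise ValueError("unknown transfer encoding: %r" % cte)


def encode_transfer(raw: bytes, cte: str) -> bytes:
    """Hypothetical helper: raw bytes -> wire-format body."""
    cte = cte.lower()
    if cte == "base64":
        return base64.encodebytes(raw)
    if cte == "quoted-printable":
        return quopri.encodestring(raw)
    if cte in IDENTITY:
        return raw
    raise ValueError("unknown transfer encoding: %r" % cte)
```

With one internal form, each encoding needs only this pair of directions, and any encoding can be converted to any other by going through the internal form.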

As for runtime economy, if conversion is done once at parse time and
once at generate time it is not a big burden, not as compared to the
overhead of the Python language itself.

I would tend to agree with that, except that if something is received/provided in a particular format, it might want to stay in that format until such time it is needed in a different format... and then the appropriate set of conversions (current format => internal format => needed format) applied as needed, avoiding all conversions when it is already in the needed format.

 > The choice of email RFC byte formats

By "byte format", do you mean "wire format"?


Sure, RFC byte formats == wire format.

 > for the internal form makes it quick and easy to produce a complete
 > message when called for,

Only for certain kinds of messages, such as automated forwards and
signed MIME parts, and cron's messages.  For those, there are great
advantages to spewing things verbatim as you got them off the wire or
the disk.  But even there, as long as we use the natural embedding of
bytes in Unicode (ie, interpret bytes as ISO 8859/1) it's easy and not
particularly inefficient to use strings.

Two conversions are slower than none, and use 2-4 times the space in string format.
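
For reference, the "natural embedding of bytes in Unicode" mentioned above can be sketched as a Latin-1 round trip: each byte value maps to the code point with the same number, so the conversion loses nothing. The helper names to_text and to_wire are hypothetical. The sketch shows only that the round trip is exact; it does not measure the time or space cost under debate, which depends on the interpreter's string representation.

```python
def to_text(wire: bytes) -> str:
    """Hypothetical helper: read each byte as the code point of the
    same value (ISO 8859-1, spelled 'latin-1' in Python)."""
    return wire.decode("latin-1")


def to_wire(text: str) -> bytes:
    """Inverse of to_text. Raises UnicodeEncodeError if the string
    contains any code point above 255."""
    return text.encode("latin-1")


# Every possible byte value survives the round trip unchanged.
all_bytes = bytes(range(256))
round_tripped = to_wire(to_text(all_bytes))
```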

For anything else, storing in wire format is going to require checking
format (of the stored data if the format is variable, and always of
the requesting API) on all attribute accesses, and conversions on
many, even most attribute accesses.

One has to write the conversion code anyway; it is just a matter of where it is called. Once converted, meta data could be retained in its natural format.

> One problem with storing messages in bytes format: it seems to me that
> the choice of which of several legal email bytes formats


None of them are very happy.  The email module needs to be able to
both read and produce all of 7bit, 8bit, and binary, and they are in
fact pretty well trivial to do.

So the question to me is "what are the primary use cases for the email
module, and how do they affect the choice of internal representation?"
I can't claim special expertise on "how", I'll leave that up to
Barry.  Here are some use cases I can think of.


Yes this is a good question.

1.  Debugging programs using the email module.  Maybe that's a +1 for
    internally storing textual data in string form.

2.  MUA #1: Composition.  Input will be strings and multimedia file
    names, output will be bytes.  Will attributes of message objects
    be manipulated?  Not in a conventional MUA, but an email-based MUA
    might find uses for that.

I'm not sure what an email-based MUA is.... seems to me even a conventional MUA is "email-based"???

3.  MUA #2: Reading.  Input will often be bytes (spool files, IMAP
    data).  Could be strings, though, depending on the internal format
    of folders.  Output will be strings and multimedia objects.  Lots
    of string processing, especially generating folder directory
    displays from message headers.

4.  Mailing list processor.  Message input will be bytes.
    Configuration input, including heading and footer texts that may
    be added are likely to be strings.  Header manipulation (adding
    topics, sequence numbers, RFC 2369 headers) most conveniently done
    with strings.  Output will be bytes.

But the bulk of the message parts, received in wire format, may not need to be altered to be sent along in the same wire format. Headers must be manipulated somehow, I'd think it would be convenient as strings too. Heading and footing texts are configured boilerplate, and could be cached in a variety of formats to avoid the need to convert them for each message, and could then be obtained from the cache in the appropriate format for this particular message, and prepended or appended as appropriate.

5.  Mailing list archiver.  Input will be bytes or message objects,
    output will be strings (typically HTML documents or XML
    fragments).

An archiver could archive wire format, and do the conversions to *ML on the fly for those messages that might be accessed that way. Depends on the expectation of the usage of the archiver... to retrieve the archived messages via email, wire format could be extremely efficient; to retrieve via HTTP, one should note that there is very little difference between .eml format (another name for wire format) and .mhtml format (which is a format IE and Opera will display natively, support in other browsers varies, mostly via addons and conversion utilities). So I'm not at all sure that this use case requires string output, although some implementations might prefer it.

6.  Spam/virus detection.  Input may be bytes or message objects.
    Lots of internal string processing; in most cases the text/* parts
    need to be converted to strings before grepping; in some cases
    even images or executables may be reconstituted to look for
    malware signatures.  Output may be a flag or signal, or the
    message itself may be edited (typically to provide headers
    recording degree of spamminess, trace headers, maybe a body
    heading; in some cases, a new message may be generated with the
    suspected spam as a message/rfc822 MIME body part).

So it seems to me that storing the data in the format provided, and converting it to native format when requested and caching that result, and then when generating wire format, if the needed format was not provided or cached, then converting as necessary, would be optimal to minimize conversion (time) costs. This technique would also maximally preserve the original format for use cases 3 and 5, which, for use case 3, at least, seems to be important to this list from past discussion. To minimize memory (space) costs, the caching could be avoided (causing reconversion costs), or, at the expense of not preserving the original format, once converted, retain only the native format of the item (which is generally the smallest, for binary objects, and which is most easily manipulated, but not necessarily smallest, for text objects).
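
The store-as-provided, convert-on-request, cache-the-result idea can be sketched roughly as follows. LazyPart is a hypothetical class, not anything in the email package, and to stay short it assumes the wire form is always base64. The conversions counter exists only to make the saved work visible.

```python
import base64


class LazyPart:
    """Hypothetical sketch: keep a part in whichever form it arrived in
    and convert only when the other form is first requested."""

    def __init__(self, wire=None, native=None):
        if wire is None and native is None:
            raise ValueError("need wire or native data")
        self._wire = wire        # base64 wire bytes, if provided or cached
        self._native = native    # decoded bytes, if provided or cached
        self.conversions = 0     # how many conversions were actually run

    def native(self) -> bytes:
        if self._native is None:
            self._native = base64.b64decode(self._wire)
            self.conversions += 1
        return self._native

    def wire(self) -> bytes:
        if self._wire is None:
            self._wire = base64.encodebytes(self._native)
            self.conversions += 1
        return self._wire
```

A part that is received in wire format and sent on unchanged never runs a conversion, and the original bytes are returned exactly as received. Dropping one of the two cached forms after conversion would trade that preservation for memory, as described above.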


So I'd design the internal format with meta data like

MIMEpart
   formatFlag
   metaData
   7bitData
   8bitData
   binaryData
   nativeText
   nativeBLOB

where the metaData would consist of a variety of pertinent items, obtained by decoding provided wireData or supplied along with provided nativeData.

Generate could use 7bitData, 8bitData, or binaryData directly if it exists, or cache it there if it didn't already exist.

binaryData would differ from nativeBLOB only by containing the appropriate MIMEheaders... perhaps as a space optimization, it would contain only the appropriate MIMEheaders, with the binaryData being placed in nativeBLOB directly (since this is not a costly conversion, just a choice of where to store the bytes).
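
One possible rendering of the MIMEpart layout above as a Python dataclass, offered only as a sketch. The field names are adapted (7bitData is not a legal Python identifier), the generate_8bit method and the "charset" key in meta_data are assumptions, and the method handles only the body text, ignoring MIME headers and line-ending normalization.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MIMEPart:
    format_flag: str                      # which slot was originally provided
    meta_data: dict = field(default_factory=dict)
    data_7bit: Optional[bytes] = None     # "7bitData" in the outline above
    data_8bit: Optional[bytes] = None     # "8bitData"
    binary_data: Optional[bytes] = None
    native_text: Optional[str] = None
    native_blob: Optional[bytes] = None

    def generate_8bit(self) -> bytes:
        """Use the 8bit form directly if it exists; otherwise build it
        from native_text and cache it in the data_8bit slot."""
        if self.data_8bit is None:
            if self.native_text is None:
                raise ValueError("no 8bit data and no native text")
            charset = self.meta_data.get("charset", "utf-8")  # assumed key
            self.data_8bit = self.native_text.encode(charset)
        return self.data_8bit
```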

It could also be possible that a complete, provided, wire format message would be retained as a single BLOB, and the appropriate format data items simply be offsets and lengths within that BLOB, although with cached metaData.
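
The single-BLOB variant might look something like this sketch, where a part is recorded as an offset and length and read back through a memoryview so that no bytes are copied. Span, split_header_body, and view are hypothetical names, and the split handles only a simple non-multipart message with CRLF line endings.

```python
from typing import NamedTuple, Tuple


class Span(NamedTuple):
    offset: int
    length: int


def split_header_body(blob: bytes) -> Tuple[Span, Span]:
    """Locate the header block and the body inside a wire-format message.

    The first empty line (CRLF CRLF) separates headers from body. If no
    separator is found, the whole blob is treated as headers.
    """
    sep = blob.find(b"\r\n\r\n")
    if sep == -1:
        return Span(0, len(blob)), Span(len(blob), 0)
    body_start = sep + 4
    return Span(0, sep), Span(body_start, len(blob) - body_start)


def view(blob: bytes, span: Span) -> memoryview:
    """Return a zero-copy window onto the span within the blob."""
    return memoryview(blob)[span.offset:span.offset + span.length]
```

Decoded metaData would still be cached separately; only the undecoded wire bytes are shared this way.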

Of course, there is already a design within the existing code, and the cost of wholesale redesign may be more than can be afforded.


--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
