On approximately 10/6/2009 7:18 AM, came the following characters from the keyboard of Stephen J. Turnbull:
In the following I use Python 3 terminology: strings are Python
Unicode objects, and bytes are Python bytes objects.

Glenn Linderman writes:

> Email messages are bytes. Usually restricted to bytes in the range
> 32-127, but sometimes permitted to be 0-255 (8bit encoding).

This is irrelevant to our internal representation.  It is both trivial
and efficient to convert the wire format (bytes) to a string
internally (at least for email messages up to say 5MB).

Which internal representation makes the most sense depends on what we
are going to do with that internal representation.  At this point I'm
not sure that strings are better than bytes, but I'm quite sure that
I've seen no convincing argument that bytes are TOOWTDI.

Nor is it at all obvious to me that messages should be stored in wire format.

Yes, I interpreted, and possibly misinterpreted, Barry's comment about storing things as bytes to mean that he intended to store them in wire format.


 > Using any other format than email format, means knowing how to
 > translate that format to/from email format, and to/from API
 > format... this means coding two translation routines instead of
 > one.

That sounds reasonable, but it's a false economy.

And this was actually the point I was trying to make.


The formats you're
talking about here are the transfer encodings, and we need to be able
to decode all of them, and produce all of them.  Internally, they can
be represented by a single format, so you need internal-to-transfer
and transfer-to-internal for about six of them (7bit, 8bit, binary ==
Python bytes, BASE64, quoted-printable, Python string)

Not all formats apply to all MIME types, but I think you've enumerated the list.
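To make the "internal-to-transfer and transfer-to-internal" pairing concrete, here is a minimal sketch using only the standard library's base64 and quopri modules. The names CODECS, transfer_to_internal and internal_to_transfer are hypothetical, chosen for illustration; nothing here is the email module's actual API, and the internal form is assumed to be raw bytes.

```python
import base64
import quopri


def _identity(data: bytes) -> bytes:
    # 7bit, 8bit and binary need no transformation of the payload itself.
    return data


# Hypothetical table: transfer encoding name -> (decode, encode) pair.
CODECS = {
    "7bit": (_identity, _identity),
    "8bit": (_identity, _identity),
    "binary": (_identity, _identity),
    "base64": (base64.b64decode, base64.encodebytes),
    "quoted-printable": (quopri.decodestring, quopri.encodestring),
}


def transfer_to_internal(wire: bytes, cte: str) -> bytes:
    """Decode a wire-format payload into raw bytes."""
    decode, _ = CODECS[cte.lower()]
    return decode(wire)


def internal_to_transfer(raw: bytes, cte: str) -> bytes:
    """Encode raw bytes into the requested transfer encoding."""
    _, encode = CODECS[cte.lower()]
    return encode(raw)


# Each encoding round-trips through the single internal form.
payload = "na\u00efve caf\u00e9\n".encode("utf-8")
for cte in ("base64", "quoted-printable", "8bit"):
    wire = internal_to_transfer(payload, cte)
    assert transfer_to_internal(wire, cte) == payload
```

With one internal form, adding a transfer encoding means adding one pair to the table rather than a converter for every other format.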

As for runtime economy, if conversion is done once at parse time and
once at generate time it is not a big burden, not as compared to the
overhead of the Python language itself.

I would tend to agree with that, except that if something is received or provided in a particular format, it might be better left in that format until it is needed in a different one. The appropriate conversions (current format => internal format => needed format) would then be applied only as needed, and avoided entirely when the data is already in the needed format.

 > The choice of email RFC byte formats

By "byte format", do you mean "wire format"?

Sure, RFC byte formats == wire format.

 > for the internal form makes it quick and easy to produce a complete
 > message when called for,

Only for certain kinds of messages, such as automated forwards and
signed MIME parts, and cron's messages.  For those, there are great
advantages to spewing things verbatim as you got them off the wire or
the disk.  But even there, as long as we use the natural embedding of
bytes in Unicode (ie, interpret bytes as ISO 8859/1) it's easy and not
particularly inefficient to use strings.
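The "natural embedding of bytes in Unicode" mentioned above can be illustrated with Python's latin-1 codec, which maps each byte value 0-255 to the code point with the same number. This is a minimal sketch; the header content is made up for illustration.

```python
# Every possible byte value survives a latin-1 decode/encode round trip.
wire = bytes(range(256))
text = wire.decode("latin-1")
assert [ord(c) for c in text] == list(range(256))
assert text.encode("latin-1") == wire

# String operations can then be applied to wire data and the result
# written back out as bytes without loss (hypothetical header line).
raw = b"Subject: caf\xe9\r\n"
as_str = raw.decode("latin-1")
edited = as_str.replace("Subject:", "X-Original-Subject:")
out = edited.encode("latin-1")
assert out == b"X-Original-Subject: caf\xe9\r\n"
```

Note that the resulting string holds byte values, not decoded characters: the \xe9 above is only "é" if the original charset happened to be ISO 8859-1.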

Two conversions are slower than none, and string format uses 2-4 times the space.

For anything else, storing in wire format is going to require checking
format (of the stored data if the format is variable, and always of
the requesting API) on all attribute accesses, and conversions on
many, even most attribute accesses.

One has to write the conversion code anyway; it is just a matter of where it is called. Once converted, meta data could be retained in its natural format.

> One problem with storing messages in bytes format: it seems to me that
> the choice of which of several legal email bytes formats

None of them are very happy.  The email module needs to be able to
both read and produce all of 7bit, 8bit, and binary, and they are in
fact pretty well trivial to do.

So the question to me is "what are the primary use cases for the email
module, and how do they affect the choice of internal representation?"
I can't claim special expertise on "how", I'll leave that up to
Barry.  Here are some use cases I can think of.

Yes this is a good question.


1.  Debugging programs using the email module.  Maybe that's a +1 for
    internally storing textual data in string form.

2.  MUA #1: Composition.  Input will be strings and multimedia file
    names, output will be bytes.  Will attributes of message objects
    be manipulated?  Not in a conventional MUA, but an email-based MUA
    might find uses for that.

I'm not sure what an email-based MUA is.... seems to me even a conventional MUA is "email-based"???


3.  MUA #2: Reading.  Input will often be bytes (spool files, IMAP
    data).  Could be strings, though, depending on the internal format
    of folders.  Output will be strings and multimedia objects.  Lots
    of string processing, especially generating folder directory
    displays from message headers.

4.  Mailing list processor.  Message input will be bytes.
    Configuration input, including heading and footer texts that may
    be added are likely to be strings.  Header manipulation (adding
    topics, sequence numbers, RFC 2369 headers) most conveniently done
    with strings.  Output will be bytes.

But the bulk of the message parts, received in wire format, may not need to be altered to be sent along in the same wire format. Headers must be manipulated somehow, and I agree that strings would be the convenient form for that. Heading and footing texts are configured boilerplate, and could be cached in a variety of formats to avoid the need to convert them for each message. They could then be obtained from the cache in the appropriate format for this particular message, and prepended or appended as appropriate.

5.  Mailing list archiver.  Input will be bytes or message objects,
    output will be strings (typically HTML documents or XML
    fragments).

An archiver could archive wire format, and do the conversions to *ML on the fly for those messages that might be accessed that way. It depends on the expected usage of the archiver. To retrieve the archived messages via email, wire format could be extremely efficient. To retrieve via HTTP, one should note that there is very little difference between .eml format (another name for wire format) and .mhtml format (which IE and Opera will display natively; support in other browsers varies, mostly via addons and conversion utilities). So I'm not at all sure that this use case requires string output, although some implementations might prefer it.

6.  Spam/virus detection.  Input may be bytes or message objects.
    Lots of internal string processing; in most cases the text/* parts
    need to be converted to strings before grepping; in some cases
    even images or executables may be reconstituted to look for
    malware signatures.  Output may be a flag or signal, or the
    message itself may be edited (typically to provide headers
    recording degree of spamminess, trace headers, maybe a body
    heading; in some cases, a new message may be generated with the
    suspected spam as a message/rfc822 MIME body part).


So it seems to me that the way to minimize conversion (time) costs is to store the data in the format provided, convert it to native format when requested and cache that result, and, when generating wire format, convert only if the needed format was neither provided nor cached. This technique would also maximally preserve the original format for use cases 3 and 5, which, for use case 3 at least, seems to be important to this list from past discussion.

To minimize memory (space) costs, the caching could be avoided (causing reconversion costs), or, at the expense of not preserving the original format, only the native format of the item could be retained once converted (it is generally the smallest for binary objects, and the most easily manipulated, but not necessarily the smallest, for text objects).

So I'd design the internal format with meta data like

MIMEpart
   formatFlag
   metaData
   7bitData
   8bitData
   binaryData
   nativeText
   nativeBLOB

where the metaData would consist of a variety of pertinent items, obtained by decoding provided wireData or supplied along with provided nativeData.

Generate could use 7bitData, 8bitData, or binaryData directly if it exists, or cache it there if it didn't already exist.
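A rough Python sketch of the structure above, under several assumptions: the field names are adapted to legal identifiers (7bitData becomes data_7bit, since identifiers cannot start with a digit), the 7bit form is assumed to be a base64 body without headers, and the two methods are hypothetical helpers showing convert-on-demand with caching. It is an illustration of the idea, not a proposed implementation.

```python
import base64
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MIMEPart:
    format_flag: str                        # which form was originally provided
    meta_data: dict = field(default_factory=dict)
    data_7bit: Optional[bytes] = None       # "7bitData" in the outline above
    data_8bit: Optional[bytes] = None
    binary_data: Optional[bytes] = None
    native_text: Optional[str] = None
    native_blob: Optional[bytes] = None

    def get_native_blob(self) -> bytes:
        """Return the decoded bytes, converting and caching on first use."""
        if self.native_blob is None:
            if self.data_7bit is None:
                raise ValueError("no data to convert")
            # Assumption: data_7bit holds a base64-encoded body only.
            self.native_blob = base64.b64decode(self.data_7bit)
        return self.native_blob

    def generate_7bit(self) -> bytes:
        """Return 7bit wire data, reusing it verbatim if already present."""
        if self.data_7bit is None:
            if self.native_blob is None:
                raise ValueError("no data to convert")
            self.data_7bit = base64.encodebytes(self.native_blob)
        return self.data_7bit


# A part received off the wire is generated with no conversion at all.
received = MIMEPart(format_flag="7bit", data_7bit=b"aGVsbG8=\n")
assert received.generate_7bit() is received.data_7bit
assert received.get_native_blob() == b"hello"

# A part supplied through the API is converted once, then cached.
composed = MIMEPart(format_flag="native", native_blob=b"hello")
assert composed.generate_7bit() == b"aGVsbG8=\n"
```

Dropping the cache (resetting a field to None after use) would trade reconversion time for memory, as discussed above.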

binaryData would differ from nativeBLOB only by containing the appropriate MIMEheaders... perhaps as a space optimization, it would contain only the appropriate MIMEheaders, with the binaryData being placed in nativeBLOB directly (since this is not a costly conversion, just a choice of where to store the bytes).

It could also be possible that a complete, provided, wire format message would be retained as a single BLOB, and the appropriate format data items simply be offsets and lengths within that BLOB, although with cached metaData.
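The single-BLOB variant might look something like the following sketch, which keeps the whole message as one bytes object and records (offset, length) pairs into it. The message content and the names header_span, body_span and slice_of are made up for illustration; memoryview is used so that slicing does not copy the data.

```python
# Hypothetical complete wire-format message, retained as a single BLOB.
message = (b"Subject: test\r\n"
           b"Content-Type: text/plain\r\n"
           b"\r\n"
           b"Hello, world.\r\n")

view = memoryview(message)

# Locate the blank line separating headers from body.
split = message.index(b"\r\n\r\n")

# Each item is just an (offset, length) pair into the BLOB.
header_span = (0, split + 2)                        # includes the last header's CRLF
body_span = (split + 4, len(message) - (split + 4))


def slice_of(buf: memoryview, span: tuple) -> memoryview:
    """Return a zero-copy view of the span within the BLOB."""
    offset, length = span
    return buf[offset:offset + length]


assert bytes(slice_of(view, header_span)) == (
    b"Subject: test\r\nContent-Type: text/plain\r\n")
assert bytes(slice_of(view, body_span)) == b"Hello, world.\r\n"
```

Only the spans and any cached metaData need to be stored per part; the bytes themselves are never duplicated until a conversion is actually requested.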

Of course, there is already a design within the existing code, and the cost of wholesale redesign may be more than can be afforded.

--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
