Re: [Email-SIG] Thoughts on the general API, and the Header API.

Glenn Linderman Thu, 28 Jan 2010 18:20:36 -0800

On approximately 1/25/2010 8:10 PM, came the following characters from the keyboard of Glenn Linderman:

That's true.  The Bytes and String versions of binary MIME parts,
which are likely to be the large ones, will probably have a common
representation for the payload, and could potentially point to the same
object.  That breaking of the expectation that 'encode' and 'decode'
return new objects (in analogy to how encode and decode of strings/bytes
works) might not be a good thing, though.
Well, one generator could provide the expectation that everything is new; another could provide different expectations. The differences between them, and the tradeoffs, would be documented, of course, were both provided. I'm not convinced that treating headers and data exactly the same at all times is a good thing... a convenient option at times, perhaps, but I can see it as a serious inefficiency in many use cases involving large data.
This deserves a bit more thought/analysis/discussion, perhaps. More than I have time for tonight, but I may reply again, perhaps after others have responded, if they do.

I guess no one else is responding here at the moment. Read the ideas below, and then afterward, consider building the APIs you've suggested on top of them. And then, with the full knowledge that the messages may be either in fast or slow storage, I think that you'll agree that converting the whole tree in one swoop isn't always appropriate... all headers, probably could be. Data, because of its size, should probably be done on demand.

In earlier discussions about the registry, there was the idea of having a registry for transport encoding handling, and a registry for MIME encoding handling. There were also vague comments about doing an external storage protocol "somehow", but it was a vague concept to be defined later, or at least I don't recall any definitions.

Given a raw bytes representation of an incoming email, mail servers need to choose how to handle it... this may need to be a dynamic choice based on current server load, as well as the obvious static server resources, as well as configured limits.

Unfortunately, the SMTP protocol does not require predeclaration of the size of the incoming DATA part, so servers cannot enforce size limits until they are exceeded. So as the data streams in, a dynamic adjustment to the handling strategy might be appropriate. Gateways may choose to route messages, and stall the input until the output channel is ready to receive it, and basically "pass through" the data, with limited need to buffer messages on disk... unless the output channel doesn't respond... then they might reject the message. An SMTP server should be willing to act as a store-and-forward server, and also must do individual delivery of messages to each RCPT (or at least one per destination domain), so must have a way of dealing with large messages, probably via disk buffering. The case of disk buffering and retrying generally means that the whole message, not just the large data parts, must be stored on disk, so the external storage protocol should be able to deal with that case.

The minimal external storage format capability is to store the received bytestream to disk, associate it with the envelope information, and be able to retrieve it in whole later. This would require having the whole thing in RAM at those two points in time, however, and doesn't solve the real problem. Incremental writing and reading to the external storage would be much more useful. Even more useful would be "partially parsed" seek points.
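
A minimal sketch of the incremental writing and reading described above, assuming Python's standard tempfile.SpooledTemporaryFile as the storage (it keeps data in memory until a size threshold is passed, then moves it to disk). The function names spool_incoming and read_in_buffers are hypothetical, chosen only for illustration; this is not a proposed email package API.

```python
import tempfile


def spool_incoming(chunks, max_in_memory=64 * 1024):
    """Write an incoming byte stream one chunk at a time.

    The spool stays in RAM until more than max_in_memory bytes have
    been written, after which tempfile moves it to a file on disk.
    """
    spool = tempfile.SpooledTemporaryFile(max_size=max_in_memory)
    for chunk in chunks:
        spool.write(chunk)
    spool.seek(0)  # ready for reading from the start
    return spool


def read_in_buffers(spool, bufsize=8192):
    """Yield the stored message back in buffers of at most bufsize bytes."""
    while True:
        buf = spool.read(bufsize)
        if not buf:
            break
        yield buf
```

At no point does either function need the whole message in RAM at once, provided the caller consumes the buffers as they are produced.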

An external storage system that provides "partially parsed" information could include:

1) envelope information. This section is useful to SMTP servers, but not other email tools, so should be optional. This could be a copy of the received RCPT command texts, complete with CRLF endings.

2) header information. This would be everything between DATA and the first CRLF CRLF sequence.

3) data. Pre-MIME this would simply be the rest of the message, but post-MIME it would be usefully more complex. If MIME headers can be observed and parsed as the data passes through, then additional metadata could be saved that could enhance performance of the later processing steps. Such additional metadata could include the beginning of each MIME part, the end of the headers for that part, and the end of the data for that part.
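
The per-part metadata in item 3 could be sketched roughly as follows. This is an illustration only: PartOffsets and index_parts are hypothetical names, and the sketch assumes a well-formed, single-level multipart message with CRLF line endings, a known boundary, and the whole message available as one bytes object (a real implementation would compute the same offsets while the data streams past).

```python
from dataclasses import dataclass


@dataclass
class PartOffsets:
    start: int        # first byte of the part's headers
    headers_end: int  # first byte of the part's data (after CRLF CRLF)
    end: int          # one past the last byte of the part's data


def index_parts(raw: bytes, boundary: bytes):
    """Record seek points for each part of a single-level multipart message."""
    delim = b"\r\n--" + boundary
    parts = []
    pos = raw.find(delim)
    while pos != -1:
        after = pos + len(delim)
        if raw.startswith(b"--", after):
            break  # closing delimiter: no more parts
        line_end = raw.find(b"\r\n", after)
        if line_end == -1:
            break  # truncated message
        start = line_end + 2
        nxt = raw.find(delim, start)
        if nxt == -1:
            break  # no terminating delimiter for this part
        sep = raw.find(b"\r\n\r\n", start, nxt)
        # Assumption: if no blank line is found, treat the part as all data.
        headers_end = sep + 4 if sep != -1 else start
        parts.append(PartOffsets(start, headers_end, nxt))
        pos = nxt
    return parts
```

With such an index saved next to the raw bytes, reading the headers of a part is a seek to start and a read of headers_end - start bytes; the data itself is touched only when requested.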

The result of saving that information would mean that minimal data (just headers) would need to be read in to create a tree representing the email; the rest could be left in external storage until it is accessed... and then obtained directly from there when needed, and converted to the form required by the request... either the whole part, or some piece in a buffer.

So there could be a variety of external storage systems... one that stores in memory, one that stores on disk per the ideas above, and a variety that retain some amount of cached information about the email, even though they store it all on disk. Sounds like this could be a plug-in, or an attribute of a message object creation.
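
One way the plug-in idea might look, as a rough sketch: a small abstract interface with an in-memory and an on-disk implementation. The class names (MessageStorage, MemoryStorage, DiskStorage) and the two-method interface are assumptions made for illustration, not anything that has been agreed on in this discussion.

```python
import abc
import os
import tempfile


class MessageStorage(abc.ABC):
    """Hypothetical plug-in interface for where message bytes live."""

    @abc.abstractmethod
    def append(self, chunk: bytes) -> None:
        """Add bytes to the end of the stored message."""

    @abc.abstractmethod
    def read(self, offset: int, size: int) -> bytes:
        """Return up to size bytes starting at offset."""


class MemoryStorage(MessageStorage):
    def __init__(self):
        self._buf = bytearray()

    def append(self, chunk):
        self._buf += chunk

    def read(self, offset, size):
        return bytes(self._buf[offset:offset + size])


class DiskStorage(MessageStorage):
    def __init__(self):
        self._file = tempfile.TemporaryFile()

    def append(self, chunk):
        self._file.seek(0, os.SEEK_END)  # always write at the end
        self._file.write(chunk)

    def read(self, offset, size):
        self._file.seek(offset)
        return self._file.read(size)
```

A message object could then be handed a storage instance at creation time, and code built on top would not need to know which kind it got.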

But to me, it sounds like the foundation upon which the whole email lib should be built, not something that is shoveled in later.

A further note about access to data parts... clearly "data for the whole MIME part" could be provided, but even for a single part that could be large. So access to smaller chunks might be desired.

The data access/conversion functions, therefore, should support a buffer-at-a-time access interface. Base64 supports random access easily, unless it contains characters not in the 64, that are to be ignored, that could throw off the size calculations. So maybe providing sequential buffer-at-a-time access with rewind is the best that can be done -- quoted-printable doesn't support random access very well, and neither would some sort of compression or encryption technique -- they usually like to start from the beginning -- and those are the sorts of things that I would consider likely to be standardized in the future, to reduce the size of the payload, and to increase the security of the payload.
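
A sketch of sequential buffer-at-a-time access with rewind for a base64 part. Base64Reader is a hypothetical class name, and the sketch assumes the source is a seekable binary file object positioned at the start of the part's encoded data and containing nothing after it. Characters outside the base64 alphabet (such as line endings) are dropped before decoding, which is exactly why decoded offsets cannot be computed from encoded offsets without reading.

```python
import base64
import re

# Anything that is not a base64 alphabet character or padding is ignored.
_NOT_B64 = re.compile(rb"[^A-Za-z0-9+/=]")


class Base64Reader:
    """Decode a base64 payload sequentially, one buffer at a time."""

    def __init__(self, source, bufsize=4096):
        self._source = source
        self._start = source.tell()  # remembered so rewind() can return here
        self._bufsize = bufsize
        self._carry = b""

    def rewind(self):
        """Go back to the beginning of the encoded data."""
        self._source.seek(self._start)
        self._carry = b""

    def __iter__(self):
        while True:
            raw = self._source.read(self._bufsize)
            if not raw:
                break
            data = self._carry + _NOT_B64.sub(b"", raw)
            # Decode only whole 4-character groups; keep the rest for later.
            usable = len(data) - len(data) % 4
            self._carry = data[usable:]
            if usable:
                yield base64.b64decode(data[:usable])
        # A leftover incomplete group means malformed input; this sketch
        # simply drops it.
        self._carry = b""
```

Usage would be a plain loop, `for buf in Base64Reader(f): ...`, with rewind() called before iterating a second time.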


--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
