Another thought occurred to me regarding this "Access API"... an IMAP
implementation could defer obtaining data parts from the server until
requested, under the covers of this same API. Of course, for devices
with limited resources, that would probably be the optimal approach, but
for devices with lots of resources, an IMAP implementation might also
want to offer other options.
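As a sketch of what that deferral could look like behind a uniform access API (the names here, LazyPart and fetch_fn, are hypothetical, not an existing interface):

```python
# Hypothetical sketch: a MIME part that defers fetching its body from the
# server (e.g. via an IMAP BODY[n] FETCH) until the payload is requested.

class LazyPart:
    """A MIME part whose body is fetched from the server on first access."""

    def __init__(self, headers, fetch_fn):
        self.headers = headers        # parsed eagerly; headers are small
        self._fetch_fn = fetch_fn     # callable that retrieves the body
        self._payload = None

    def get_payload(self):
        # Defer the (possibly expensive) fetch until actually requested,
        # then cache the result for subsequent accesses.
        if self._payload is None:
            self._payload = self._fetch_fn()
        return self._payload


calls = []

def fake_fetch():
    calls.append(1)
    return b"attachment bytes"

part = LazyPart({"Content-Type": "application/octet-stream"}, fake_fetch)
assert calls == []                       # nothing fetched yet
assert part.get_payload() == b"attachment bytes"
part.get_payload()
assert len(calls) == 1                   # fetched exactly once, then cached
```

A resource-rich implementation could swap in an eager fetch_fn, or one that prefetches in the background, without changing the access API the caller sees.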
On approximately 1/28/2010 6:20 PM, came the following characters from
the keyboard of Glenn Linderman:
On approximately 1/25/2010 8:10 PM, came the following characters from
the keyboard of Glenn Linderman:
That's true. The Bytes and String versions of binary MIME parts,
which are likely to be the large ones, will probably have a common
representation for the payload, and could potentially point to the same
object. Breaking the expectation that 'encode' and 'decode' return
new objects (in analogy to how encode and decode of strings/bytes
work) might not be a good thing, though.
Well, one generator could provide the expectation that everything is
new; another could provide different expectations. The differences
between them, and the tradeoffs would be documented, of course, were
both provided. I'm not convinced that treating headers and data
exactly the same at all times is a good thing... a convenient option
at times, perhaps, but I can see it as a serious inefficiency in many
use cases involving large data.
This deserves a bit more thought/analysis/discussion, perhaps. More
than I have time for tonight, but I may reply again, perhaps after
others have responded, if they do.
I guess no one else is responding here at the moment. Read the ideas
below, and then afterward, consider building the APIs you've suggested
on top of them. And then, with the full knowledge that the messages
may be either in fast or slow storage, I think that you'll agree that
converting the whole tree in one swoop isn't always appropriate... the
headers probably could all be converted eagerly; the data, because of
its size, should probably be converted on demand.
In earlier discussions about the registry, there was the idea of
having a registry for transport encoding handling, and a registry for
MIME encoding handling. There were also vague comments about doing an
external storage protocol "somehow", but it was a vague concept to be
defined later, or at least I don't recall any definitions.
Given a raw bytes representation of an incoming email, mail servers
need to choose how to handle it... this may need to be a dynamic
choice based on current server load, in addition to the obvious static
server resources and configured limits.
Unfortunately, the SMTP protocol does not require predeclaration of
the size of the incoming DATA part, so servers cannot enforce size
limits until they are exceeded. So as the data streams in, a dynamic
adjustment to the handling strategy might be appropriate. Gateways
may choose to route messages, stalling the input until the output
channel is ready to receive them and basically "passing through" the
data, with limited need to buffer messages on disk... unless the
output channel doesn't respond, in which case they might reject the
message. An
SMTP server should be willing to act as a store-and-forward server,
and also must do individual delivery of messages to each RCPT (or at
least one per destination domain), so must have a way of dealing with
large messages, probably via disk buffering. The case of disk
buffering and retrying generally means that the whole message, not
just the large data parts, must be stored on disk, so the external
storage protocol should be able to deal with that case.
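Python's standard library already offers this spill-to-disk-on-demand policy in tempfile.SpooledTemporaryFile, so a server could use it directly for an incoming DATA stream of undeclared size. A small sketch (the _rolled flag is a CPython internal, peeked at here only to show when the spill happens):

```python
import tempfile

# Buffer an incoming, undeclared-size DATA stream in RAM, spilling to a
# temporary file on disk once it grows past max_size.
buf = tempfile.SpooledTemporaryFile(max_size=1024)
buf.write(b"x" * 500)
in_memory_before = not buf._rolled     # CPython internal flag: still in RAM
buf.write(b"x" * 1000)                 # total now 1500 bytes: spills to disk
spilled_after = buf._rolled
buf.seek(0)
data = buf.read()
assert in_memory_before and spilled_after
assert len(data) == 1500
buf.close()
```

The same object works for both the small-message and large-message cases, which is exactly the kind of dynamic adjustment described above.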
The minimal external storage format capability is to store the
received bytestream to disk, associate it with the envelope
information, and be able to retrieve it in whole later. This would
require having the whole thing in RAM at those two points in time,
however, and doesn't solve the real problem. Incremental writing and
reading of the external storage would be much more useful. Even more
useful would be "partially parsed" seek points.
An external storage system that provides "partially parsed"
information could include:
1) envelope information. This section is useful to SMTP servers, but
not to other email tools, so it should be optional. This could be a
copy of
the received RCPT command texts, complete with CRLF endings.
2) header information. This would be everything between DATA and the
first CRLF CRLF sequence.
3) data. Pre-MIME this would simply be the rest of the message, but
post-MIME it would be usefully more complex. If MIME headers can be
observed and parsed as the data passes through, then additional
metadata could be saved that could enhance performance of the later
processing steps. Such additional metadata could include the
beginning of each MIME part, the end of the headers for that part, and
the end of the data for that part.
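To make the metadata idea concrete, here is a toy sketch of recording those offsets. The boundary handling is deliberately naive and assumes well-formed CRLF delimiters; a real scanner would work incrementally as the bytes stream through:

```python
# Hypothetical sketch: record the byte offset where the message headers
# end, and (body_start, body_end) offsets for each MIME part's data, so a
# later reader can seek straight to any part without reparsing.

raw = (b"From: a@example.com\r\n"
       b"Content-Type: multipart/mixed; boundary=BND\r\n"
       b"\r\n"
       b"--BND\r\n"
       b"Content-Type: text/plain\r\n"
       b"\r\n"
       b"hello world\r\n"
       b"--BND--\r\n")

def scan_offsets(data: bytes, boundary: bytes) -> dict:
    index = {"header_end": data.find(b"\r\n\r\n") + 4, "parts": []}
    marker = b"--" + boundary
    pos = index["header_end"]
    while True:
        start = data.find(marker + b"\r\n", pos)   # start of a part
        if start < 0:
            break
        body = data.find(b"\r\n\r\n", start) + 4   # end of part headers
        end = data.find(b"\r\n" + marker, body)    # start of next delimiter
        index["parts"].append((body, end))
        pos = end
    return index

idx = scan_offsets(raw, b"BND")
body_start, body_end = idx["parts"][0]
assert raw[body_start:body_end] == b"hello world"
```

The index is small (a handful of integers per part) and could be stored alongside the raw message in the external storage.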
The result of saving that information would mean that minimal data
(just the headers) would need to be read in to create a tree
representing the email; the rest could be left in external storage
until it is
accessed... and then obtained directly from there when needed, and
converted to the form required by the request... either the whole
part, or some piece in a buffer.
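A sketch of that on-demand access, assuming an offset index of the kind described above was stored with the message (read_part and the index layout are hypothetical names):

```python
import io

# Hypothetical sketch: given a saved offset index, read only the requested
# part's body from storage, a buffer at a time, never the whole message.

raw = b"...headers...\r\n\r\nBIGPARTDATA-----more-----"
index = {"parts": [(17, 28)]}            # (body_start, body_end) pairs

def read_part(storage, index, n, bufsize=4):
    """Yield the n-th part's body in buffers of at most bufsize bytes."""
    start, end = index["parts"][n]
    storage.seek(start)
    remaining = end - start
    while remaining:
        chunk = storage.read(min(bufsize, remaining))
        remaining -= len(chunk)
        yield chunk

storage = io.BytesIO(raw)                # stands in for a file on disk
assert b"".join(read_part(storage, index, 0)) == b"BIGPARTDATA"
```

The caller gets either the whole part (by joining the buffers) or a piece at a time, matching both access styles mentioned above.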
So there could be a variety of external storage systems... one that
stores in memory, one that stores on disk per the ideas above, and a
variety that retain some amount of cached information about the email,
even though they store it all on disk. Sounds like this could be a
plug-in, or an attribute of a message object creation.
But to me, it sounds like the foundation upon which the whole email
lib should be built, not something that is shoveled in later.
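Sketched as a plug-in interface (all names hypothetical), with the in-memory backend as the trivial implementation; a disk-backed or caching backend would implement the same two methods:

```python
from abc import ABC, abstractmethod

# Hypothetical plug-in interface for external message storage; a message
# object would be constructed with (or configured to use) one of these.

class MessageStore(ABC):
    @abstractmethod
    def write(self, data: bytes) -> None:
        """Append raw message bytes to the store."""

    @abstractmethod
    def read(self, start: int, size: int) -> bytes:
        """Read size bytes beginning at offset start."""

class MemoryStore(MessageStore):
    """Everything in RAM: the simplest backend."""

    def __init__(self):
        self._buf = bytearray()

    def write(self, data):
        self._buf += data

    def read(self, start, size):
        return bytes(self._buf[start:start + size])

store = MemoryStore()
store.write(b"Subject: hi\r\n\r\nbody")
assert store.read(15, 4) == b"body"
```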
A further note about access to data parts... clearly "data for the
whole MIME part" could be provided, but even for a single part that
could be large. So access to smaller chunks might be desired.
The data access/conversion functions, therefore, should support a
buffer-at-a-time access interface. Base64 supports random access
easily, unless the stream contains characters outside its 64-character
alphabet that are to be ignored; those would throw off the size
calculations. So maybe
providing sequential buffer-at-a-time access with rewind is the best
that can be done -- quoted-printable doesn't support random access
very well, and neither would some sort of compression or encryption
technique -- they usually like to start from the beginning -- and
those are the sorts of things that I would consider likely to be
standardized in the future, to reduce the size of the payload, and to
increase the security of the payload.
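A sketch of such a sequential, rewindable interface over base64 (names hypothetical). Note that the chunk size must be a multiple of 4 base64 characters, and the sketch assumes the encoded stream contains no ignorable characters such as CRLFs, which is exactly the complication described above:

```python
import base64

class SequentialB64Reader:
    """Sequential buffer-at-a-time access with rewind; no random seek."""

    def __init__(self, encoded: bytes, chunk_chars: int = 64):
        assert chunk_chars % 4 == 0   # base64 quanta are 4 characters
        self._encoded = encoded
        self._chunk = chunk_chars
        self._pos = 0

    def read(self):
        """Return the next decoded buffer, or b'' at end of stream."""
        piece = self._encoded[self._pos:self._pos + self._chunk]
        self._pos += self._chunk
        return base64.b64decode(piece)

    def rewind(self):
        self._pos = 0

payload = bytes(range(256)) * 3
r = SequentialB64Reader(base64.b64encode(payload))
chunks = []
while True:
    c = r.read()
    if not c:
        break
    chunks.append(c)
assert b"".join(chunks) == payload
r.rewind()
assert r.read() == payload[:48]     # 64 base64 chars decode to 48 bytes
```

The same read/rewind interface could front a quoted-printable, compressed, or encrypted stream, since none of those require random access.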
--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
_______________________________________________
Email-SIG mailing list
Email-SIG@python.org