On approximately 1/25/2010 8:10 PM, came the following characters from the keyboard of Glenn Linderman:
That's true.  The Bytes and String versions of binary MIME parts,
which are likely to be the large ones, will probably have a common
representation for the payload, and could potentially point to the same
object.  That breaking of the expectation that 'encode' and 'decode'
return new objects (in analogy to how encode and decode of strings/bytes
works) might not be a good thing, though.

Well, one generator could provide the expectation that everything is new; another could provide different expectations. The differences between them, and the tradeoffs, would be documented, of course, were both provided. I'm not convinced that treating headers and data exactly the same at all times is a good thing... a convenient option at times, perhaps, but I can see it as a serious inefficiency in many use cases involving large data.

This deserves a bit more thought/analysis/discussion, perhaps. More than I have time for tonight, but I may reply again, perhaps after others have responded, if they do.

I guess no one else is responding here at the moment. Read the ideas below, and then afterward, consider building the APIs you've suggested on top of them. And then, with the full knowledge that the messages may be either in fast or slow storage, I think that you'll agree that converting the whole tree in one swoop isn't always appropriate... all headers, probably could be. Data, because of its size, should probably be done on demand.


In earlier discussions about the registry, there was the idea of having a registry for transport encoding handling, and a registry for MIME encoding handling. There were also vague comments about doing an external storage protocol "somehow", but it was a vague concept to be defined later, or at least I don't recall any definitions.

Given a raw bytes representation of an incoming email, mail servers need to choose how to handle it... this may need to be a dynamic choice based on current server load, as well as the obvious static server resources, as well as configured limits.

Unfortunately, the SMTP protocol does not require predeclaration of the size of the incoming DATA part, so servers cannot enforce size limits until they are exceeded. So as the data streams in, a dynamic adjustment to the handling strategy might be appropriate. Gateways may choose to route messages, and stall the input until the output channel is ready to receive it, and basically "pass through" the data, with limited need to buffer messages on disk... unless the output channel doesn't respond... then they might reject the message. An SMTP server should be willing to act as a store-and-forward server, and also must do individual delivery of messages to each RCPT (or at least one per destination domain), so must have a way of dealing with large messages, probably via disk buffering. The case of disk buffering and retrying generally means that the whole message, not just the large data parts, must be stored on disk, so the external storage protocol should be able to deal with that case.

The minimal external storage format capability is to store the received bytestream to disk, associate it with the envelope information, and be able to retrieve it in whole later. This would require having the whole thing in RAM at those two points in time, however, and doesn't solve the real problem. Incremental writing and reading to the external storage would be much more useful. Even more useful, would be "partially parsed" seek points.
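As a rough illustration of incremental writing and reading, here is a minimal sketch using the standard library's tempfile.SpooledTemporaryFile, which keeps data in memory until a size threshold is crossed and then rolls over to disk. The helper names spool_message and read_in_blocks, and the 64 KiB threshold, are assumptions made up for this example, not part of any existing email API.

```python
import tempfile


def spool_message(chunks, max_in_memory=64 * 1024):
    """Write an incoming byte stream chunk by chunk (hypothetical helper).

    Small messages stay in RAM; once max_in_memory bytes have been
    written, SpooledTemporaryFile moves the data to a disk file.
    """
    spool = tempfile.SpooledTemporaryFile(max_size=max_in_memory)
    total = 0
    for chunk in chunks:
        spool.write(chunk)
        total += len(chunk)
    spool.seek(0)
    return spool, total


def read_in_blocks(spool, block_size=8192):
    """Yield the stored bytes back one block at a time (hypothetical helper)."""
    while True:
        block = spool.read(block_size)
        if not block:
            break
        yield block
```

Neither function ever holds more than one chunk or block of the message itself, which is the property the whole-message approach lacks.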

An external storage system that provides "partially parsed" information could include:

1) envelope information. This section is useful to SMTP servers, but not other email tools, so should be optional. This could be a copy of the received RCPT command texts, complete with CRLF endings.

2) header information. This would be everything between DATA and the first CRLF CRLF sequence.

3) data. Pre-MIME this would simply be the rest of the message, but post-MIME it would be usefully more complex. If MIME headers can be observed and parsed as the data passes through, then additional metadata could be saved that could enhance performance of the later processing steps. Such additional metadata could include the beginning of each MIME part, the end of the headers for that part, and the end of the data for that part.
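The seek points described in item 3 could be recorded in something like the following sketch. It assumes a flat (non-nested) multipart message with a known boundary, and for brevity it scans bytes already in memory; a real implementation would note the offsets as the stream passes through. PartIndex and index_parts are hypothetical names, and the boundary match is naive (it does not check that the delimiter line ends right after the boundary).

```python
from dataclasses import dataclass


@dataclass
class PartIndex:
    start: int        # offset of the first byte of the part's headers
    headers_end: int  # offset just past the blank line ending those headers
    end: int          # offset just past the last byte of the part's data


def index_parts(raw, boundary):
    """Return (top_headers_end, [PartIndex, ...]) for a flat multipart message."""
    top_headers_end = raw.index(b"\r\n\r\n") + 4
    delim = b"\r\n--" + boundary
    parts = []
    # Start two bytes early so a delimiter directly after the headers matches.
    pos = raw.find(delim, top_headers_end - 2)
    while pos != -1:
        after = pos + len(delim)
        if raw.startswith(b"--", after):
            break  # closing delimiter: no more parts
        start = raw.index(b"\r\n", after) + 2
        next_pos = raw.find(delim, start - 2)
        end = max(next_pos, start) if next_pos != -1 else len(raw)
        if raw.startswith(b"\r\n", start):
            headers_end = start + 2  # part has no headers at all
        else:
            sep = raw.find(b"\r\n\r\n", start, end)
            headers_end = sep + 4 if sep != -1 else end
        parts.append(PartIndex(start, headers_end, end))
        pos = next_pos
    return top_headers_end, parts
```

With such an index saved alongside the raw bytes, a later reader can seek straight to any part's headers or data without rescanning the message.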

The result of saving that information would be that minimal data (just headers) would need to be read in to create a tree representing the email; the rest could be left in external storage until it is accessed... and then obtained directly from there when needed, and converted to the form required by the request... either the whole part, or some piece in a buffer.
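A tree node that leaves its payload in external storage might look roughly like this. LazyPart is a hypothetical class, and it assumes the storage is any seekable binary file object and that the start and end offsets of the payload were recorded earlier.

```python
class LazyPart:
    """Hypothetical handle on one part's payload left in external storage."""

    def __init__(self, storage, start, end):
        self._storage = storage  # any seekable binary file object
        self._start = start      # offset of the first payload byte
        self._end = end          # offset just past the last payload byte

    def __len__(self):
        return self._end - self._start

    def read_all(self):
        """Fetch the whole payload; only sensible when it is known to be small."""
        self._storage.seek(self._start)
        return self._storage.read(self._end - self._start)

    def iter_chunks(self, size=8192):
        """Yield the payload in pieces of at most size bytes."""
        pos = self._start
        while pos < self._end:
            self._storage.seek(pos)
            chunk = self._storage.read(min(size, self._end - pos))
            if not chunk:
                break  # storage is shorter than the recorded offsets
            pos += len(chunk)
            yield chunk
```

Nothing is read when the object is created, so building the tree costs only the headers; the payload bytes are touched only when a caller asks for them.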

So there could be a variety of external storage systems... one that stores in memory, one that stores on disk per the ideas above, and a variety that retain some amount of cached information about the email, even though they store it all on disk. Sounds like this could be a plug-in, or an attribute of a message object creation.

But to me, it sounds like the foundation upon which the whole email lib should be built, not something that is shoveled in later.

A further note about access to data parts... clearly "data for the whole MIME part" could be provided, but even for a single part that could be large. So access to smaller chunks might be desired.

The data access/conversion functions, therefore, should support a buffer-at-a-time access interface. Base64 supports random access easily, unless the encoded text contains characters outside the 64-character alphabet (which decoders are to ignore), since those could throw off the offset calculations. So maybe providing sequential buffer-at-a-time access with rewind is the best that can be done -- quoted-printable doesn't support random access very well, and neither would some sort of compression or encryption technique -- they usually like to start from the beginning -- and those are the sorts of things that I would consider likely to be standardized in the future, to reduce the size of the payload, and to increase the security of the payload.
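Sequential buffer-at-a-time access with rewind could be sketched for base64 as follows. Base64Reader is a hypothetical class; it assumes a seekable source and known start and end offsets for the encoded payload. Characters outside the base64 alphabet (line breaks included) are dropped before decoding, and leftover characters are carried over so that each decode call sees a multiple of four. Malformed trailing data will raise an error from base64.b64decode; the sketch makes no attempt to recover.

```python
import base64
import string

_B64_CHARS = (string.ascii_letters + string.digits + "+/=").encode("ascii")
_NOT_B64 = bytes(b for b in range(256) if b not in _B64_CHARS)


class Base64Reader:
    """Hypothetical sequential decoder over a base64 payload in storage."""

    def __init__(self, source, start, end, encoded_chunk=4096):
        self._source = source        # seekable binary file object
        self._start = start          # offset of first encoded byte
        self._limit = end            # offset just past last encoded byte
        self._chunk = encoded_chunk  # encoded bytes to read per step
        self.rewind()

    def rewind(self):
        """Go back to the beginning of the payload."""
        self._pos = self._start
        self._end = self._limit
        self._carry = b""

    def read(self):
        """Return the next decoded buffer, or b"" once the payload is exhausted."""
        while True:
            if self._pos >= self._end:
                rest, self._carry = self._carry, b""
                return base64.b64decode(rest) if rest else b""
            self._source.seek(self._pos)
            raw = self._source.read(min(self._chunk, self._end - self._pos))
            if not raw:
                self._end = self._pos  # storage ended early; flush the carry
                continue
            self._pos += len(raw)
            # Drop ignorable characters, then decode only whole 4-char groups.
            data = self._carry + raw.translate(None, _NOT_B64)
            usable = len(data) - len(data) % 4
            self._carry = data[usable:]
            if usable:
                return base64.b64decode(data[:usable])
```

The same read/rewind shape would also suit quoted-printable, compression, or encryption layers, since none of them needs to seek within the decoded data.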

--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
