Another thought occurred to me regarding this "Access API"... an IMAP
implementation could defer obtaining data parts from the server until
requested, under the covers of this same API. Of course, for devices
with limited resources, that would probably be the optimal approach, but
for devices with lots of resources, an IMAP implementation might also
want to offer other options.
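As a sketch of what that deferral could look like behind a uniform access API (the names here, LazyPart and fetch_fn, are hypothetical, not an existing interface):

```python
# Hypothetical sketch: a MIME part that defers fetching its body from the
# server (e.g. via an IMAP BODY[n] FETCH) until the payload is requested.

class LazyPart:
    """A MIME part whose body is fetched from the server on first access."""

    def __init__(self, headers, fetch_fn):
        self.headers = headers        # parsed eagerly; headers are small
        self._fetch_fn = fetch_fn     # callable that retrieves the body
        self._payload = None

    def get_payload(self):
        # Defer the (possibly expensive) fetch until actually requested,
        # then cache the result for subsequent accesses.
        if self._payload is None:
            self._payload = self._fetch_fn()
        return self._payload


calls = []

def fake_fetch():
    calls.append(1)
    return b"attachment bytes"

part = LazyPart({"Content-Type": "application/octet-stream"}, fake_fetch)
assert calls == []                       # nothing fetched yet
assert part.get_payload() == b"attachment bytes"
part.get_payload()
assert len(calls) == 1                   # fetched exactly once, then cached
```

A resource-rich implementation could swap in an eager fetch_fn, or one that prefetches in the background, without changing the access API the caller sees.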
On approximately 1/28/2010 6:20 PM, came the following characters from
the keyboard of Glenn Linderman:
On approximately 1/25/2010 8:10 PM, came the following characters from
the keyboard of Glenn Linderman:
That's true. The Bytes and String versions of binary MIME parts,
which are likely to be the large ones, will probably have a common
representation for the payload, and could potentially point to the same
object. Breaking the expectation that 'encode' and 'decode' return
new objects (in analogy to how encode and decode of strings/bytes
work) might not be a good thing, though.
Well, one generator could provide the expectation that everything is
new; another could provide different expectations. The differences
between them, and the tradeoffs would be documented, of course, were
both provided. I'm not convinced that treating headers and data
exactly the same at all times is a good thing... a convenient option
at times, perhaps, but I can see it as a serious inefficiency in many
use cases involving large data.
This deserves a bit more thought/analysis/discussion, perhaps. More
than I have time for tonight, but I may reply again, perhaps after
others have responded, if they do.
I guess no one else is responding here at the moment. Read the ideas
below, and then afterward, consider building the APIs you've suggested
on top of them. And then, with the full knowledge that the messages
may be either in fast or slow storage, I think that you'll agree that
converting the whole tree in one swoop isn't always appropriate... the
headers probably could all be converted eagerly; the data, because of
its size, should probably be converted on demand.
In earlier discussions about the registry, there was the idea of
having a registry for transport encoding handling, and a registry for
MIME encoding handling. There were also vague comments about doing an
external storage protocol "somehow", but it was a vague concept to be
defined later, or at least I don't recall any definitions.
Given a raw bytes representation of an incoming email, mail servers
need to choose how to handle it... this may need to be a dynamic
choice based on current server load, in addition to the obvious static
server resources and configured limits.
Unfortunately, the SMTP protocol does not require predeclaration of
the size of the incoming DATA part, so servers cannot enforce size
limits until they are exceeded. So as the data streams in, a dynamic
adjustment to the handling strategy might be appropriate. Gateways
may choose to route messages, stalling the input until the output
channel is ready to receive them and basically "passing through" the
data, with limited need to buffer messages on disk... unless the
output channel doesn't respond, in which case they might reject the
message. An
SMTP server should be willing to act as a store-and-forward server,
and also must do individual delivery of messages to each RCPT (or at
least one per destination domain), so must have a way of dealing with
large messages, probably via disk buffering. The case of disk
buffering and retrying generally means that the whole message, not
just the large data parts, must be stored on disk, so the external
storage protocol should be able to deal with that case.
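Python's standard library already offers this spill-to-disk-on-demand policy in tempfile.SpooledTemporaryFile, so a server could use it directly for an incoming DATA stream of undeclared size. A small sketch (the _rolled flag is a CPython internal, peeked at here only to show when the spill happens):

```python
import tempfile

# Buffer an incoming, undeclared-size DATA stream in RAM, spilling to a
# temporary file on disk once it grows past max_size.
buf = tempfile.SpooledTemporaryFile(max_size=1024)
buf.write(b"x" * 500)
in_memory_before = not buf._rolled     # CPython internal flag: still in RAM
buf.write(b"x" * 1000)                 # total now 1500 bytes: spills to disk
spilled_after = buf._rolled
buf.seek(0)
data = buf.read()
assert in_memory_before and spilled_after
assert len(data) == 1500
buf.close()
```

The same object works for both the small-message and large-message cases, which is exactly the kind of dynamic adjustment described above.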
The minimal external storage format capability is to store the
received bytestream to disk, associate it with the envelope
information, and be able to retrieve it in whole later. This would
require having the whole thing in RAM at those two points in time,
however, and doesn't solve the real problem. Incremental writing and
reading of the external storage would be much more useful. Even more
useful would be "partially parsed" seek points.
An external storage system that provides "partially parsed"
information could include:
1) envelope information. This section is useful to SMTP servers, but
not to other email tools, so it should be optional. This could be a
copy of
the received RCPT command texts, complete with CRLF endings.
2) header information. This would be everything between DATA and the
first CRLF CRLF sequence.
3) data. Pre-MIME this would simply be the rest of the message, but
post-MIME it would be usefully more complex. If MIME headers can be
observed and parsed as the data passes through, then additional
metadata could be saved that could enhance performance of the later
processing steps. Such additional metadata could include the
beginning of each MIME part, the end of the headers for that part, and
the end of the data for that part.
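To make the metadata idea concrete, here is a toy sketch of recording those offsets. The boundary handling is deliberately naive and assumes well-formed CRLF delimiters; a real scanner would work incrementally as the bytes stream through:

```python
# Hypothetical sketch: record the byte offset where the message headers
# end, and (body_start, body_end) offsets for each MIME part's data, so a
# later reader can seek straight to any part without reparsing.

raw = (b"From: a@example.com\r\n"
       b"Content-Type: multipart/mixed; boundary=BND\r\n"
       b"\r\n"
       b"--BND\r\n"
       b"Content-Type: text/plain\r\n"
       b"\r\n"
       b"hello world\r\n"
       b"--BND--\r\n")

def scan_offsets(data: bytes, boundary: bytes) -> dict:
    index = {"header_end": data.find(b"\r\n\r\n") + 4, "parts": []}
    marker = b"--" + boundary
    pos = index["header_end"]
    while True:
        start = data.find(marker + b"\r\n", pos)   # start of a part
        if start < 0:
            break
        body = data.find(b"\r\n\r\n", start) + 4   # end of part headers
        end = data.find(b"\r\n" + marker, body)    # start of next delimiter
        index["parts"].append((body, end))
        pos = end
    return index

idx = scan_offsets(raw, b"BND")
body_start, body_end = idx["parts"][0]
assert raw[body_start:body_end] == b"hello world"
```

The index is small (a handful of integers per part) and could be stored alongside the raw message in the external storage.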
The result of saving that information would mean that minimal data
(just the headers) would need to be read in to create a tree
representing the email; the rest could be left in external storage
until it is
accessed... and then obtained directly from there when needed, and
converted to the form required by the request... either the whole
part, or some piece in a buffer.
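A sketch of that on-demand access, assuming an offset index of the kind described above was stored with the message (read_part and the index layout are hypothetical names):

```python
import io

# Hypothetical sketch: given a saved offset index, read only the requested
# part's body from storage, a buffer at a time, never the whole message.

raw = b"...headers...\r\n\r\nBIGPARTDATA-----more-----"
index = {"parts": [(17, 28)]}            # (body_start, body_end) pairs

def read_part(storage, index, n, bufsize=4):
    """Yield the n-th part's body in buffers of at most bufsize bytes."""
    start, end = index["parts"][n]
    storage.seek(start)
    remaining = end - start
    while remaining:
        chunk = storage.read(min(bufsize, remaining))
        remaining -= len(chunk)
        yield chunk

storage = io.BytesIO(raw)                # stands in for a file on disk
assert b"".join(read_part(storage, index, 0)) == b"BIGPARTDATA"
```

The caller gets either the whole part (by joining the buffers) or a piece at a time, matching both access styles mentioned above.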
So there could be a variety of external storage systems... one that
stores in memory, one that stores on disk per the ideas above, and a
variety that retain some amount of cached information about the email,
even though they store it all on disk. Sounds like this could be a
plug-in, or an attribute of a message object creation.
But to me, it sounds like the foundation upon which the whole email
lib should be built, not something that is shoveled in later.
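Sketched as a plug-in interface (all names hypothetical), with the in-memory backend as the trivial implementation; a disk-backed or caching backend would implement the same two methods:

```python
from abc import ABC, abstractmethod

# Hypothetical plug-in interface for external message storage; a message
# object would be constructed with (or configured to use) one of these.

class MessageStore(ABC):
    @abstractmethod
    def write(self, data: bytes) -> None:
        """Append raw message bytes to the store."""

    @abstractmethod
    def read(self, start: int, size: int) -> bytes:
        """Read size bytes beginning at offset start."""

class MemoryStore(MessageStore):
    """Everything in RAM: the simplest backend."""

    def __init__(self):
        self._buf = bytearray()

    def write(self, data):
        self._buf += data

    def read(self, start, size):
        return bytes(self._buf[start:start + size])

store = MemoryStore()
store.write(b"Subject: hi\r\n\r\nbody")
assert store.read(15, 4) == b"body"
```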
A further note about access to data parts... clearly "data for the
whole MIME part" could be provided, but even for a single part that
could be large. So access to smaller chunks might be desired.
The data access/conversion functions, therefore, should support a
buffer-at-a-time access interface. Base64 supports random access
easily, unless the stream contains characters outside its 64-character
alphabet that are to be ignored; those would throw off the size
calculations. So maybe
providing sequential buffer-at-a-time access with rewind is the best
that can be done -- quoted-printable doesn't support random access
very well, and neither would some sort of compression or encryption
technique -- they usually like to start from the beginning -- and
those are the sorts of things that I would consider likely to be
standardized in the future, to reduce the size of the payload, and to
increase the security of the payload.
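A sketch of such a sequential, rewindable interface over base64 (names hypothetical). Note that the chunk size must be a multiple of 4 base64 characters, and the sketch assumes the encoded stream contains no ignorable characters such as CRLFs, which is exactly the complication described above:

```python
import base64

class SequentialB64Reader:
    """Sequential buffer-at-a-time access with rewind; no random seek."""

    def __init__(self, encoded: bytes, chunk_chars: int = 64):
        assert chunk_chars % 4 == 0   # base64 quanta are 4 characters
        self._encoded = encoded
        self._chunk = chunk_chars
        self._pos = 0

    def read(self):
        """Return the next decoded buffer, or b'' at end of stream."""
        piece = self._encoded[self._pos:self._pos + self._chunk]
        self._pos += self._chunk
        return base64.b64decode(piece)

    def rewind(self):
        self._pos = 0

payload = bytes(range(256)) * 3
r = SequentialB64Reader(base64.b64encode(payload))
chunks = []
while True:
    c = r.read()
    if not c:
        break
    chunks.append(c)
assert b"".join(chunks) == payload
r.rewind()
assert r.read() == payload[:48]     # 64 base64 chars decode to 48 bytes
```

The same read/rewind interface could front a quoted-printable, compressed, or encrypted stream, since none of those require random access.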
--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
_______________________________________________
Email-SIG mailing list
Email-SIG@python.org