[Email-SIG] API thoughts

R. David Murray Tue, 01 Mar 2011 12:43:20 -0800

This is a long email, for which my apologies.  I hope you all will
manage to find some time to read it and provide feedback, as it speaks
to fundamental design issues.


My subconscious seems to have been very busy last night, since in the
shower this morning it presented me with a whole bunch of thoughts about
the email API.  This was triggered, I think, by Barry's question about
__version__, my response that we might want an 'api version' declaration,
and some comments made during the email 5.1 discussion by Steven D'Aprano
(I think) about how Message is really the idealized representation of
an email message.

Let me start by saying that I think we can all agree that the fundamental
design of the email package is excellent:  we have a Parser which handles
taking input from the outside world and turning it into a Message, and
we have a Generator which handles taking a Message and turning it into
something the outside world can handle.  When the package was first
developed, the "outside world" was, sensibly, RFC 822/2822 encoded
byte streams.

The idealized message consists of some meta information (sender,
recipient, date, etc, etc) and a body.  The body, the content, can be
arbitrarily complex.  The purpose of the message is to convey some of
that meta information and all of the arbitrarily complex body content
from the sender to the recipient.

Everything else is an implementation detail :)

So, if we are writing a program and we want to compose such a message, it
makes sense that we can build up this idealized message from its component
pieces by attaching objects representing those pieces to the Message.
At that stage we care nothing about how it needs to be transformed to
get from point A to point B.

If we want to look at a message, we again don't care about how it was
transformed to get from point A to point B, we just want to be able to
access the content in its original form.

In today's "outside world" we have more to worry about than just
RFC822/2822/5322.  The "outside world" could be an http transmission
medium.  It could (if we re-design things right:) be a SIP session.
It could be a disk-based data store, where an RFC822-like message format
is being used to store data.  I'm sure there are other contexts as well.

So keeping the external representation concerns separate from the
idealized message model makes sense.

The email4/5 API doesn't do this as successfully as it could, especially
in a Python3 context.  The application program dealing with the idealized
message doesn't really care what character set any given piece of a header
is encoded in; it really just wants to deal with complete unicode strings.
The application program also really doesn't care about the MIME type of a
piece of content, it just wants to manage an object that has methods that
allow it to manipulate that image, or that audio file, or what have you.
Of course, it also needs to know what type of object it is handling in an
incoming message, but the mime type is only one piece of the information
that determines that (albeit usually the most important one).

(Yes, some applications *do* care about internal details...but those
are special cases and we can provide additional APIs that allow access
at that level for those applications that need it, as we have discussed
previously.)

We propose to create a new API to make all of this easier for
the application programmer.  What doesn't change is the fundamental
structure of the package:  a message in some transmission format is
fed to a Parser, which produces a Message object.  A Message object
can be fed to a Generator, which produces a transmission format object.
Now, I lost sight of this a bit while I was working on the email6 header
classes, as Barry at least will remember, but I do think it is important,
and I want to keep it in the forefront of my mind as I work on adding
the proposed policy framework.
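
For reference, the Parser -> Message -> Generator round trip described
above looks roughly like this with the existing package.  This is only
a small sketch; the addresses and message text are made up for the
example.

```python
import io
from email.parser import Parser
from email.generator import Generator

# A message in its transmission format (made-up content).
raw = (
    "From: a@example.com\n"
    "To: b@example.com\n"
    "Subject: hello\n"
    "\n"
    "body text\n"
)

# Parser: transmission format -> Message object.
msg = Parser().parsestr(raw)
print(msg["Subject"])

# Generator: Message object -> transmission format.
out = io.StringIO()
Generator(out).flatten(msg)
serialized = out.getvalue()
print(serialized)
```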

So, and here is the point of this email, how does the policy framework
integrate into this design?

I said that the policy pulls together the tunable bits of the email
package's algorithms.  What does this mean?  What are the tunable
bits?  Here are some candidates:

    maximum header line length on serialization
    line ending character on serialization
    whether or not to raise an exception if a defect is encountered
        during parsing
    how much transformation of untouched original data is permissible
        when re-serializing a message
    can the serialized form contain any non-ASCII data?
    what classes to use to represent various MIME types.
   
These are all decisions that can be made one way or another by an
application program using the current package.  Often, however, modifying
the default is not easy or convenient.  Note that the last one can only
be decided by an application program when constructing a message, not
when parsing one.
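
To make the idea of "tunable bits" concrete, here is a rough sketch of
a policy as an immutable record of knobs.  Everything in it (the names
Policy, Defect, register_defect, and the default values) is
hypothetical and chosen only for illustration; it is not an existing
API.  The MIME class selection item is deliberately left out.

```python
from dataclasses import dataclass, replace


class Defect(Exception):
    """Hypothetical base class for problems found while parsing."""


@dataclass(frozen=True)
class Policy:
    """Hypothetical bundle of tunable knobs; defaults are illustrative."""
    max_line_length: int = 78      # header line length on serialization
    linesep: str = "\n"            # line ending on serialization
    raise_on_defect: bool = False  # raise, or just record, parse defects
    ascii_only: bool = True        # may serialized output hold non-ASCII?


def register_defect(policy, defects, defect):
    """Raise the defect or record it, depending on the policy."""
    if policy.raise_on_defect:
        raise defect
    defects.append(defect)


default = Policy()
# Derive a variant by overriding a single knob.
strict = replace(default, raise_on_defect=True)

defects = []
register_defect(default, defects, Defect("missing header separator"))
print(len(defects))
```

With the strict variant, the same call to register_defect would raise
instead of appending to the list.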

Here are some other things that it might be useful to be able to
control:

    what string to use as the continuation whitespace when needing
        to add some
    what classes to use to represent various structured headers
    what exactly counts as a defect
    should headers be RFC2047 encoded on serialization, or
        should another encoding be used?[*]

[*] There are current real-world use cases for this:  there are nntp
    servers that use utf-8 for headers, and the http protocol uses
    latin-1 (or sometimes, I think, utf-8)

This list breaks down into items that affect the Parser, ones that affect
the Generator, and ones that affect both the Parser and the Message.
(Well, the "how much transformation" affects all three in the sense that
the data has to be preserved by both the Parser and the Message in order
for the Generator to be able to implement it, but I think we can take
it as a given that we are going to preserve that data.)

The pieces that are shared between the Parser and the Message are really
about the Message:  how are the sub-objects represented?  How are the
structured headers represented?  So we could consider that the Parser
is a *consumer* of those pieces of policy, but that they are defined on
the Message, not on the Parser.

What this means is that the policies controlling each of the major
components (parser, message, generator) are in principle independent.

The design of the policy framework envisions having, for example, an
'HTTP' policy that would, say, expect and generate latin-1 encoded
headers, and generate headers without line breaks, using CRLF for the
line termination.  Initially I thought one would declare a policy
and that the Message object would remember that policy, but that you
could override it when, say, calling the generator.

Re-thinking it now, though, I think there are actually two distinct
components here: the I/O policy(s), and the Message construction policy.
That is, the things that the HTTP policy cares about are all Parser or
Generator controls.  The only thing the Message (should) care about is
how to represent its components.  The Message is thus independent of any
policy *except* the header/mime classes, while the Parser and Generator
can be consumers of the header/mime class policy used to construct the
Message.  It nevertheless makes sense to group the parser and generator
policy controls together, since that is how we conceptually think of them
('HTTP' implies a coherent set of input and output policies).

So, I think the "policy framework" is actually two things:  the
header/mime-types registry, and the Parser/Generator policies.  Let's have
'policy' refer to only the I/O policy, and call the other the email
class registry.

This narrower definition of policy is a straightforward enhancement
of the current API.  It makes these "knobs" more easily controlled,
and makes it easier to add new knobs without complicating the API.
I propose that I write up this policy API as a distinct proposal/patch
(with the work I've already done, this is more than half completed).
This would add policy keywords to the Parser and Generator classes,
and probably to the as_string method of Message.
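
As a sketch of what a policy keyword on the output side might look
like, the following toy serializer takes an I/O policy and applies it
to a list of headers.  The names IOPolicy, HTTP, and serialize_headers
are invented for this example, and the folding is deliberately naive;
the real Generator would be considerably more careful.

```python
import textwrap
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IOPolicy:
    """Hypothetical I/O policy; fields and defaults are illustrative."""
    linesep: str = "\n"
    max_line_length: Optional[int] = 78  # None means never fold
    header_charset: str = "ascii"


# An 'HTTP' flavored policy: latin-1 headers, no folding, CRLF endings.
HTTP = IOPolicy(linesep="\r\n", max_line_length=None,
                header_charset="latin-1")


def fold(line, policy):
    """Naively fold one header line according to the policy."""
    limit = policy.max_line_length
    if limit is None or len(line) <= limit:
        return [line]
    # Continuation lines start with a single space.
    return textwrap.wrap(line, limit, subsequent_indent=" ")


def serialize_headers(headers, policy=IOPolicy()):
    """Turn (name, value) pairs into bytes under the given policy."""
    lines = []
    for name, value in headers:
        lines.extend(fold(f"{name}: {value}", policy))
    # A blank line terminates the header block.
    text = policy.linesep.join(lines) + policy.linesep * 2
    return text.encode(policy.header_charset)


print(serialize_headers([("X-Name", "caf\xe9")], policy=HTTP))
```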

The real meat of email6, then, is the header/mime-types registry, and
the changes in the API of the resulting Message objects.  The parser
currently accepts a _factory argument that specifies the object to be used
in creating the Message.   I propose that we deprecate this argument,
but that any code using it gets the old behavior of the parser (using
_factory to create the class for any new sub-objects).  Then we introduce
a new argument, 'factory'.  This new argument would expect a callable
that takes a mime-type as its argument, and returns an appropriate class.
The parser would be re-written so that it could use this factory, and
the backward compatibility case would be trivial to implement.
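
A minimal sketch of such a 'factory' callable, assuming a simple
dictionary-backed registry.  The class names (Part, TextPart,
ImagePart) and the fallback rules are hypothetical; the proposal does
not specify them.

```python
class Part:
    """Hypothetical generic class for any MIME part."""


class TextPart(Part):
    """Hypothetical class for text/* content."""


class ImagePart(Part):
    """Hypothetical class for image content."""


# Registry keyed by full mime-type or by bare maintype.
_REGISTRY = {
    "text": TextPart,
    "image/png": ImagePart,
}


def factory(mime_type):
    """Take a mime-type string and return an appropriate class.

    Lookup order: exact type, then maintype, then the generic Part.
    """
    mime_type = mime_type.lower()
    maintype = mime_type.split("/", 1)[0]
    return _REGISTRY.get(mime_type) or _REGISTRY.get(maintype) or Part


print(factory("text/plain").__name__)
print(factory("image/png").__name__)
print(factory("application/pdf").__name__)
```

A parser written against this interface would call factory() once per
part and instantiate whatever class comes back.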

In theory the classes returned by the registry/factory are arbitrary,
but in practice we will need to define the minimal API that they
should provide.  By specifying the API separately from the concrete
implementation in email6, we will allow third parties to write classes
that can play well with programs expecting to operate on email6 Messages.
This will allow, for example, an MUA to provide custom classes to enhance
presentation, while still allowing the message to be submitted to smtplib
for transmission.

I guess I'm proposing, then, that there be an API version definition,
with two values as of Python3.3: email5 API, and email6 API.  We'll
figure out how we name and interrogate these formally later.

The Header registry in this vision is accessed through the Message class.
I have various thoughts about how this will work, but I'm going to leave
those for later, since this email is already long enough.  I also have
some additional thoughts about backward compatibility, but it is going
to require some experimentation to see if they are realistic.

--David