This is a long email, for which my apologies. I hope you all will manage to find some time to read it and provide feedback, as it speaks to fundamental design issues.
My subconscious seems to have been very busy last night, since in the shower this morning it presented me with a whole bunch of thoughts about the email API. This was triggered, I think, by Barry's question about __version__, my response that we might want an 'api version' declaration, and some comments made during the email 5.1 discussion by Steven D'Aprano (I think) about how Message is really the idealized representation of an email message.

Let me start by saying that I think we can all agree that the fundamental design of the email package is excellent: we have a Parser which handles taking input from the outside world and turning it into a Message, and we have a Generator which handles taking a Message and turning it into something the outside world can handle. When the package was originally developed, the "outside world" was, sensibly, RFC 822/2822 encoded byte streams.

The idealized message consists of some meta information (addressee, recipient, date, etc, etc) and a body. The body, the content, can be arbitrarily complex. The purpose of the message is to convey some of that meta information and all of the arbitrarily complex body content from the sender to the recipient. Everything else is an implementation detail :)

So, if we are writing a program and we want to compose such a message, it makes sense that we can build up this idealized message from its component pieces by attaching objects representing those pieces to the Message. At that stage we care nothing about how it needs to be transformed to get from point A to point B. If we want to look at a message, we again don't care about how it was transformed to get from point A to point B, we just want to be able to access the content in its original form.

In today's "outside world" we have more to worry about than just RFC822/2822/5322. The "outside world" could be an http transmission medium. It could (if we re-design things right:) be a SIP session.
It could be a disk-based data store, where an RFC822-like message format is being used to store data. I'm sure there are other contexts as well. So keeping the external representation concerns separate from the idealized message model makes sense.

The email4/5 API doesn't do this as successfully as it could, especially in a Python3 context. The application program dealing with the idealized message doesn't really care what character set any given piece of a header is encoded in, it really just wants to deal with complete unicode strings. The application program also really doesn't care about the MIME type of a piece of content, it just wants to manage an object that has methods that allow it to manipulate that image, or that audio file, or what have you. Of course, it also needs to know what type of object it is handling in an incoming message, but the mime type is only one piece of the information that determines that (albeit usually the most important one).

(Yes, some applications *do* care about internal details...but those are special cases and we can provide additional APIs that allow access at that level for those applications that need it, as we have discussed previously.)

The point of creating a new API is to make all of this easier for the application programmer. What doesn't change is the fundamental structure of the package: a message in some transmission format is fed to a Parser, which produces a Message object. A Message object can be fed to a Generator, which produces a transmission format object.

Now, I lost sight of this a bit while I was working on the email6 header classes, as Barry at least will remember, but I do think it is important, and I want to keep it in the forefront of my mind as I work on adding the proposed policy framework.

So, and here is the point of this email, how does the policy framework integrate into this design?

I said that the policy pulls together the tunable bits of the email package's algorithms. What does this mean?
What are the tunable bits? Here are some candidates:

    maximum header line length on serialization
    line ending character on serialization
    whether or not to raise an exception if a defect is encountered during parsing
    how much transformation of untouched original data is permissible when re-serializing a message
    can the serialized form contain any non-ASCII data?
    what classes to use to represent various MIME types

These are all decisions that can be made one way or another by an application program using the current package. Often, however, modifying the default is not easy or convenient. Note that the last one can only be decided by an application program when constructing a message, not when parsing one.

Here are some other things that it might be useful to be able to control:

    what string to use as the continuation whitespace when needing to add some
    what classes to use to represent various structured headers
    what exactly counts as a defect
    should headers be RFC2047 encoded on serialization, or should another encoding be used?[*]

[*] There are current real-world use cases for this: there are nntp servers that use utf-8 for headers, and the http protocol uses latin-1 (or sometimes, I think, utf-8)

This list breaks down into items that affect the Parser, ones that affect the Generator, and ones that affect both the Parser and the Message. (Well, the "how much transformation" affects all three in the sense that the data has to be preserved by both the Parser and the Message in order for the Generator to be able to implement it, but I think we can take it as a given that we are going to preserve that data.)

The pieces that are shared between the Parser and the Message are really about the Message: how are the sub-objects represented? How are the structured headers represented? So we could consider that the Parser is a *consumer* of those pieces of policy, but that they are defined on the Message, not on the Parser.

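
To make the list of knobs above a little more concrete, here is a rough sketch of how some of them might be gathered into one immutable object. Every name in it (Policy, max_line_length, linesep, raise_on_defect, must_be_7bit, clone) is a placeholder invented for illustration, not a settled API, and the defaults are guesses:

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Policy:
    """Hypothetical bundle of a few of the tunable bits listed above."""
    max_line_length: Optional[int] = 78  # None would mean "never wrap"
    linesep: str = "\n"                  # line ending used on serialization
    raise_on_defect: bool = False        # raise instead of just recording defects
    must_be_7bit: bool = True            # False would allow non-ASCII in the output

    def clone(self, **changes):
        """Return a copy with some knobs changed; the original is untouched."""
        return replace(self, **changes)


default = Policy()
strict = default.clone(raise_on_defect=True)
```

The idea being illustrated is only that adding a new knob means adding a field, rather than adding yet another keyword argument to every Parser and Generator method.
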
What this means is that the policies controlling each of the major components (parser, message, generator) are in principle independent.

The design of the policy framework envisions having, for example, an 'HTTP' policy that would, say, expect and generate latin-1 encoded headers, and generate headers without line breaks, using CRLF for the line termination. Initially I thought one would declare a policy and that the Message object would remember that policy, but that you could override it when, say, calling the generator.

Re-thinking it now, though, I think there are actually two distinct components here: the I/O policy(s), and the Message construction policy. That is, the things that the HTTP policy cares about are all Parser or Generator controls. The only thing the Message (should) care about is how to represent its components. The Message is thus independent of any policy *except* the header/mime classes, while the Parser and Generator can be consumers of the header/mime class policy used to construct the Message. It nevertheless makes sense to group the parser and generator policy controls together, since that is how we conceptually think of them ('HTTP' implies a coherent set of input and output policies).

So, I think the "policy framework" is actually two things: the header/mime-types registry, and the Parser/Generator policies. Let's have 'policy' refer to only the I/O policy, and call the other the email class registry.

This narrower definition of policy is a straightforward enhancement of the current API. It makes these "knobs" more easily controlled, and makes it easier to add new knobs without complicating the API. I propose that I write up this policy API as a distinct proposal/patch (with the work I've already done, this is more than half completed). This would add policy keywords to the Parser and Generator classes, and probably to the as_string method of Message.

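
As a rough illustration of what a policy keyword on the output side could look like, here is a toy sketch. IOPolicy, default_policy, http_policy and generate_headers are all hypothetical names, the function is only a stand-in for a real Generator, and the folding is deliberately naive:

```python
from collections import namedtuple

# Hypothetical I/O policy: only Parser/Generator concerns live here.
IOPolicy = namedtuple("IOPolicy", "header_charset max_line_length linesep")

default_policy = IOPolicy(header_charset="ascii", max_line_length=78, linesep="\n")
# The 'HTTP' example from above: latin-1 headers, no line breaks, CRLF endings.
http_policy = IOPolicy(header_charset="latin-1", max_line_length=None, linesep="\r\n")


def generate_headers(headers, policy=default_policy):
    """Toy stand-in for a Generator: serialize (name, value) pairs to bytes."""
    lines = []
    for name, value in headers:
        line = f"{name}: {value}"
        limit = policy.max_line_length
        if limit is not None and len(line) > limit:
            # Fold at most once, at the last space before the limit.
            cut = line.rfind(" ", 0, limit)
            if cut > len(name) + 1:  # never fold right after "Name:"
                line = line[:cut] + policy.linesep + line[cut:]
        lines.append(line)
    text = policy.linesep.join(lines) + policy.linesep
    # A real default would RFC 2047 encode non-ASCII; this toy lets encode() raise.
    return text.encode(policy.header_charset)


# The same Message data, serialized under the 'HTTP'-style policy.
print(generate_headers([("X-Note", "caf\u00e9")], policy=http_policy))
# prints b'X-Note: caf\xe9\r\n'
```

Note that nothing about the header data itself changes between the two calls; only the policy handed to the generator does, which is the separation argued for above.
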
The real meat of email6, then, is the header/mime-types registry, and the changes in the API of the resulting Message objects.

The parser currently accepts a _factory argument that specifies the object to be used in creating the Message. I propose that we deprecate this argument, but that any code using it gets the old behavior of the parser (using _factory to create the class for any new sub-objects). Then we introduce a new argument, 'factory'. This new argument would expect a callable that takes a mime-type as its argument, and returns an appropriate class. The parser would be re-written so that it could use this factory, and the backward compatibility case would be trivial to implement.

In theory the classes returned by the registry/factory are arbitrary, but in practice we will need to define the minimal API that they should provide. By specifying the API separately from the concrete implementation in email6, we will allow third parties to write classes that can play well with programs expecting to operate on email6 Messages. This will allow, for example, an MUA to provide custom classes to enhance presentation, while still allowing the message to be submitted to smtplib for transmission.

I guess I'm proposing, then, that there be an API version definition, with two values as of Python3.3: email5 API, and email6 API. We'll figure out how we name and interrogate these formally later.

The Header registry in this vision is accessed through the Message class. I have various thoughts about how this will work, but I'm going to leave those for later, since this email is already long enough. I also have some additional thoughts about backward compatibility, but it is going to require some experimentation to see if they are realistic.

--David