On approximately 10/9/2009 5:05 AM, came the following characters from the keyboard of Barry Warsaw:
On Oct 8, 2009, at 6:39 PM, Glenn Linderman wrote:
1) wire format. Either what came in, in the parser case, or what would be generated.
2) internal headers from the MIME part
3) decoded BLOB. This means that quopri and base64 are decoded, no more and no less. This is bytes. No headers, only payload. For Content-Transfer-Encoding: binary, this is mostly a no-op.
4) text/* parts should also be obtainable as str()/unicode(), payload only. This is where charset decoding is done.
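
To make the four forms concrete, here is a minimal sketch that pulls each one out of a parsed message using calls the email package already has in Python 3. The sample message and the four_views() helper are invented for illustration; the discussion above does not propose them.

```python
from email import message_from_bytes

# Hypothetical sample: the base64 body encodes "café\n" in UTF-8.
RAW = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: text/plain; charset="utf-8"\r\n'
    b"Content-Transfer-Encoding: base64\r\n"
    b"\r\n"
    b"Y2Fmw6kK\r\n"
)


def four_views(raw_bytes):
    """Return the four forms listed above for a single, non-multipart part."""
    part = message_from_bytes(raw_bytes)
    wire = part.get_payload()              # 1) payload still transfer-encoded
    headers = part.items()                 # 2) the part's own headers
    blob = part.get_payload(decode=True)   # 3) base64/quopri undone, bytes
    charset = part.get_content_charset("ascii")
    text = blob.decode(charset)            # 4) charset applied, str
    return {"wire": wire, "headers": headers, "blob": blob, "text": text}


views = four_views(RAW)
```

Only step 4 involves the charset; steps 1 to 3 never look at it.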

I think your talk in the next paragraph about hooks and other object types being produced is a generalization of 4, not 3, and generally no additional decoding needs to be done, just conversion to the right object type (or file, or file-like object).
I mostly agree with that. I've always called #4 the "decoded payload" and #3 I've usually called the "raw payload". Maybe we can bikeshed on better terms to help inform us about the API's method/attribute names.

It would be good though to have standardized terms for easier communication. Maybe as they are chosen, they could be added to that Wiki RDM set up?

My only problem with "raw" and "decoded" payload is that there are 3 payload formats, not 2, so there needs to be a 3rd term, corresponding to #1, #3, and #4, above. #2 is somewhat orthogonal to the payload.

To me, "raw" conjures up #1, not #3.

If Content-Transfer-Encoding is 7bit, 8bit, or binary, then #3 is the same as #1; it is just a terminology change. Only for a Content-Transfer-Encoding of quoted-printable or base64 must work be done to convert from #1 to #3.

If Content-Type is text/*, then the transformation from #3 to #4 is more than a cast, but for many other formats, it is mostly a cast.

Which brings up another point: right now Message objects have a single .get_payload() method that takes a flag to indicate whether it should be the decoded or raw payload. That's bong. These should be different interfaces.

Separate APIs would be clearer, but for compatibility, should .get_payload() be retained, with the flag? Fortunately, there is only one result value in any case, so it is just a matter of what the type of that output value is, and how it should be handled.

Perhaps the flag parameter should be extended to allow retrieval of all three payload formats instead of only two?

If separate APIs are invented for each payload format, .get_payload() could be reimplemented to call the appropriate one.
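
One way that could look, as a rough sketch: three separate accessors, plus a compatibility wrapper that selects among them. Every function name and the "form" argument here are hypothetical; only Message.get_payload() and Message.get_content_charset() are existing email package calls.

```python
from email import message_from_string


# Hypothetical accessors; none of these names exist in the email package.
def get_wire_payload(part):
    """#1: the payload still in its Content-Transfer-Encoding, as str."""
    return part.get_payload()


def get_decoded_bytes(part):
    """#3: base64 or quoted-printable undone, returned as bytes."""
    return part.get_payload(decode=True)


def get_text(part):
    """#4: a text/* payload with its charset applied, returned as str."""
    charset = part.get_content_charset("ascii")
    return get_decoded_bytes(part).decode(charset, errors="replace")


_FORMS = {"wire": get_wire_payload, "bytes": get_decoded_bytes, "text": get_text}


def get_payload_compat(part, form="wire"):
    """Single entry point kept for compatibility; dispatches on 'form'."""
    return _FORMS[form](part)


sample = message_from_string(
    'Content-Type: text/plain; charset="iso-8859-1"\n'
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    "caf=E9\n"
)
```

With this shape, the type of the result is fixed by which accessor is called, and the wrapper is the only place where it varies.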

The problem is that if the bytes came off the wire, the parser currently can only attach the most basic MIME base class. It doesn't know that an image/png should create a MIMEImagePNG instance there. This is different from hacking the model directly because the application can instantiate the right class. So the parser either has to have a hookable way for an application to go from content-type to class, or the generic MIME base class needs to be hookable in its .decode() method.

So either the email package can stop at 3, and 4 only for text/* parts, or it could learn more types (registered types, with well-defined corresponding objects could be potentially built-in to the email package), and/or it could become hookable for application types. Of course, for disposition to files, storing the BLOB in a file of the right name is adequate... to avoid the file, I agree that converting to a useful object type is handy. But maybe file-like objects would suffice, for most of the types.

My own preference here is that email does support #4 with a registration system to handle returning concrete payload objects based on the Content-Type.

Sure, a registration system is fine. It could work for any type that has a method that can be registered, that accepts a binary BLOB and returns an appropriate typed and functioning object that can manipulate that type. That would mean that the application would have to make all the registration calls up front, instead of making the API calls when the objects are retrieved. Basically, if the email package doesn't have a registration system that the application can use, the application has to invent its own, so this is work that could benefit all applications.

I suppose the default registration for text/* would be to convert from whatever charset to Unicode, and the default registration for all other Content-Types would be to pass back bytes(). Or maybe there could be defaults for a few other common types for which specific object types are available: some specific image/* types, perhaps, that seem to have MIME types defined for them. People may still prefer to register, say, a PIL type for images, so I agree the email package should only provide default registrations. On the other hand, I'm not sure how the registration system should work with threads, if different threads want different registrations...
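
A minimal sketch of such a registry, assuming a hypothetical PayloadRegistry class that is not part of the email package. Keeping the table on an instance rather than in a module-level global is one possible answer to the thread question: each thread or application can hold its own registry.

```python
from email import message_from_string


class PayloadRegistry:
    """Hypothetical Content-Type -> converter table (not in the email package)."""

    def __init__(self):
        self._converters = {}

    def register(self, content_type, converter):
        # converter is called as converter(blob: bytes, part) -> any object
        self._converters[content_type.lower()] = converter

    def convert(self, part):
        """Convert a leaf (non-multipart) part to its registered object type."""
        blob = part.get_payload(decode=True)
        exact = part.get_content_type()                  # e.g. "text/plain"
        wildcard = part.get_content_maintype() + "/*"    # e.g. "text/*"
        for key in (exact, wildcard):
            if key in self._converters:
                return self._converters[key](blob, part)
        return blob  # default for unregistered types: plain bytes


def text_to_str(blob, part):
    """Default text/* conversion: apply the declared charset."""
    return blob.decode(part.get_content_charset("ascii"), errors="replace")


def default_registry():
    """A fresh registry holding only the text/* default."""
    registry = PayloadRegistry()
    registry.register("text/*", text_to_str)
    return registry
```

An application that prefers, say, an image library's object for image/png would call register() once at startup with its own converter; the exact type is tried before the wildcard.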


Actually, although it is not common practice to have encodings other than the RFC-defined base64 and quoted-printable, a registration system for converting from #1 to #3, with appropriate defaults for base64, quoted-printable, binary, 7bit, 8bit, would be appropriate, and would provide a framework for allowing easy extensions to the encodings. Future mail RFCs may define some, but more likely, applications that wish to use email transports, where both ends are application controlled, might wish to define other encodings... the RFCs do allow for x-* encodings that are user defined. If a registration system is created for #3 to #4 encodings, the same mechanism could likely be used for the registration system for #1 to #3 encodings, so there would be added flexibility at very little cost.
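
The #1 to #3 side could be sketched the same way. The table, register_cte(), decode_transfer(), and the "x-hex" encoding below are all invented for illustration; the decoders themselves are the standard library's base64, quopri, and binascii functions.

```python
import base64
import binascii
import quopri


def _identity(data):
    """7bit, 8bit and binary need no transformation."""
    return data


# Hypothetical registry: Content-Transfer-Encoding name -> bytes decoder.
CTE_DECODERS = {
    "7bit": _identity,
    "8bit": _identity,
    "binary": _identity,
    "base64": base64.b64decode,
    "quoted-printable": quopri.decodestring,
}


def register_cte(name, decoder):
    """Add or replace a decoder, e.g. for an application-defined x-* encoding."""
    CTE_DECODERS[name.lower()] = decoder


def decode_transfer(wire, cte="7bit"):
    """Convert wire-format payload bytes (#1) into the decoded BLOB (#3)."""
    try:
        decoder = CTE_DECODERS[cte.strip().lower()]
    except KeyError:
        raise LookupError("no decoder registered for %r" % cte)
    return decoder(wire)


# An application controlling both ends might register its own encoding.
register_cte("x-hex", binascii.unhexlify)
```

Encoding names are matched case-insensitively here, since the header value is case-insensitive.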

I also think that the email package probably should not implement "store-payloads-on-disk" by default, although it may provide some example implementations for simple applications (much the same way there's wsgiref for simple applications).

Thinking about this, I agree that storing payloads on disk should not be the default action. However, if an application wants to control its memory consumption, the receipt of a large email could negatively impact that desire. It might be appropriate to place individual MIME parts on disk, as they are parsed, if the application indicates a threshold part size and/or threshold aggregate size, beyond which parts should be placed in cache. Along with that, the temporary storage location in which to place them would have to be configured.
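
For the per-part threshold, the standard library's tempfile.SpooledTemporaryFile already behaves roughly this way: it stays in memory until a size limit is exceeded, then moves to a real temporary file. A small sketch, where store_part() and its parameter names are assumptions, and no aggregate threshold is attempted:

```python
import tempfile


def store_part(blob, max_in_memory=64 * 1024, spool_dir=None):
    """Keep a decoded part in memory up to a threshold, else spill to disk.

    max_in_memory is the per-part threshold in bytes; spool_dir is the
    configured temporary location (None means the platform default).
    """
    spool = tempfile.SpooledTemporaryFile(max_size=max_in_memory, dir=spool_dir)
    spool.write(blob)   # rolls over to a disk file once max_size is exceeded
    spool.seek(0)       # rewind so the caller can read from the start
    return spool
```

The caller sees the same file-like object either way, so the choice between memory and disk stays invisible to code that only reads the part.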

Still, that's different than, say, storing attachments in a file named by the Content-Disposition header's filename parameter. The latter is firmly in the domain of the application.

I again agree that this should not be the default action, but I assume that an API should be provided such that an application could tell the email package to place the content in the file named by the header's filename parameter. If such an API doesn't already exist, it seems it would be a helpful extension, and if the part was already cached on disk because of the above thresholds, the email package could possibly use rename instead of file copy to achieve the goal.
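
A sketch of what such a helper might do, assuming a hypothetical save_attachment() function and a cached_path argument pointing at a part already spilled to disk. Message.get_filename() and os.replace() are existing calls; the rest is illustrative. The filename comes from the sender, so the sketch keeps only its final path component.

```python
import os
from email import message_from_string


def save_attachment(part, dest_dir, cached_path=None):
    """Store a part under the name given by its filename parameter."""
    filename = part.get_filename()
    if not filename:
        raise ValueError("part has no filename parameter")
    # Sender-supplied name: drop any directory components before using it.
    safe_name = os.path.basename(filename.replace("\\", "/"))
    if safe_name in ("", ".", ".."):
        raise ValueError("unusable filename: %r" % filename)
    target = os.path.join(dest_dir, safe_name)
    if cached_path is not None:
        # Already on disk: rename instead of copying. This only works when
        # both paths are on the same filesystem.
        os.replace(cached_path, target)
    else:
        with open(target, "wb") as out:
            out.write(part.get_payload(decode=True))
    return target
```

If the cache and the destination can live on different filesystems, a copy-and-delete fallback would be needed in place of the rename.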

--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

_______________________________________________
Email-SIG mailing list
Email-SIG@python.org
