Re: [Email-SIG] fixing the current email module

Glenn Linderman Fri, 09 Oct 2009 11:59:44 -0700

On approximately 10/9/2009 5:05 AM, came the following characters from the keyboard of Barry Warsaw:

On Oct 8, 2009, at 6:39 PM, Glenn Linderman wrote:
1) wire format. Either what came in, in the parser case, or what would be generated.
2) internal headers from the MIME part
3) decoded BLOB. This means that quopri and base64 are decoded, no more and no less. This is bytes. No headers, only payload. For Content-Transfer-Encoding: binary, this is mostly a noop.
4) text/* parts should also be obtainable as str()/unicode(), payload only. This is where charset decoding is done.

I think your talk in the next paragraph about hooks and other object types being produced is a generalization of 4, not 3, and generally no additional decoding needs to be done, just conversion to the right object type (or file, or file-like object).

I mostly agree with that. I've always called #4 the "decoded payload" and #3 I've usually called the "raw payload". Maybe we can bikeshed on better terms to help inform us about the API's method/attribute names.

It would be good though to have standardized terms for easier communication. Maybe as they are chosen, they could be added to that Wiki RDM set up?

My only problem with "raw" and "decoded" payload is that there are 3 payload formats, not 2, so there needs to be a 3rd term; the three terms would correspond to #1, #3, and #4, above. #2 is somewhat orthogonal to the payload.


To me, "raw" conjures up #1, not #3.

If Content-Transfer-Encoding is 7bit, 8bit, or binary, then #3 is the same as #1; it is just a terminology change. Only for a Content-Transfer-Encoding of quoted-printable or base64 must work be done to convert from #1 to #3.

If Content-Type is text/*, then the transformation from #3 to #4 is more than a cast, but for many other formats, it is mostly a cast.
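
To make the two steps concrete, here is a minimal sketch using only the standard base64 and quopri modules. The function names transfer_decode and text_decode are invented for illustration and are not part of the email package.

```python
import base64
import quopri


def transfer_decode(wire: bytes, cte: str) -> bytes:
    """Wire format -> decoded BLOB: undo the Content-Transfer-Encoding only."""
    cte = cte.lower()
    if cte == "base64":
        return base64.b64decode(wire)
    if cte == "quoted-printable":
        return quopri.decodestring(wire)
    # 7bit, 8bit, binary: the wire bytes already are the BLOB.
    return wire


def text_decode(blob: bytes, charset: str) -> str:
    """Decoded BLOB -> text for text/* parts: charset decoding, not a cast."""
    return blob.decode(charset)


wire = base64.b64encode("café".encode("utf-8"))  # what travels on the wire
blob = transfer_decode(wire, "base64")           # 5 bytes
text = text_decode(blob, "utf-8")                # 4 characters
same = transfer_decode(b"plain ascii", "7bit")   # returned unchanged
```

The byte count and the character count differ for the text part, which is why the last step is more than a cast there.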

Which brings up another point: right now Message objects have a single .get_payload() method that takes a flag to indicate whether it should be the decoded or raw payload. That's bong. These should be different interfaces.
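
For reference, this is roughly how the single method with its flag behaves in the standard library's email package under current Python 3 releases. The sample message is made up for the illustration.

```python
import base64
import email

body = base64.b64encode("café".encode("utf-8")).decode("ascii")
raw = (
    'Content-Type: text/plain; charset="utf-8"\n'
    "Content-Transfer-Encoding: base64\n"
    "\n" + body + "\n"
)
msg = email.message_from_string(raw)

encoded = msg.get_payload()                    # str, still base64 text
blob = msg.get_payload(decode=True)            # bytes, transfer encoding undone
text = blob.decode(msg.get_content_charset())  # charset step left to the caller
```

The same call returns a str or bytes depending on the flag, and the charset step is not reachable through it at all.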

Separate APIs would be clearer, but for compatibility, should .get_payload() be retained, with the flag? Fortunately, there is only one result value in any case, so it is just a matter of what the type of that output value is, and how it should be handled.

Perhaps the flag parameter should be extended to allow retrieval of all three payload formats instead of only two?

.get_payload could be converted to call the appropriate specific APIs, should it be desired to invent separate APIs for each payload format.
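
A rough sketch of that arrangement follows: three specific accessors, with a compatibility method that only dispatches to them. The Part class, its method names, and the "wire"/"blob"/"text" values of the form argument are all hypothetical; none of this is the existing Message API.

```python
import base64
import quopri


class Part:
    """Hypothetical MIME part, not the email package's Message class."""

    def __init__(self, wire: bytes, cte: str = "7bit", charset: str = "us-ascii"):
        self._wire = wire
        self._cte = cte.lower()
        self._charset = charset

    def get_wire_payload(self) -> bytes:
        return self._wire

    def get_blob_payload(self) -> bytes:
        if self._cte == "base64":
            return base64.b64decode(self._wire)
        if self._cte == "quoted-printable":
            return quopri.decodestring(self._wire)
        return self._wire  # 7bit, 8bit, binary

    def get_text_payload(self) -> str:
        return self.get_blob_payload().decode(self._charset)

    def get_payload(self, form: str = "wire"):
        """Compatibility shim: one entry point, three possible result types."""
        getters = {
            "wire": self.get_wire_payload,
            "blob": self.get_blob_payload,
            "text": self.get_text_payload,
        }
        if form not in getters:
            raise ValueError("unknown payload form: %r" % (form,))
        return getters[form]()


part = Part(b"caf=C3=A9", cte="quoted-printable", charset="utf-8")
```

Callers that want a fixed result type use the specific accessors; older code keeps calling get_payload.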

The problem is that if the bytes came off the wire, the parser currently can only attach the most basic MIME base class. It doesn't know that an image/png should create a MIMEImagePNG instance there. This is different from hacking the model directly because the application can instantiate the right class. So the parser either has to have a hookable way for an application to go from content-type to class, or the generic MIME base class needs to be hookable in its .decode() method.

So either the email package can stop at 3, and 4 only for text/* parts, or it could learn more types (registered types, with well-defined corresponding objects could be potentially built-in to the email package), and/or it could become hookable for application types. Of course, for disposition to files, storing the BLOB in a file of the right name is adequate... to avoid the file, I agree that converting to a useful object type is handy. But maybe file-like objects would suffice, for most of the types.

My own preference here is that email does support #4 with a registration system to handle returning concrete payload objects based on the Content-Type.

Sure, a registration system is fine. It could work for any type that has a method that can be registered, that accepts a binary BLOB and returns an appropriately typed and functioning object that can manipulate that type. That would mean that the application would have to make all the registration calls up front, instead of making the API calls when the objects are retrieved. Basically, if the email package doesn't have a registration system that the application can use, the application has to invent its own, so this is work that could benefit all applications.

I suppose the default registration for text/* would be to convert from whatever to Unicode, and the default registration for all other Content-Types would be to pass back bytes(). Or maybe a few other common types, for which specific types are available, some specific image/* types, perhaps, that seem to have MIME types defined for them, although perhaps people may still prefer to register, say, a PIL type, for images, so I agree the email package should only provide default registrations. On the other hand, I'm not sure how the registration system should work with threads, if different threads want different registrations...
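
One possible shape for such a registry, as a sketch only: the PayloadRegistry class, its register/convert methods, and the "type/*" wildcard key are assumptions made for this example, not anything the email package provides.

```python
class PayloadRegistry:
    """Hypothetical Content-Type -> converter table with the defaults above."""

    def __init__(self):
        self._converters = {}
        # Default for text/*: charset decoding to str.
        self.register(
            "text/*",
            lambda blob, params: blob.decode(params.get("charset", "us-ascii")),
        )

    def register(self, content_type, converter):
        """converter(blob: bytes, params: dict) -> any object."""
        self._converters[content_type.lower()] = converter

    def convert(self, content_type, blob, params=None):
        params = params or {}
        ctype = content_type.lower()
        maintype = ctype.split("/", 1)[0]
        # Most specific registration wins: exact type, then maintype/*.
        for key in (ctype, maintype + "/*"):
            converter = self._converters.get(key)
            if converter is not None:
                return converter(blob, params)
        # Default for everything else: pass back bytes.
        return bytes(blob)


registry = PayloadRegistry()
# An application-supplied registration; a real one might build an image object.
registry.register("image/png", lambda blob, params: {"format": "png", "size": len(blob)})

as_text = registry.convert("text/plain", b"caf\xc3\xa9", {"charset": "utf-8"})
as_image = registry.convert("image/png", b"\x89PNG")
as_bytes = registry.convert("application/pdf", b"%PDF")
```

Keeping the table on an instance rather than at module level is one way to sidestep the threading question: each thread or parser can hold its own registry.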

Actually, although it is not common practice to have encodings other than the RFC defined base64 and quoted-printable, a registration system for converting from #1 to #3, with appropriate defaults for base64, quoted-printable, binary, 7bit, 8bit, would be appropriate, and would provide a framework for allowing easy extensions to the encodings. Future mail RFCs may define some, but more likely, applications that wish to use email transports, where both ends are application controlled, might wish to define other encodings... the RFCs do allow for x-* encodings that are user defined. If a registration system is created for #3 to #4 encodings, the same mechanism could likely be used for the registration system for #1 to #3 encodings, so there would be added flexibility at very little cost.
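
The same idea applied to transfer encodings might look like the sketch below. TransferDecoderRegistry is a hypothetical class, and "x-hex" is an invented encoding name used only to show an application-defined x-* extension.

```python
import base64
import binascii
import quopri


def identity(wire: bytes) -> bytes:
    return wire


class TransferDecoderRegistry:
    """Hypothetical Content-Transfer-Encoding -> decoder table."""

    def __init__(self):
        self._decoders = {
            "base64": base64.b64decode,
            "quoted-printable": quopri.decodestring,
            "7bit": identity,
            "8bit": identity,
            "binary": identity,
        }

    def register(self, name, decoder):
        self._decoders[name.lower()] = decoder

    def decode(self, name, wire):
        try:
            decoder = self._decoders[name.lower()]
        except KeyError:
            raise LookupError("no decoder registered for %r" % (name,)) from None
        return decoder(wire)


registry = TransferDecoderRegistry()
# Application-defined encoding, agreed on by both ends of the transport.
registry.register("x-hex", binascii.unhexlify)

from_base64 = registry.decode("Base64", b"aGk=")
from_hex = registry.decode("x-hex", b"6869")
```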

I also think that the email package probably should not implement "store-payloads-on-disk" by default, although it may provide some example implementations for simple applications (much the same way there's wsgiref for simple applications).

Thinking about this, I agree that storing payloads on disk should not be the default action. However, if an application wants to control its memory consumption, the receipt of a large email could negatively impact that desire. It might be appropriate to place individual MIME parts on disk, as they are parsed, if the application indicates a threshold part size and/or threshold aggregate size, beyond which parts should be placed in cache. Along with that, the temporary storage location in which to place them would have to be configured.
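
A small sketch of the threshold idea, under the simplifying assumption that each part arrives as a complete bytes object (a real parser would more likely stream it). The PartStore class and its parameter names are hypothetical.

```python
import os
import tempfile


class PartStore:
    """Hypothetical cache: keep small parts in memory, spill the rest to disk."""

    def __init__(self, cache_dir, part_threshold, aggregate_threshold):
        self.cache_dir = cache_dir                      # configured temporary location
        self.part_threshold = part_threshold            # max bytes for one in-memory part
        self.aggregate_threshold = aggregate_threshold  # max bytes held in memory overall
        self.bytes_in_memory = 0

    def add(self, blob: bytes):
        """Return ("memory", bytes) or ("disk", path) for one parsed part."""
        too_big = len(blob) > self.part_threshold
        over_budget = self.bytes_in_memory + len(blob) > self.aggregate_threshold
        if too_big or over_budget:
            fd, path = tempfile.mkstemp(dir=self.cache_dir, suffix=".part")
            with os.fdopen(fd, "wb") as handle:
                handle.write(blob)
            return ("disk", path)
        self.bytes_in_memory += len(blob)
        return ("memory", blob)
```

The standard library's tempfile.SpooledTemporaryFile offers a similar per-object rollover through its max_size argument, which could serve when only the per-part threshold matters.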

Still, that's different than say, storing attachments in a file named by the Content-Disposition header's filename parameter. That latter is firmly in the domain of the application.

I again agree that this should not be the default action, but I assume that an API should be provided such that an application could tell the email package to place the content in the file named by the header's filename parameter. If such an API doesn't already exist, it seems it would be a helpful extension, and if the part was already cached on disk because of the above thresholds, the email package could possibly use rename instead of file copy to achieve the goal.
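
Such an API could be as small as the sketch below. The save_part function is hypothetical; it accepts either in-memory bytes or the path of an already cached part. Because the filename parameter comes from the message itself, the sketch strips directory components before using it.

```python
import os
import shutil


def save_part(content, filename: str, dest_dir: str) -> str:
    """Write a part to dest_dir under the header-supplied filename."""
    # Never trust a path from the message: keep only the final component.
    name = os.path.basename(filename.replace("\\", "/"))
    if name in ("", ".", ".."):
        name = "attachment"
    target = os.path.join(dest_dir, name)
    if isinstance(content, (bytes, bytearray)):
        with open(target, "wb") as handle:
            handle.write(content)
    else:
        # Already cached on disk: shutil.move renames when it can and
        # falls back to copy-and-delete across filesystems.
        shutil.move(content, target)
    return target
```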


--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
