Re: [Email-SIG] fixing the current email module

Glenn Linderman Fri, 09 Oct 2009 12:40:51 -0700

On approximately 10/9/2009 5:23 AM, came the following characters from the keyboard of Barry Warsaw:

On Oct 8, 2009, at 6:50 PM, Glenn Linderman wrote:
On approximately 10/8/2009 4:40 AM, came the following characters from the keyboard of Stephen J. Turnbull:
Glenn Linderman writes:
> > > If conversions are avoided, then octets are unlikely to be out of range?
> >
> > Haven't looked in your spam bucket recently, I guess.  Spammers
> > regularly put 8 bit characters into headers (and into bodies in
> > messages without a Content-Type header), for one thing.
> I'm aware of that, but if conversions are not done, octets are unlikely to be _reported_ to be out of range....
Conversions will eventually be done.  "Best it were done quickly."
Disagree. Deferring the conversions defers failure issues to the point where the code (hopefully) somewhat understands the type of data being manipulated, and can then handle it appropriately. Converting up front causes errors in things that may never be touched or needed, so the error detection and handling is wasteful.
I'm with Stephen here. Remember, we're saying the parser should never throw an exception, so any such conversion exception happens when you manipulate the model directly. That /has/ to error early because otherwise it is impossible to debug.

I suspect we are talking with different terminology somehow, here. At least it seems that way, between myself and Stephen. So let me return to ground zero, and ask some very basic questions, to see what, if anything, I am missing in my understanding of Stephen's and perhaps your, terminology.

Let me speak in terms of parsing incoming wire-format messages, because the creation of a valid email from API calls should be straightforward.

I see the necessary job of the parser as being to receive chunks of the message, parse the headers into individual headers (based mostly on CR LF TAB detection), and find the end of the headers. Then, in order to properly handle the body, it needs to find several specific headers, or supply defaults for them if lacking. These include validating MIME-Version, and determining the Content-Type and Content-Transfer-Encoding. Other headers do not need to be decoded at parse time, if I understand things, just parsed into buckets (a list to preserve order, with possibly an index of some sort for performance if necessary). The 3 headers mentioned should be fully validated and decoded, so that parsing the body can proceed. Parsing the body finds one or more MIME parts, and for each part, a list of its headers should be created. Content-Type and Content-Transfer-Encoding should again be fully validated and decoded, so that parsing the body of each part can proceed recursively. The leaf MIME parts should have their wire format data stored also.
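To make the "parsed into buckets" step concrete, here is a minimal sketch, assuming the whole message is already in memory and uses CR LF line endings. The name split_headers is hypothetical and not part of any existing email API. Header values stay as undecoded bytes, and folded continuation lines are kept with the header they belong to.

```python
def split_headers(raw: bytes):
    """Hypothetical helper: split a wire-format message into an ordered list
    of (name, raw_value) pairs plus the untouched body. Nothing is decoded."""
    # The first empty line (CR LF CR LF) ends the header block.
    head, _, body = raw.partition(b"\r\n\r\n")
    headers = []
    for line in head.split(b"\r\n"):
        if line[:1] in (b" ", b"\t") and headers:
            # Folded continuation line: keep it with the previous header.
            name, value = headers[-1]
            headers[-1] = (name, value + b"\r\n" + line)
        else:
            name, _, value = line.partition(b":")
            headers.append((name, value.lstrip(b" ")))
    return headers, body


raw = (b"Subject: hello\r\n world\r\n"
       b"Content-Type: text/plain\r\n"
       b"\r\n"
       b"body\r\n")
headers, body = split_headers(raw)
```

Only the headers the parser needs in order to continue (MIME-Version, Content-Type, Content-Transfer-Encoding) would then be looked up in this list and decoded; the rest stay in their buckets until someone asks for them.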


Do you agree with that minimal functionality of message parsing?

If content boundaries cannot be found, then the parsing will fail, and a defect report will be generated for that part, and for any higher-level parts that include it, because they will also be incomplete. That is just a parse-error flag, in the tree of MIME parts, AFAICT.
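For illustration, the email package in today's standard library already behaves roughly this way: a multipart message whose boundary never appears is parsed without raising, and the problem is recorded in the message's defects list. This shows the general idea only, not the design being proposed in this thread; the sample message is made up.

```python
import email

# A multipart message whose declared boundary never appears in the body.
raw = (
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="XYZ"\n'
    "\n"
    "This body never contains the XYZ boundary line.\n"
)

msg = email.message_from_string(raw)  # parsing does not raise

# The parse problem is recorded on the message object instead.
defect_names = [type(d).__name__ for d in msg.defects]
```

A client that cares can inspect msg.defects and reject the message; a client that does not care never pays for the check.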

I see the further validation and decoding of the MIME tree for the message to be all based on API calls by the application to manipulate the model, which should be able to raise exceptions as needed, and could have fully Pythonic interfaces.

If the client wishes to have all headers, header values, and charset decoding validated before doing model manipulations, then it should call email package APIs that are provided to do that individually, per MIME part, or recursively over the model (and which might raise exceptions).

If the client wishes to have all leaf MIME parts decoded from wire format to "raw payload" or "decoded payload", before manipulating the model, then it should call the email package APIs that are provided to do that individually, per MIME part, or recursively over the model (and which might raise exceptions).
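As a rough sketch of what "recursively over the model" could look like, the following uses the existing standard-library calls walk(), is_multipart(), and get_payload(decode=True). The wrapper name decode_leaves is hypothetical, and the sample message is made up.

```python
import email


def decode_leaves(msg):
    """Hypothetical helper: decode every leaf MIME part on request and
    return a list of (content_type, decoded_bytes) pairs."""
    return [
        (part.get_content_type(), part.get_payload(decode=True))
        for part in msg.walk()        # visits the whole MIME tree
        if not part.is_multipart()    # containers have no payload to decode
    ]


raw = (
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="B"\n'
    "\n"
    "--B\n"
    "Content-Type: text/plain\n"
    "Content-Transfer-Encoding: base64\n"
    "\n"
    "aGVsbG8=\n"
    "--B--\n"
)
leaves = decode_leaves(email.message_from_string(raw))
```

The point is that the decoding cost, and any failure that comes with it, is incurred only when the client asks for it, not during parsing.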

Is there any other functionality that should be performed? If so, why? It seems that Stephen is perhaps saying that the functionality in the above two paragraphs should be performed during parsing. Is that what is being said? I can hardly believe it, if so. Since there are multiple ways to interpret not-quite-perfect data, application guidance is required for those choices, and the creation of defect reports along the way would be a bookkeeping headache.

So for headers, which are supposed to be ASCII, or encoded via RFC rules to ASCII (no 8-bit chars), the discovery of an 8-bit char should produce a defect report, but the value should then simply be converted to Unicode as if it were Latin-1 (since there is no other knowledge available that could produce a better conversion). And if the result of that is not expected by the client (your definition), then the client should either notice the defect report and reject it based on that, or attempt to parse it, and reject it if it encounters unexpected syntax. After all, this is, for that client, "raw user input" (albeit from a remote source) so fully error checking the input is appropriate.
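The Latin-1 fallback described above can be sketched in a few lines. The function name decode_raw_header and the wording of the defect string are hypothetical; the only fact relied on is that Latin-1 maps every octet to the code point with the same value, so the decode step cannot fail.

```python
def decode_raw_header(raw: bytes):
    """Hypothetical helper: return (text, defects) for a raw header value.
    Never raises; 8-bit octets are reported, then decoded as Latin-1."""
    defects = []
    if any(octet > 0x7F for octet in raw):
        defects.append("8-bit octet in header value")
    # Latin-1 decoding succeeds for any byte string.
    return raw.decode("latin-1"), defects


text, defects = decode_raw_header(b"Caf\xe9 menu")
clean_text, clean_defects = decode_raw_header(b"plain ascii")
```

Because the mapping is one-to-one, text.encode("latin-1") gives back the original octets, so a client that simply forwards the header loses nothing.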
Sure, but I can also think of lots of other things the client might do, including blowing away the header value and substituting their own, doing the moral equivalent of a str.replace(), etc. etc. It's not our job to decide. It's our job to provide the highest fidelity information we can and the best APIs for clients to do what they want.

Exactly. So if the client is going to blow away the header value, there is no point in validating and decoding it.

If the client is going to send it on, the client can choose to validate before sending, or just send what was received, whether or not it was valid. This depends on the purpose and functionality of the client.

The problem with the APIs that are spelled __str__ and __bytes__ is that there is no way to return errors other than exceptions.... the Python way. Since the email library is trying to avoid raising exceptions in large blocks of its code, it is non-Pythonic (which is what Oleg is probably complaining about, in part). But because it needs to avoid exceptions, and is therefore non-Pythonic, it may be inappropriate to spell very many of its APIs __str__ and __bytes__, because that is Pythonic, and requires exceptions. Once you become non-Pythonic in one area, you may have to also be non-Pythonic in some other areas...
As was pointed out in a previous message, we shouldn't be too concerned with __str__ and __bytes__ right now. We'll design non-magical APIs for everything and they'll do the right thing. We'll then alias what seems appropriate as __str__ and __bytes__ and they'll be as Pythonic as makes sense. When I say that, I'm thinking about the semantic differences Message objects currently have in their dict-like-plus API (which I still think makes perfect practical sense).

OK, it seems we all understand the limitations of the __str__, __bytes__, and assignment type APIs: they must either succeed, or raise exceptions. Can we agree that clients should only use such APIs when success is assured, or raising exceptions is acceptable? And that if a client complains about an exception in a case they thought success should have been assured, that it is not a bug if they misunderstood? Clearly the email package should document the conditions for which success can be assured, if there are any... and that it is fair game to raise exceptions if those conditions are not met.
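A small sketch of that split, with a non-magical method that never raises and a __str__ that either succeeds or raises. The class RawHeader and its method to_text are hypothetical names used for illustration only, not a proposed API.

```python
class RawHeader:
    """Hypothetical wrapper around an undecoded header value."""

    def __init__(self, raw: bytes):
        self.raw = raw

    def to_text(self):
        """Non-magical API: never raises, returns (text, defects)."""
        try:
            return self.raw.decode("ascii"), []
        except UnicodeDecodeError as exc:
            # Fall back to Latin-1 and report the problem as a defect.
            return self.raw.decode("latin-1"), [str(exc)]

    def __str__(self):
        # Magic API: the only way to signal failure is an exception.
        text, defects = self.to_text()
        if defects:
            raise ValueError("header is not clean ASCII: " + defects[0])
        return text


ok = str(RawHeader(b"text/plain"))
fallback_text, fallback_defects = RawHeader(b"\xff").to_text()
```

The documented condition for success here would be "the value is pure ASCII"; a client that cannot guarantee that should call to_text() and look at the defects instead.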


--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
