On approximately 10/9/2009 5:23 AM, came the following characters from the keyboard of Barry Warsaw:
On Oct 8, 2009, at 6:50 PM, Glenn Linderman wrote:

On approximately 10/8/2009 4:40 AM, came the following characters from the keyboard of Stephen J. Turnbull:
Glenn Linderman writes:

> > > If conversions are avoided, then octets are unlikely to be out of
> > > range?
> >
> > Haven't looked in your spam bucket recently, I guess.  Spammers
> > regularly put 8 bit characters into headers (and into bodies in
> > messages without a Content-Type header), for one thing.
> I'm aware of that, but if conversions are not done, octets are unlikely
> to be _reported_ to be out of range....

Conversions will eventually be done.  "Best it were done quickly."


Disagree. Deferring the conversions defers failure issues to the point where the code (hopefully) somewhat understands the type of data being manipulated, and can then handle it appropriately. Converting up front causes errors in things that may never be touched or needed, so the error detection and handling is wasteful.

I'm with Stephen here. Remember, we're saying the parser should never throw an exception, so any such conversion exception happens when you manipulate the model directly. That /has/ to error early because otherwise it is impossible to debug.

I suspect we are talking with different terminology somehow, here. At least it seems that way, between myself and Stephen. So let me return to ground zero, and ask some very basic questions, to see what, if anything, I am missing in my understanding of Stephen's and perhaps your, terminology.

Let me speak in terms of parsing incoming wire-format messages, because the creation of a valid email from API calls should be straightforward.

I see the necessary job of the parser as receiving chunks of the message, parsing the headers into individual headers (based mostly on CR LF TAB detection), and finding the end of the headers. Then, in order to properly handle the body, it needs to find several specific headers, or supply defaults for them if lacking. This includes validating the MIME-Version and determining the Content-Type and Content-Transfer-Encoding. Other headers do not need to be decoded at parse time, if I understand things, just parsed into buckets (a list to preserve order, with possibly an index of some sort for performance if necessary). The 3 headers mentioned should be fully validated and decoded, so that parsing the body can proceed. Parsing the body finds one or more MIME parts, and for each part, a list of its headers should be created. Content-Type and Content-Transfer-Encoding should again be fully validated and decoded, so that parsing the body of each part can proceed recursively. The leaf MIME parts should have their wire format data stored also.

Do you agree with that minimal functionality of message parsing?
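
To make the header-handling step above concrete, here is a minimal sketch, not a proposal for the actual email package API. The names split_headers and structural_headers are made up for illustration, and the defaults shown (text/plain, 7bit, and None for a missing MIME-Version) are my assumptions about what "supply defaults" would mean. All other headers stay as raw bytes in an ordered list.

```python
import re

# Headers that are decoded at parse time so body parsing can proceed;
# every other header is kept as raw bytes (hypothetical design).
STRUCTURAL = {b"mime-version", b"content-type", b"content-transfer-encoding"}


def split_headers(wire: bytes):
    """Split a wire-format message into ([(name, raw_value), ...], body)."""
    head, _sep, body = wire.partition(b"\r\n\r\n")
    # Unfold continuation lines: a CRLF followed by a space or tab.
    unfolded = re.sub(rb"\r\n(?=[ \t])", b"", head)
    headers = []
    for line in unfolded.split(b"\r\n"):
        name, colon, value = line.partition(b":")
        if colon:
            headers.append((name.strip(), value.strip()))
    return headers, body


def structural_headers(headers):
    """Decode only the three headers needed to parse the body."""
    found = {
        "mime-version": None,  # assumption: None means "not declared"
        "content-type": "text/plain",
        "content-transfer-encoding": "7bit",
    }
    for name, raw in headers:
        key = name.lower()
        if key in STRUCTURAL:
            found[key.decode("ascii")] = raw.decode("ascii", "replace")
    return found


wire = (b"Subject: caf\xe9\r\n"
        b"Content-Type: multipart/mixed;\r\n\tboundary=x\r\n"
        b"\r\nbody")
headers, body = split_headers(wire)
info = structural_headers(headers)
```

Note that the Subject header with its 8-bit octet is never decoded here; it just sits in the list until something asks for it.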

If content boundaries cannot be found, then the parsing will fail, and a defect report is generated for that part and for any higher-level parts that include it, because they will also be incomplete. That is just a parse-error flag, in the tree of MIME parts, AFAICT.
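
The parse-error flag in the tree could look roughly like the following. This is only a sketch under my own assumptions: Part is a hypothetical class, and the defect is a plain string rather than whatever defect type the package would really use.

```python
from dataclasses import dataclass, field


@dataclass
class Part:
    """Hypothetical node in the tree of MIME parts."""
    content_type: str
    children: list = field(default_factory=list)
    defects: list = field(default_factory=list)

    @property
    def incomplete(self) -> bool:
        # A part is incomplete if it carries a defect itself or
        # contains a part that is incomplete.
        return bool(self.defects) or any(c.incomplete for c in self.children)


leaf_ok = Part("text/plain")
inner_bad = Part("multipart/alternative", defects=["boundary not found"])
root = Part("multipart/mixed", children=[leaf_ok, inner_bad])
```

Here root.incomplete is true because inner_bad is, while leaf_ok stays clean. No exception is raised anywhere during parsing.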

I see the further validation and decoding of the MIME tree for the message to be all based on API calls by the application to manipulate the model, which should be able to raise exceptions as needed, and could have fully Pythonic interfaces.

If the client wishes to have all headers, header values, and charset decoding validated before doing model manipulations, then it should call email package APIs that are provided to do that individually, per MIME part, or recursively over the model (and which might raise exceptions).

If the client wishes to have all leaf MIME parts decoded from wire format to "raw payload" or "decoded payload", before manipulating the model, then it should call the email package APIs that are provided to do that individually, per MIME part, or recursively over the model (and which might raise exceptions).
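
As an illustration of decoding a leaf on request rather than at parse time, here is a sketch using only the standard base64 and quopri modules. decode_payload is a hypothetical name, and the choice to raise ValueError for an unknown encoding is my assumption.

```python
import base64
import quopri


def decode_payload(wire: bytes, cte: str) -> bytes:
    """Decode stored wire-format data on demand; may raise."""
    cte = cte.strip().lower()
    if cte in ("7bit", "8bit", "binary"):
        return wire  # identity encodings: nothing to undo
    if cte == "quoted-printable":
        return quopri.decodestring(wire)
    if cte == "base64":
        # Raises binascii.Error (a ValueError subclass) on bad padding.
        return base64.b64decode(wire)
    raise ValueError(f"unknown Content-Transfer-Encoding: {cte!r}")
```

A client that never looks at a given leaf never pays for this call and never sees its exceptions.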

Is there any other functionality that should be performed? If so, why? It seems that Stephen is perhaps saying that the functionality in the above two paragraphs should be performed during parsing. Is that what is being said? I can hardly believe it, if so. Since there are multiple ways to interpret not-quite-perfect data, application guidance is required for those choices, and the creation of defect reports along the way would be a bookkeeping headache.

So for headers, which are supposed to be ASCII, or encoded via RFC rules to ASCII (no 8-bit chars), the discovery of an 8-bit char should produce a defect report, but the value should then simply be converted to Unicode as if it were Latin-1 (since there is no other knowledge available that could produce a better conversion). And if the result of that is not expected by the client (your definition), then the client should either notice the defect report and reject it based on that, or attempt to parse it, and reject it if it encounters unexpected syntax. After all, this is, for that client, "raw user input" (albeit from a remote source) so fully error checking the input is appropriate.
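
The Latin-1 fallback described above can be sketched in a few lines. The function name and the defect text are made up for illustration.

```python
def decode_header_value(raw: bytes):
    """Return (text, defects) for a raw header value; never raises."""
    defects = []
    if any(octet > 0x7F for octet in raw):
        defects.append("8-bit octet in header value")
    # Latin-1 maps every byte 0-255 to a code point, so this cannot fail.
    return raw.decode("latin-1"), defects


text, defects = decode_header_value(b"caf\xe9")
```

The client gets both the converted text and the defect list, and decides for itself whether to reject, repair, or ignore.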

Sure, but I can also think of lots of other things the client might do, including blowing away the header value and substituting their own, doing the moral equivalent of a str.replace(), etc. etc. It's not our job to decide. It's our job to provide the highest fidelity information we can and the best APIs for clients to do what they want.

Exactly. So if the client is going to blow away the header value, no point to validate and decode it.

If the client is going to send it on, the client can choose to validate before sending, or just send what was received, whether or not it was valid. This depends on the purpose and functionality of the client.


The problem with the APIs that are spelled __str__ and __bytes__ is that there is no other way to return errors other than exceptions.... the Python way. Since the email library is trying to avoid raising exceptions in large blocks of its code, it is non-Pythonic (which is what Oleg is probably complaining about, in part). But because it needs to avoid exceptions, and is therefore non-Pythonic, it may be inappropriate to spell very many of its APIs __str__ and __bytes__, because that is Pythonic, and requires exceptions. Once you become non-Pythonic in one area, you may have to also be non-Pythonic in some other areas...

As was pointed out in a previous message, we shouldn't be too concerned with __str__ and __bytes__ right now. We'll design non-magical APIs for everything and they'll do the right thing. We'll then alias what seems appropriate as __str__ and __bytes__ and they'll be as Pythonic as makes sense. When I say that, I'm thinking about the semantic differences Message objects currently have in their dict-like-plus API (which I still think makes perfect practical sense).

OK, it seems we all understand the limitations of the __str__, __bytes__, and assignment type APIs: they must either succeed, or raise exceptions. Can we agree that clients should only use such APIs when success is assured, or raising exceptions is acceptable? And that if a client complains about an exception in a case they thought success should have been assured, that it is not a bug if they misunderstood? Clearly the email package should document the conditions for which success can be assured, if there are any... and that it is fair game to raise exceptions if those conditions are not met.
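
To illustrate the split between a non-magical API and a magic method aliased on top of it, here is a sketch. RawHeader, to_text, and HeaderDefect are hypothetical names, not anything in the email package, and "success is assured" is assumed here to mean "the raw value is pure ASCII".

```python
class HeaderDefect(ValueError):
    """Hypothetical exception for a value that cannot be rendered strictly."""


class RawHeader:
    def __init__(self, raw: bytes):
        self.raw = raw

    def to_text(self):
        """Non-magical API: never raises, returns (text, defects)."""
        defects = []
        if not self.raw.isascii():
            defects.append("non-ASCII octet")
        return self.raw.decode("latin-1"), defects

    def __str__(self):
        # Magic method: it can only succeed or raise.
        text, defects = self.to_text()
        if defects:
            raise HeaderDefect("; ".join(defects))
        return text


clean = str(RawHeader(b"hello"))
lossy = RawHeader(b"caf\xe9").to_text()
```

With this shape, str() is documented to succeed only for ASCII values, and a client that cannot tolerate exceptions calls to_text() instead.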

--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
