On approximately 10/8/2009 9:27 PM, came the following characters from the keyboard of Stephen J. Turnbull:
Glenn Linderman writes:

 > > Conversions will eventually be done.  "Best it were done quickly."
 >
 > Disagree.  Deferring the conversions defers failure issues to the point
 > where the code (hopefully) somewhat understands the type of data being
 > manipulated, and can then handle it appropriately.  Converting up front
 > causes errors in things that may never be touched or needed, so the
 > error detection and handling is wasteful.

That's theory; my position is based on Mailman practice.  Don't believe
me, ask Barry.  I also spend most of my OSS time on the
internationalization of XEmacs, and the experience is similar there.
Best to convert everything as early as possible, or admit that you
don't know how.

Emacs is different from email. Either you can read a file to edit it, or you can't. The Postel principle for email says to try to do the best you can, for as much as you can.

 > So for headers, which are supposed to be ASCII, or encoded via RFC rules
 > to ASCII (no 8-bit chars), then the discovery of an 8-bit char should
 > produce a defect report, but then simply be converted to Unicode as if it
 > were Latin-1 (since there is no other knowledge available that could
 > produce a better conversion).

No, that is already corruption.  Most clients will assume that string
is valid as a header, because it's valid as a string.

Sure it is corruption. That's why there is a defect report. But the conversion technique is appropriate, per the Postel principle.
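
To make the proposal concrete, here is a rough sketch in Python of "record a defect, then decode as Latin-1". The names decode_header_bytes and HeaderDefect are made up for illustration; they are not part of the email library.

```python
from dataclasses import dataclass


@dataclass
class HeaderDefect:
    """Hypothetical defect record; not an email-library class."""
    header: str
    offsets: list  # positions of the offending 8-bit bytes


def decode_header_bytes(name, raw):
    """Return (text, defects) for a raw header value.

    Pure-ASCII input produces no defects.  Input containing 8-bit
    bytes is still decoded (as Latin-1), but a defect is recorded.
    """
    defects = []
    bad = [i for i, b in enumerate(raw) if b > 0x7F]
    if bad:
        defects.append(HeaderDefect(name, bad))
    # Latin-1 maps every byte 0x00-0xFF to the same code point, so
    # this decode agrees with ASCII and can never raise.
    return raw.decode("latin-1"), defects


text, defects = decode_header_bytes("Subject", b"Caf\xe9 menu")
print(text)      # Café menu
print(defects)   # one HeaderDefect pointing at byte offset 3
```

The client gets usable text either way, and the defect list is there for any client that cares to reject or re-examine the header.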

 > And if the result of that is not expected by the client (your
 > definition), then the client should either notice the defect report
 > and reject it based on that, or attempt to parse it, and reject it
 > if it encounters unexpected syntax.  After all, this is, for that
 > client, "raw user input" (albeit from a remote source) so fully
 > error checking the input is appropriate.

No way.  That environment would suck to program in.  And it's
un-Pythonic: "Errors should never pass silently."

Then the Postel principle is un-Pythonic, and to be Pythonic any incorrect email should produce an error, and be unreadable. Again, I mentioned producing a defect report. That is not passing an error silently.

It is still raw user input, and should still be checked for proper syntax by the client, even if the email is well-formed and conversion produces no defect report. If you don't want to check proper syntax in your program inputs, I don't want to use your programs; they will be insecure.

 > Python way.  Since the email library is trying to avoid raising
 > exceptions in large blocks of its code, it is non-Pythonic

I disagree with that.  "Unless explicitly silenced."  The strategy
that Barry and I favor is to signal errors lazily.  So we *explicitly*
silence errors (at least of the Exception kind) when parsing.  If we
can't parse, we look for a part terminator, encapsulate the bad stuff
and move on to the rest of the input.  Later, at use time, *if* the
unparsable object is used, *then* the error will be raised, hopefully
with enough metainformation to figure out what to do about it.

So there seem to be two techniques:

1) Convert quickly, but don't raise errors... instead, build metainformation structures that record the errors, and raise them later if the converted data is accessed. Because some kinds of not-quite-perfect data have alternate handling techniques, either all techniques must be performed and cached, or *some processing must be deferred until the client can decide*.

2) Store the data, and convert only if the data is accessed. When client accesses the data, the exceptions raised allow the client to choose an appropriate processing technique for handling the not-quite-perfect data, based on the context of the client, the importance of that data item, etc. Only the result of that technique need be cached for future accesses.

With both techniques, the data is given to the email library, and the errors are not seen until later... potentially the exact same user experience. But with technique 1, much effort is expended to convert data, parse data, and create error metainformation ready to return IF the data is accessed. (Yeah, don't say it, premature optimization -- I call it design, in this case.) With technique 2, little effort is required to store the data and create a state variable to indicate whether it has been converted and parsed, or not. Then IF (and only IF) the data is accessed, the conversion and parsing must be done on the first access, and instead of creating and storing metainformation about the errors, they could just be raised.
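
A small sketch of the two techniques side by side, in Python. EagerHeader and LazyHeader are hypothetical classes written for this message, not anything in the email library, and a single decode stands in for "convert and parse".

```python
class EagerHeader:
    """Technique 1: decode at once, record the error, raise on access."""

    def __init__(self, raw, charset="ascii"):
        self.raw = raw
        try:
            self._value = raw.decode(charset)
            self._error = None
        except UnicodeDecodeError as exc:
            self._value = None
            self._error = exc          # metainformation kept for later

    def get(self):
        if self._error is not None:
            raise self._error          # error surfaces only at use time
        return self._value


class LazyHeader:
    """Technique 2: keep the raw bytes, decode on first access."""

    _UNSET = object()

    def __init__(self, raw, charset="ascii"):
        self.raw = raw
        self.charset = charset
        self._value = self._UNSET      # state variable: not converted yet

    def get(self, errors="strict"):
        """Decode on demand; `errors` is the client's chosen strategy."""
        if self._value is self._UNSET:
            # May raise UnicodeDecodeError.  Nothing is cached on
            # failure, so the client can retry with another strategy.
            self._value = self.raw.decode(self.charset, errors)
        return self._value


subject = LazyHeader(b"Caf\xe9 menu")  # storing costs almost nothing
try:
    subject.get()
except UnicodeDecodeError:
    # The client decides, in its own context, how to recover.
    print(subject.get(errors="replace"))
```

Both classes defer the exception until get() is called. The difference is where the work happens: EagerHeader pays for the decode and the stored error whether or not anyone ever asks, while LazyHeader pays only on first access and caches just the result the client settled on.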

I don't see what's un-Pythonic about that.

The un-Pythonic thing is returning defect reports instead of raising errors. There is no way for a simple assignment interface to return an error, because the API for simple assignment doesn't have an in-band signaling mechanism. No "condition code" is left around to be checked. And programmers often omit checking condition codes anyway, due to laziness and hubris ("nothing will go wrong with THIS statement"). So the Pythonic way, AFAIU, is that errors are returned out-of-band via raised exceptions.

Perhaps this is why it is so hard to design a Pythonic interface to Postel-principle email handling. An out-of-band signaling system interrupts the flow of control, while the Postel principle wants to provide best-as-you-can data, and the easiest way to do Postel is to supply the not-quite-perfect data so the normal control flow can handle things. Yet an out-of-band signal can't easily return to the normal control flow, and wrapping tiny try blocks around nearly every email API call is as annoying to the understanding of the control flow as putting all those if statements in the normal control flow to check "condition codes" (error codes, warning codes, defect reports, whatever you want to call them).

Stated another way, it is hard to process potentially not-quite-perfect data without writing complex code. And because the email library wants to simplify the handling of email, it wants to limit the complexity of the client code. But when dealing with not-quite-perfect data, there is a choice of different ways to handle it, and the email library doesn't know the best choice for any particular client application... if it did, then it could make the choices, and the client could be less complex.

The simplest client could be handed only perfectly structured, 100% accurately decodable email messages... its logic would be (simply, and Pythonically):

while True:
    try:
        getEmail()                # raises on any malformed message
    except Exception:
        logBadEmailReceived()
    else:
        processEmail()            # runs only if getEmail() succeeded

In order to allow defect reports to be useful, the client logic must be more complex; getEmail must be expanded to make decisions based on the content of the defect reports. More try statements must be used, at a finer granularity, or more if statements to check defect reports. The former is more Pythonic, the latter less, AFAIU.
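
To illustrate the two client styles, here is a sketch in Python. All of the function names and the HeaderError exception are invented for this example; none of them are email-library APIs.

```python
class HeaderError(Exception):
    """Hypothetical exception for an undecodable header."""


def subject_or_raise(raw):
    """Out-of-band style: bad data interrupts the caller."""
    try:
        return raw.decode("ascii")
    except UnicodeDecodeError as exc:
        raise HeaderError("8-bit data in Subject") from exc


def subject_with_defects(raw):
    """In-band style: best-effort text plus a list of defect strings."""
    defects = []
    if any(b > 0x7F for b in raw):
        defects.append("8-bit data in Subject")
    return raw.decode("latin-1"), defects


def show_with_try(raw):
    # One small try block per API call.
    try:
        return subject_or_raise(raw)
    except HeaderError:
        return "(unreadable subject)"


def show_with_checks(raw):
    # One if statement per API call; easy to forget.
    subject, defects = subject_with_defects(raw)
    if defects:
        return "(unreadable subject)"
    return subject


print(show_with_try(b"Caf\xe9 menu"))
print(show_with_checks(b"Caf\xe9 menu"))
```

Both clients end up with the same answer here. The try version cannot silently skip the problem, while the checking version keeps working, with Latin-1 text in hand, if the programmer leaves the if statement out.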

Perhaps a given client knows how it wants to handle all types of not-quite-perfect data -- should the email library allow rules to be set, so that when a situation arises, it can handle it according to the rules? This simplifies the client logic, at the cost of initialization setup, rules creation and caching, documenting the rules, adding the new APIs that don't seem to exist in today's email library. While this could perhaps simplify many clients, it cannot simplify the email library... it still has to have the code for all the variant perfect and not-quite-perfect data handling techniques, plus the complexity of rule definition and usage.
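
A sketch of what such rule setting might look like, in Python. HeaderReader and its on_8bit parameter are hypothetical, invented only to show the shape of the idea; they do not describe any existing API.

```python
class HeaderReader:
    """Hypothetical reader configured once with a rule for 8-bit bytes."""

    # Every handling technique still has to live in the library.
    RULES = {
        "raise": lambda raw: raw.decode("ascii"),
        "latin-1": lambda raw: raw.decode("latin-1"),
        "replace": lambda raw: raw.decode("ascii", "replace"),
    }

    def __init__(self, on_8bit="raise"):
        if on_8bit not in self.RULES:
            raise ValueError("unknown rule: %r" % (on_8bit,))
        self._decode = self.RULES[on_8bit]

    def read(self, raw):
        """Decode one raw header value according to the chosen rule."""
        return self._decode(raw)


# The client states its preference once, then reads without try/if.
reader = HeaderReader(on_8bit="latin-1")
print(reader.read(b"Caf\xe9 menu"))   # Café menu
```

The client loop stays simple, but the RULES table shows the cost: the library carries every variant, plus the code to select among them.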

--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
