On approximately 10/8/2009 9:27 PM, came the following characters from
the keyboard of Stephen J. Turnbull:
Glenn Linderman writes:
> > Conversions will eventually be done. "Best it were done quickly."
>
> Disagree. Deferring the conversions defers failure issues to the point
> where the code (hopefully) somewhat understands the type of data being
> manipulated, and can then handle it appropriately. Converting up front
> causes errors in things that may never be touched or needed, so the
> error detection and handling is wasteful.
That's theory; my position is based on Mailman practice. Don't believe
me, ask Barry. I also spend most of my OSS time on the
internationalization of XEmacs, and the experience is similar there.
Best to convert everything as early as possible, or admit that you
don't know how.
Emacs is different than email. Either you can read a file to edit it,
or you can't.
The Postel principle for email says to try to do the best you can, for
as much as you can.
> So for headers, which are supposed to be ASCII, or encoded via RFC rules
> to ASCII (no 8-bit chars), the discovery of an 8-bit char should
> produce a defect report, and the header should then simply be converted
> to Unicode as if it were Latin-1 (since there is no other knowledge
> available that could produce a better conversion).
No, that is already corruption. Most clients will assume that string
is valid as a header, because it's valid as a string.
Sure it is corruption. That's why there is a defect report. But the
conversion technique is appropriate, per the Postel principle.
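To make the header proposal concrete, here is a minimal sketch of
"record a defect, then fall back to Latin-1". The function name
decode_header_value and the wording of the defect string are
hypothetical, invented for this illustration; they are not part of
the email library.

```python
def decode_header_value(raw: bytes):
    """Return (text, defects) for a raw header value.

    Hypothetical helper: headers are supposed to be pure ASCII, so any
    8-bit byte is recorded as a defect and the value is then decoded
    as Latin-1, which cannot fail for any byte sequence.
    """
    defects = []
    try:
        text = raw.decode("ascii")
    except UnicodeDecodeError as exc:
        # Record what was wrong and where, instead of raising.
        defects.append(
            f"non-ASCII byte 0x{raw[exc.start]:02x} at offset {exc.start}"
        )
        text = raw.decode("latin-1")
    return text, defects


text, defects = decode_header_value(b"Subject: caf\xe9")
```

The caller always gets usable text, and the defect list says whether
that text can be trusted.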
> And if the result of that is not expected by the client (your
> definition), then the client should either notice the defect report
> and reject it based on that, or attempt to parse it, and reject it
> if it encounters unexpected syntax. After all, this is, for that
> client, "raw user input" (albeit from a remote source) so fully
> error checking the input is appropriate.
No way. That environment would suck to program in. And it's
un-Pythonic: "Errors should never pass silently."
Then the Postel principle is un-Pythonic, and to be Pythonic any
incorrect email should produce an error, and be unreadable. Again, I
mentioned producing a defect report. That is not passing an error silently.
It is still raw user input, and should still be checked for proper
syntax by the client, even if the email is well-formed and conversion
produces no defect report. If you don't want to check proper syntax in
your program inputs, I don't want to use your programs, they will be
insecure.
> Python way. Since the email library is trying to avoid raising
> exceptions in large blocks of its code, it is non-Pythonic
I disagree with that. "Unless explicitly silenced." The strategy
that Barry and I favor is to signal errors lazily. So we *explicitly*
silence errors (at least of the Exception kind) when parsing. If we
can't parse, we look for a part terminator, encapsulate the bad stuff
and move on to the rest of the input. Later, at use time, *if* the
unparsable object is used, *then* the error will be raised, hopefully
with enough metainformation to figure out what to do about it.
So there seem to be two techniques:
1) convert quickly, but don't raise errors... instead create metainformation
structures that record the errors, and raise them later if the converted
data is accessed. Because some kinds of not-quite-perfect data have
alternate handling techniques, either all techniques must be performed
and cached, or *some processing must be deferred until the client can
decide*.
2) Store the data, and convert only if the data is accessed. When
client accesses the data, the exceptions raised allow the client to
choose an appropriate processing technique for handling the
not-quite-perfect data, based on the context of the client, the
importance of that data item, etc. Only the result of that technique
need be cached for future accesses.
With both techniques, the data is given to the email library, and the
errors are not seen until later... potentially the exact same user
experience. But with technique 1, much effort is expended to
convert data, parse data, and create error metainformation ready to
return IF the data is accessed. (Yeah, don't say it, premature
optimization -- I call it design, in this case.) With technique 2, little
effort is required to store the data, create a state variable to
indicate whether it has been converted and parsed, or not, and then IF
(and only IF) the data is accessed, the conversion and parsing must be
done on the first access, and instead of creating and storing
metainformation about the errors, they could just be raised.
I don't see what's un-Pythonic about that.
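A rough sketch of the two techniques side by side may help. The class
names EagerField and LazyField are hypothetical, and a plain bytes
decode stands in for the real conversion and parsing work; this is
only an illustration of the trade-off, not a proposed API.

```python
class EagerField:
    """Technique 1: convert at once, store any error, raise it on access."""

    def __init__(self, raw: bytes, encoding: str = "ascii"):
        self.raw = raw
        self._value = None
        self._error = None
        try:
            self._value = raw.decode(encoding)
        except UnicodeDecodeError as exc:
            # Error metainformation is built and kept whether or not
            # anyone ever looks at this field.
            self._error = exc

    @property
    def value(self) -> str:
        if self._error is not None:
            raise self._error
        return self._value


class LazyField:
    """Technique 2: store raw data, convert on first access, cache result."""

    def __init__(self, raw: bytes, encoding: str = "ascii"):
        self.raw = raw
        self.encoding = encoding
        self._converted = False  # state variable: converted yet?
        self._value = None

    @property
    def value(self) -> str:
        if not self._converted:
            # May raise; nothing is stored about the failure, so the
            # client can catch it and pick its own fallback.
            self._value = self.raw.decode(self.encoding)
            self._converted = True
        return self._value
```

In both cases construction never raises and the client sees the
exception only on access; the difference is where the work happens
and what has to be stored in the meantime.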
The un-Pythonic thing is returning defect reports instead of raising
errors. There is no way for a simple assignment interface to return an
error, because the API for simple assignment doesn't have an in-band
signaling mechanism. No "condition code" left around to be checked.
And programmers often omit checking condition codes anyway, due to
laziness and hubris ("nothing will go wrong with THIS statement"). So the
Pythonic way, AFAIU, is that errors are returned out-of-band via raised
exceptions.
Perhaps this is why it is so hard to design a Pythonic interface to the
Postel principle email handling... an out-of-band signaling system
interrupts the flow of control, and the Postel principle wants to
provide best-as-you-can data... and the easiest way to do Postel is to
supply the not-quite-perfect data so the normal control flow can handle
things, yet an out-of-band signal can't easily return to the normal
control flow, and wrapping tiny try blocks around nearly every email API
call is as annoying to the understanding of the control flow as putting
all those if statements in the normal control flow to check "condition
codes" (error codes, warning codes, defect reports, whatever you want to
call them).
Stated another way, it is hard to process potentially not-quite-perfect
data without writing complex code. And because the email library wants
to simplify the handling of email, it wants to limit the complexity of
the client code. But when dealing with not-quite-perfect data, there is
a choice of different ways to handle it, and the email library doesn't
know the best choice for any particular client application... if it did,
then it could make the choices, and the client could be less complex.
The simplest client could be handed only perfectly structured, 100%
accurately decodable email messages... its logic would be (simply, and
Pythonically):
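    while True:
        try:
            getEmail()             # raises if the message is not perfect
        except Exception:
            logBadEmailReceived()  # record the rejection and carry on
        else:
            processEmail()         # reached only for well-formed mail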
In order to allow defect reports to be useful, the client logic must be
more complex; getEmail must be expanded to make decisions based on the
content of the defect reports. More try statements must be used, at a
finer granularity, or more if statements to check defect reports. The
former is more Pythonic, the latter less, AFAIU.
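For comparison, the "if statement" style looks roughly like this with
the defects list that parsed message objects in the standard library
carry. The function name classify and the returned strings are
hypothetical; the multipart-without-boundary input is just one
convenient way to provoke a defect.

```python
import email


def classify(source: str) -> str:
    """Parse a message and branch on its defect list (condition-code style)."""
    msg = email.message_from_string(source)
    if msg.defects:
        # The parser did not raise; the client has to remember to look.
        names = ", ".join(type(d).__name__ for d in msg.defects)
        return "defective: " + names
    return "ok"


good = classify("Subject: hi\n\nbody\n")
bad = classify("Content-Type: multipart/mixed\n\nbody\n")
```

Nothing forces the client to check msg.defects, which is exactly the
weakness of condition codes described above.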
Perhaps a given client knows how it wants to handle all types of
not-quite-perfect data -- should the email library allow rules to be
set, so that when a situation arises, it can handle it according to the
rules? This simplifies the client logic, at the cost of initialization
setup, rules creation and caching, documenting the rules, adding the new
APIs that don't seem to exist in today's email library. While this
could perhaps simplify many clients, it cannot simplify the email
library... it still has to have the code for all the variant perfect and
not-quite-perfect data handling techniques, plus the complexity of rule
definition and usage.
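As a very rough sketch of what such rules might look like: everything
here (RuleSet, set_rule, handle, the situation names, and the handler
functions) is hypothetical and invented for illustration. The client
registers one handler per situation up front, and the library would
then consult the rules when it hits not-quite-perfect data.

```python
from typing import Callable, Dict

Handler = Callable[[bytes], str]


def strict(raw: bytes) -> str:
    """Reject anything that is not pure ASCII by raising."""
    return raw.decode("ascii")


def as_latin1(raw: bytes) -> str:
    """Best-effort conversion: treat every byte as Latin-1."""
    return raw.decode("latin-1")


def replace_bad(raw: bytes) -> str:
    """Keep the ASCII parts, substitute U+FFFD for each bad byte."""
    return raw.decode("ascii", errors="replace")


class RuleSet:
    """Maps a named situation to the handler the client chose for it."""

    def __init__(self, default: Handler = strict):
        self._rules: Dict[str, Handler] = {}
        self._default = default

    def set_rule(self, situation: str, handler: Handler) -> None:
        self._rules[situation] = handler

    def handle(self, situation: str, raw: bytes) -> str:
        # Situations without a rule fall back to the default handler.
        return self._rules.get(situation, self._default)(raw)


rules = RuleSet()
rules.set_rule("non_ascii_header", as_latin1)
rules.set_rule("non_ascii_body", replace_bad)
header = rules.handle("non_ascii_header", b"caf\xe9")
```

The client code shrinks to the setup calls, but every handler still
has to exist somewhere, so the total complexity moves into the
library rather than disappearing.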
--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking