Re: [Email-SIG] fixing the current email module

Glenn Linderman Fri, 09 Oct 2009 13:28:48 -0700

On approximately 10/9/2009 8:10 AM, came the following characters from the keyboard of Stephen J. Turnbull:

Glenn Linderman writes:

 > Emacs is different than email.  Either you can read a file to edit it,
 > or you can't.


*sigh* Emacs is as powerful a programming environment as Python, and
applications regularly deal with network streams (HTTP, NNTP, and SMTP
most commonly, but also raw X protocol and any kind of socket
supported by the platform).  So, yes, it's different from email,
because it's *far* more general.  That's precisely why I appreciate
Bill's concerns about non-email usage.

OK, yes, Emacs is an operating system. I am an Emacs user. And yes, I know Emacs can read email (I used it to read and write email, but found it seriously lacking for the way I handle email, and annoying that the email buffers and edit buffers were all in the same buffer pool, and I quit using it for email). And I know it can be programmed, and I've done a little of that, but I hate Lisp, so I mostly Google for the packages that do what I need, and don't try to create my own.

 > The Postel principle for email says to try to do the best you can,
 > for as much as you can.

Actually, it doesn't.  It says be lenient in what you accept, strict
in what you emit.  You accept it ... but you don't have to do
anything with it except preserve it verbatim for whoever wants it.

Yes, that is what it says, I agree. But unless you do the best you can, for as much as you can, no one is going to want it, so they are basically the same.

 > > > produce a defect report, but then simply converted to Unicode as if it
 > > > were Latin-1 (since there is no other knowledge available that could
 > > > produce a better conversion).
 > >
 > > No, that is already corruption.  Most clients will assume that string
 > > is valid as a header, because it's valid as a string.
 >
 > Sure it is corruption.  That's why there is a defect report.  But
 > the conversion technique is appropriate, per the Postel principle.

Actually, I would say you are emitting leniently, in violation of the
Postel principle.

You can say that, but I don't have to believe it. I'm talking about accepting; the message has arrived, it is here, the client is trying to look at it, and I'm talking about ways the client can look at not-quite-perfect data, knowing that it is not quite perfect, but still being able to see it. I'm not at all talking about emitting data. You seem to be calling the email package helping the client to accept not-quite-perfect data, as a form of emitting data. It is not.

You don't know what the client will do, they may
eat it in a single gulp without looking at it.  Thus you should avoid
converting anything that you don't know what it is (unless
specifically asked to do your best).

The email package cannot police the client... if it chooses to "eat it in a single gulp without looking at it" then it may get indigestion. I never suggested that "converting to Unicode as if it were Latin-1" should be done without informing the client, or being requested by the client to do that via a special API call... I was only talking about an appropriate method of doing conversions in the presence of not-quite-perfect data input, so that the client, and possibly even a human, can try to make some sense out of the not-quite-perfect data.
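
To make that concrete, here is a minimal sketch (Python 3) of the kind of conversion I mean. The names DecodedHeader and decode_header_bytes are hypothetical and are not part of the email package; the only assumption is that the client has to ask for the lenient behavior explicitly, and that a defect is recorded whenever the Latin-1 fallback is used.

```python
from dataclasses import dataclass, field


@dataclass
class DecodedHeader:
    """Hypothetical result type: the decoded text plus any defects found."""
    text: str
    defects: list = field(default_factory=list)


def decode_header_bytes(raw: bytes, lenient: bool = False) -> DecodedHeader:
    """Decode raw header bytes, falling back to Latin-1 only on request."""
    try:
        return DecodedHeader(raw.decode("ascii"))
    except UnicodeDecodeError as exc:
        if not lenient:
            # The client did not ask for a best-effort conversion.
            raise
        # Latin-1 maps every byte to a code point, so this cannot fail, and
        # the original bytes can be recovered with .encode("latin-1").
        return DecodedHeader(
            raw.decode("latin-1"),
            [f"non-ASCII byte at offset {exc.start}"],
        )
```

A client that calls decode_header_bytes(raw, lenient=True) gets something it can show to a human, together with a defect entry saying the data was not quite perfect; a client that leaves lenient at its default gets the exception instead.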

 > Again, I mentioned producing a defect report.  That is not passing
 > an error silently.

But if I access that Unicode object without looking at the defect
report, you *will* pass the error silently.  OTOH, if I look at the
defect report, I won't access the Unicode object.

If those are the only two choices you see, then you are not doing your whole job.

If you ignore defect reports, you are ignorant (blunt, but not intended to be offensive).
If you treat all defect reports as fatal errors, then you are not being lenient in what you accept (non-Postel).

 > It is still raw user input, and should still be checked for proper
 > syntax by the client,

Nonsense.  The email module had better know a lot more about syntax
than the client.  If it doesn't, whack it with a 2x4 until it learns!

I think we are talking at cross purposes here. I find it quite difficult to follow where you cross the boundary between talking about one sort of email package client, and then switch to another type, or switch to the responsibilities of the email package.

A client which is an MUA is just going to present the best possible data to a human user, and is done. A client which is an email archiver preserves the data for presenting via other MUAs.
An application which is using email as a transport has specific goals, which require specific content. You were mentioning clients. It is this sort of client I thought you were talking about, and to which I responded. If such a client doesn't validate the syntax of that content, it isn't much of an application. The email module does not, and cannot, understand the application domain; it can only validate that the message has proper (or improper) structure. The transported content is fully the responsibility of the application to validate, parse, and manipulate. The email module may detect if the transport caused garbling in the structure of the message, and may be able to warn the application about such garbling. But that may not prevent the application from finding its content within even a garbled email, and so it may still be able to validate, parse, and manipulate that content. Such clients may transfer content either in headers or in MIME parts... in any case, whatever client-specific content is expected in those headers or MIME parts should be validated by the client.

 > produces no defect report.  If you don't want to check proper syntax in
 > your program inputs, I don't want to use your programs, they will be
 > insecure.

So you're saying that every program that uses the email module should
reproduce 100% of the functionality of the email module's parser, or
it's insecure.  And you imply that's an excuse for passing corrupt
data to any client that asks for it.

I disagree.

I'm glad you disagree with what you thought I was saying, because that isn't what I was saying, and I also disagree with your paraphrase of what I was saying. The email package should parse email. Where it finds not-quite-perfect data, the client may get involved to choose a path for interpreting the not-quite-perfect data... or to reject the not-quite-perfect data.

Once the data from the email is discovered, then the client must operate on the data. An MUA would simply display it to a human. Other clients would attempt to interpret the content. The interpretation of the content requires the client to parse, validate the syntax of, and manipulate the content. An example would be a program that does appointments via email. If it finds an appointment in a known format, it enters it into the calendar. The email package knows nothing about appointments or calendars (of the sort that hold appointments). It cannot help, only the client can do that part of the job.

 > So there seem to be two techniques:

Whatever gave you that idea?


I'm not sure what you are asking here.

 > 2) Store the data, and convert only if the data is accessed.

 > With technique 2, little effort is required to store the data,
 > create a state variable to indicate whether it has been converted

Why do that?  It's always "False" in technique 2.

The first time it is always false. Subsequent requests can leverage the work done by the first request, if results were created and cached.
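
As a rough illustration of why the state variable is useful, here is a small Python 3 sketch of technique 2. LazyField and its attributes are hypothetical names, not anything in the email package, and decoding with a known charset stands in for the real conversion and parsing work.

```python
class LazyField:
    """Stores raw bytes; decodes on first access and caches the result."""

    def __init__(self, raw: bytes, charset: str = "utf-8"):
        self.raw = raw
        self.charset = charset
        self._converted = False  # the state variable from technique 2
        self._text = None
        self.decode_calls = 0    # instrumentation for this demo only

    @property
    def text(self) -> str:
        if not self._converted:
            self.decode_calls += 1
            # If decoding fails, the exception propagates to the caller
            # and the flag stays False.
            self._text = self.raw.decode(self.charset)
            self._converted = True
        return self._text
```

The flag is False only until the first successful access; after that, every request returns the cached text without repeating the work.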

 > and parsed, or not, and then IF (and only IF) the data is accessed,
 > the conversion and parsing must be done on the first access, and
 > instead of creating and storing metainformation about the errors,
 > they could just be raised.

No, they cannot just be raised.  If you just raise the error, then the
next time you try to access unparsed data, you'll hit the error
again.  If you use the same handler you did before, you're in an
infloop.  So you need a second handler to do things differently this
time or a flag ... but it's unclear to me that that flag can be a
boolean.  So you may as well store the defect list and information
about where to restart.

From the point of view of the email package, the errors can just be raised. Then the client can make choices, and use other APIs or other parameters to the API to direct the email package to attempt a different technique to access the data. If the technique is successful, then progress is made. If unsuccessful, another error is raised by the different technique. If there are more techniques, repeat. When out of techniques, and no success, then the client needs to remember (possibly with the help of APIs of the email package) that it cannot interpret this data in a useful manner. If it then continues to attempt to access the data using failed techniques, and goes into an infinite loop, then the client has a bug.
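
The loop I have in mind on the client side looks something like the following Python 3 sketch. Everything here is hypothetical (decode_with_fallbacks, Client, and the choice of charsets as the "techniques"); the point is only that each technique raises on failure, the client moves on to the next one, and it remembers when it has run out.

```python
def decode_with_fallbacks(raw: bytes, charsets=("ascii", "utf-8")):
    """Try each charset in turn; return (text, charset) or (None, None)."""
    for charset in charsets:
        try:
            return raw.decode(charset), charset
        except UnicodeDecodeError:
            continue  # this technique failed; move on to the next one
    return None, None


class Client:
    def __init__(self):
        # Keys of data that no available technique could interpret.
        self.undecodable = set()

    def read(self, key, raw: bytes):
        if key in self.undecodable:
            return None  # do not retry techniques that already failed
        text, _ = decode_with_fallbacks(raw)
        if text is None:
            self.undecodable.add(key)
        return text
```

Because the client records the failure, a second request for the same data returns immediately instead of walking through the same failed techniques again, so there is no infinite loop.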

 > So the Pythonic way, AFAIU, is that errors are returned out-of-band
 > via raised exceptions.

Sure.  But what you're missing is that "Neither rain, nor snow, nor
dark of night may stop the Parser on her appointed rounds."

I haven't forgotten that, but clearly we haven't been communicating effectively. That may be partly my fault, partly because I'm relatively new to Python and to the email package (having only experimented with it using Python 2.6, not coded inside it, to date), but I'm trying... I'm hoping to write some email processing programs using the Python email package, and so I do have a strong interest in this topic. I'm hoping I don't have to start from scratch and write my own email package, because Python's isn't functional enough, or doesn't perform well enough. Being new to Python, I've chosen to focus on building my applications with Python 3, understanding that there are fewer fully functional pieces in that arena to date, and since email is one that has some rough edges because of the Unicode strings, I'm trying to participate where I can.

It is not
easy to write parsers, but I'll tell you one thing: it's orders of
magnitude harder to write a parser that starts in the middle and works
outward, than one that starts at the beginning and works forward to
the end.


Yes, I have learned that in my 34 years of programming.  I agree.

So it's OK to write a lazy parser, but it must retain enough state so
that it can work forward until the end.  Because you don't know that
the client will not request the last character of the message, you
need to be able to try to get it, no matter what happened to the first
10GB of the message.  And if an exception occurs, it must be handled
by the parser itself; if not, you put the poor thing in the position
of starting over at the beginning (that way lies the madness of
infloops), or trying to start a parse in the middle and work out.

Are you speaking about parsing the message into MIME parts, or parsing a particular MIME part contained within the message, or both?


--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
