On approximately 10/9/2009 8:10 AM, came the following characters from the keyboard of Stephen J. Turnbull:
Glenn Linderman writes:

 > Emacs is different than email. Either you can read a file to edit it,
 > or you can't.

*sigh* Emacs is as powerful a programming environment as Python, and
applications regularly deal with network streams (HTTP, NNTP, and SMTP
most commonly, but also raw X protocol and any kind of socket
supported by the platform).  So, yes, it's different from email,
because it's *far* more general.  That's precisely why I appreciate
Bill's concerns about non-email usage.

OK, yes, Emacs is an operating system. I am an Emacs user. And yes, I know Emacs can read email (I used it to read and write email, but found it seriously lacking for the way I handle email, and annoying that the email buffers and edit buffers were all in the same buffer pool, and I quit using it for email). And I know it can be programmed, and I've done a little of that, but I hate Lisp, so I mostly Google for the packages that do what I need, and don't try to create my own.

 > The Postel principle for email says to try to do the best you can,
 > for as much as you can.

Actually, it doesn't.  It says be lenient in what you accept, strict
in what you emit.  You accept it ... but you don't have to do
anything with it except preserve it verbatim for whoever wants it.

Yes, that is what it says, I agree. But unless you do the best you can, for as much as you can, no one is going to want it, so they are basically the same.

 > > > produce a defect report, but then simply converted to Unicode as if it
 > > > were Latin-1 (since there is no other knowledge available that could
 > > > produce a better conversion).
 > >
 > > No, that is already corruption.  Most clients will assume that string
 > > is valid as a header, because it's valid as a string.
 >
 > Sure it is corruption. That's why there is a defect report. But
 > the conversion technique is appropriate, per the Postel principle.

Actually, I would say you are emitting leniently, in violation of the
Postel principle.

You can say that, but I don't have to believe it. I'm talking about accepting; the message has arrived, it is here, the client is trying to look at it, and I'm talking about ways the client can look at not-quite-perfect data, knowing that it is not quite perfect, but still being able to see it. I'm not at all talking about emitting data. You seem to be treating the email package's help in letting the client accept not-quite-perfect data as a form of emitting data. It is not.

You don't know what the client will do, they may
eat it in a single gulp without looking at it.  Thus you should avoid
converting anything that you don't know what it is (unless
specifically asked to do your best).


The email package cannot police the client... if it chooses to "eat it in a single gulp without looking at it" then it may get indigestion. I never suggested that "converting to Unicode as if it were Latin-1" should be done without informing the client, or being requested by the client to do that via a special API call... I was only talking about an appropriate method of doing conversions in the presence of not-quite-perfect data input, so that the client, and possibly even a human, can try to make some sense out of the not-quite-perfect data.
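
To make the conversion concrete, here is a minimal sketch of an explicitly requested lenient decode. The helper name decode_leniently and the wording of the defect strings are invented for illustration; nothing here is an existing email package API. It assumes the client calls it deliberately and then inspects the returned defect list.

```python
def decode_leniently(raw: bytes, declared_charset: str = "ascii"):
    """Return (text, defects); never raises on undecodable bytes.

    Hypothetical helper for illustration, not part of the email package.
    """
    defects = []
    try:
        return raw.decode(declared_charset), defects
    except (UnicodeDecodeError, LookupError) as exc:
        # Record why the declared charset failed instead of hiding it.
        defects.append(f"could not decode as {declared_charset}: {exc}")
    # Latin-1 maps every byte value to a code point, so this cannot fail.
    return raw.decode("latin-1"), defects


text, defects = decode_leniently(b"Subject: caf\xe9")
print(text)     # best-effort text, still readable by a human
print(defects)  # non-empty, so the client knows the text is suspect
```

The caller gets both the best-effort text and the evidence that it is only best-effort, which is the "accept leniently, but say so" behaviour described above.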


 > Again, I mentioned producing a defect report.  That is not passing
 > an error silently.

But if I access that Unicode object without looking at the defect
report, you *will* pass the error silently.  OTOH, if I look at the
defect report, I won't access the Unicode object.

If those are the only two choices you see, then you are not doing your whole job.

If you ignore defect reports, you are ignorant (blunt, but not intended to be offensive). If you treat all defect reports as fatal errors, then you are not being lenient in what you accept (non-Postel).
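
A middle path between ignoring defects and treating them as fatal might look like the sketch below. It uses the standard-library email package as it exists today, which may differ from the 2009 version under discussion; the sample message and address are invented.

```python
import email
from email import errors

# Invented sample: the header promises a multipart boundary the body never uses.
RAW = (
    b"From: sender@example.com\r\n"
    b'Content-Type: multipart/mixed; boundary="XXX"\r\n'
    b"\r\n"
    b"The promised boundary never appears in this body.\r\n"
)

msg = email.message_from_bytes(RAW)

# Middle path: look at the defects, then decide what to do with the data.
for defect in msg.defects:
    print("defect:", type(defect).__name__)

# The text is still reachable, so a lenient client can show or log it.
body = msg.get_payload()
print(body)
```

The parser finishes its work, the defects are available for inspection, and the client still has the data in hand when it decides how much to trust it.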

 > It is still raw user input, and should still be checked for proper
 > syntax by the client,

Nonsense.  The email module had better know a lot more about syntax
than the client.  If it doesn't, whack it with a 2x4 until it learns!

I think we are talking at cross purposes here. I find it quite difficult to follow where you cross the boundary between talking about one sort of email package client, and then switch to another type, or switch to the responsibilities of the email package.

A client which is an MUA is just going to present the best possible data to a human user, and is done. A client which is an email archiver preserves the data for presenting via other MUAs. An application which is using email as a transport has specific goals, which require specific content. You were mentioning clients. It is this sort of client I thought you were talking about, and to which I was responding. If such a client doesn't validate the syntax of that content, it isn't much of an application. The email module does not, and cannot, understand the application domain; it can only validate that the message has proper (or improper) structure. The transported content is fully the responsibility of the application to validate, parse, and manipulate. The email module may detect whether the transport caused garbling in the structure of the message, and may be able to warn the application about such garbling. But that may not prevent the application from finding its content within even a garbled email, and so it may still be able to validate, parse, and manipulate that content. Such clients may transfer content either in headers or in MIME parts... in any case, whatever client-specific content is expected in those headers or MIME parts should be validated by the client.


 > produces no defect report. If you don't want to check proper syntax in
 > your program inputs, I don't want to use your programs, they will be
 > insecure.

So you're saying that every program that uses the email module should
reproduce 100% of the functionality of the email module's parser, or
it's insecure.  And you imply that's an excuse for passing corrupt
data to any client that asks for it.

I disagree.

I'm glad you disagree with what you thought I was saying, because that isn't what I was saying, and I also disagree with your paraphrase of what I was saying. The email package should parse email. Where it finds not-quite-perfect data, the client may get involved to choose a path for interpreting the not-quite-perfect data... or to reject the not-quite-perfect data.

Once the data from the email is discovered, then the client must operate on the data. An MUA would simply display it to a human. Other clients would attempt to interpret the content. The interpretation of the content requires the client to parse, validate the syntax of, and manipulate the content. An example would be a program that does appointments via email. If it finds an appointment in a known format, it enters it into the calendar. The email package knows nothing about appointments or calendars (of the sort that hold appointments). It cannot help, only the client can do that part of the job.
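
The division of labour in the appointment example could be sketched as follows. The X-Appointment header and its ISO 8601 format are assumptions made up for this illustration; the email package parses the message structure, and the client alone validates the appointment.

```python
import email
from datetime import datetime


def extract_appointment(raw_text: str):
    """Return a datetime, or None if no valid appointment is present."""
    # Message structure is the email package's job.
    msg = email.message_from_string(raw_text)
    # "X-Appointment" is a hypothetical, application-defined header.
    value = msg.get("X-Appointment")
    if value is None:
        return None
    try:
        # Content validation is the client's job: the email package
        # knows nothing about appointments or calendars.
        return datetime.fromisoformat(value.strip())
    except ValueError:
        return None


print(extract_appointment("X-Appointment: 2009-10-09T08:10:00\n\nSee you then.\n"))
print(extract_appointment("X-Appointment: next tuesday\n\nSee you then.\n"))
```

A structurally perfect message can still carry an appointment the client must reject, which is why neither layer can do the other's checking.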


 > So there seem to be two techniques:

Whatever gave you that idea?

I'm not sure what you are asking here.

 > 2) Store the data, and convert only if the data is accessed.

 > With technique 2, little effort is required to store the data,
 > create a state variable to indicate whether it has been converted

Why do that?  It's always "False" in technique 2.

The first time it is always false. Subsequent requests can leverage the work done by the first request, if results were created and cached.
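
A minimal sketch of technique 2 with caching, assuming a hypothetical LazyHeader class (not an email package type): the raw bytes are stored cheaply, conversion happens on the first access, and later accesses reuse the cached result. A decoding error is simply raised to the caller and nothing is cached.

```python
class LazyHeader:
    """Hypothetical container: keeps raw bytes, decodes on first access only."""

    _UNSET = object()  # sentinel meaning "not converted yet"

    def __init__(self, raw: bytes, charset: str = "utf-8"):
        self.raw = raw
        self.charset = charset
        self._text = self._UNSET
        self.decode_calls = 0  # kept only to demonstrate the caching

    @property
    def text(self) -> str:
        if self._text is self._UNSET:
            self.decode_calls += 1
            # May raise UnicodeDecodeError; in that case nothing is cached.
            self._text = self.raw.decode(self.charset)
        return self._text


header = LazyHeader(b"caf\xc3\xa9")
print(header.decode_calls)  # no work done at storage time
print(header.text)
print(header.text)
print(header.decode_calls)  # the second access reused the cached result
```

The sentinel plays the role of the state variable: it is "not converted" until the first successful access and "converted" afterwards.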

 > and parsed, or not, and then IF (and only IF) the data is accessed,
 > the conversion and parsing must be done on the first access, and
 > instead of creating and storing metainformation about the errors,
 > they could just be raised.

No, they cannot just be raised.  If you just raise the error, then the
next time you try to access unparsed data, you'll hit the error
again.  If you use the same handler you did before, you're in an
infloop.  So you need a second handler to do things differently this
time or a flag ... but it's unclear to me that that flag can be a
boolean.  So you may as well store the defect list and information
about where to restart.

From the point of view of the email package, the errors can just be raised. Then the client can make choices, and use other APIs or other parameters to the API to direct the email package to attempt a different technique to access the data. If the technique is successful, then progress is made. If unsuccessful, another error is raised by the different technique. If there are more techniques, repeat. When out of techniques, and no success, then the client needs to remember (possibly with the help of APIs of the email package) that it cannot interpret this data in a useful manner. If it then continues to attempt to access the data using failed techniques, and goes into an infinite loop, then the client has a bug.
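
That sequence can be sketched from the client's side. The technique functions, the Unreadable exception, and the failed set are all invented for illustration; the point is only that each technique raises on failure and the client remembers data it could not interpret, so it never loops.

```python
def strict_ascii(raw: bytes) -> str:
    return raw.decode("ascii")


def strict_utf8(raw: bytes) -> str:
    return raw.decode("utf-8")


# Hypothetical list of techniques, tried in order.
TECHNIQUES = [strict_ascii, strict_utf8]


class Unreadable(Exception):
    """Raised by the client once every technique has failed."""


def read_with_fallbacks(raw: bytes, failed: set) -> str:
    if raw in failed:
        # Remembered failure: do not retry the same techniques forever.
        raise Unreadable("already known to be unreadable")
    for technique in TECHNIQUES:
        try:
            return technique(raw)
        except UnicodeDecodeError:
            continue  # this technique raised; try the next one
    failed.add(raw)
    raise Unreadable("no technique could interpret the data")


failed = set()
print(read_with_fallbacks(b"caf\xc3\xa9", failed))
try:
    read_with_fallbacks(b"\xff\xfe", failed)
except Unreadable as exc:
    print("gave up:", exc)
```

If the client dropped the failed set and retried in a loop, the infinite loop would be the client's bug, as argued above.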


 > So the Pythonic way, AFAIU, is that errors are returned out-of-band
 > via raised exceptions.

Sure.  But what you're missing is that "Neither rain, nor snow, nor
dark of night may stop the Parser on her appointed rounds."

I haven't forgotten that, but clearly we haven't been communicating effectively. That may be partly my fault, partly because I'm relatively new to Python and to the email package (having only experimented with it using Python 2.6, not coded inside it, to date), but I'm trying... I'm hoping to write some email processing programs using the Python email package, and so I do have a strong interest in this topic. I'm hoping I don't have to start from scratch and write my own email package, because Python's isn't functional enough, or doesn't perform well enough. Being new to Python, I've chosen to focus on building my applications with Python 3, understanding that there are fewer fully functional pieces in that arena to date, and since email is one that has some rough edges because of the Unicode strings, I'm trying to participate where I can.

It is not
easy to write parsers, but I'll tell you one thing: it's orders of
magnitude harder to write a parser that starts in the middle and works
outward, than one that starts at the beginning and works forward to
the end.

Yes, I have learned that in my 34 years of programming.  I agree.

So it's OK to write a lazy parser, but it must retain enough state so
that it can work forward until the end.  Because you don't know that
the client will not request the last character of the message, you
need to be able to try to get it, no matter what happened to the first
10GB of the message.  And if an exception occurs, it must be handled
by the parser itself; if not, you put the poor thing in the position
of starting over at the beginning (that way lies the madness of
infloops), or trying to start a parse in the middle and work out.

Are you speaking about parsing the message into MIME parts, or parsing a particular MIME part contained within the message, or both?

--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

_______________________________________________
Email-SIG mailing list
Email-SIG@python.org