On approximately 10/10/2009 6:59 AM, came the following characters from
the keyboard of Stephen J. Turnbull:
I'm running out of time to work on this (yeah, I know it's the
weekend, but my life is like that lately). I think we're converging,
though, so I'd like to try and tie some of those ends together.
I think we are converging too... mostly terminology issues and
assumptions were causing a bit of misunderstanding.
Glenn Linderman writes:
> On approximately 10/9/2009 8:10 AM, came the following characters from
> the keyboard of Stephen J. Turnbull:
> > Actually, I would say you are emitting leniently, in violation of the
> > Postel principle.
>
> You can say that, but I don't have to believe it. I'm talking about
> accepting; the message has arrived, it is here, the client is trying to
> look at it, and I'm talking about ways the client can look at
> not-quite-perfect data, knowing that it is not quite perfect, but still
> being able to see it. I'm not at all talking about emitting data.
It would be indeed, if the corrupt data is stored in the place where
correctly decoded data normally is stored, and is accessible in the
same way. But I gather that's not what you were talking about, my
mistake.
Well, the client tells us where to store it, and we can't prevent it
from being the same place. But accessible in the same way? Not. Some
extra parameter or different API would surely be required to get
not-quite-perfect data.
> You seem to be calling the email package helping the client to
> accept not-quite-perfect data, as a form of emitting data. It is
> not.
No, I was confused by the way you wrote. Saving the data *somewhere*
is absolutely necessary; not losing data is the #1 commandment of
low-level mail processing. Surely the email module is subject to that
commandment. *Nobody* is talking about losing any data yet, except
Barry indirectly when he says that some people think about giving up on
invertibility (often called "idempotency"), and even he is quite
adamant that he's not going to give up on that.
So when you wrote about saving and converting to text form, without
mentioning the specific APIs, I assumed you meant the "mainline"
APIs for parsing and accessing parts of a correctly formatted message.
Mostly, I hadn't bothered about APIs yet; I'm not yet very familiar with
the existing ones, because neither nPOPuk nor SeaMonkey nor Thunderbird,
the only email programs that I have looked at source code for, use the
Python email package! So while I'm reasonably familiar with the RFCs,
and quite familiar with nPOPuk source, and have looked at a small
fraction of the SeaMonkey/Thunderbird source code (and been amazed at
how big it is), and have examined email from a large variety of sources
comparing it to the RFCs to see where it goes wrong and why it doesn't
display in SeaMonkey/Thunderbird the same way as in Outlook/Outlook
Express (or other programs), and have found Outlook 2000 and Apple Mail
to be quite creative in interpreting the RFCs, I'm new to the Python
email package.
> The email package cannot police the client... if it chooses to "eat it
> in a single gulp without looking at it" then it may get indigestion. I
> never suggested that "converting to Unicode as if it were Latin-1"
> should be done without informing the client, or being requested by the
> client to do that via a special API call...
Well, maybe I misread it, but it certainly looked like that to me. I
would not object to that special API call defaulting to ISO 8859/1.
> If you ignore defect reports, you are ignorant (blunt, but not intended
> to be offensive).
What I worried about is that if defect reports are present, *but
displayable data is also present*, programmers *will* simply display
it, for example in producing a prototype program. It will be
impossible to determine without very close analysis of that program
that an early version became a production version without adding
appropriate checks. In practice, this bug will be discovered when
some end user's installation breaks.
It seems that you agree with this, and because the special API call is
necessary, it will be easy to identify whether proper care is being
taken or not. Right?
Well, yes and no.
I think that the email package should require that some special action
needs to be taken by the client to request not-quite-perfect data,
either a special parameter value, or different API, etc. But there is
nothing that says that some client might not pass that all the time, and
ignore the defect reports. Whether that is easy to identify or not, and
whether the email package wants to require that the normal APIs be tried
before the not-quite-perfect APIs are issues for discussion.
Ultimately, the email package cannot enforce that proper care is taken
by the client; only code reviews of the client can encourage that.
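To make the "special action" idea concrete, here is a minimal sketch. The
names decode_strict, decode_lenient and DecodeDefect are hypothetical and
are not part of the email package; the only assumption is that the lenient
path must be asked for explicitly and always hands back a defect list
alongside the text.

```python
class DecodeDefect:
    """Hypothetical record describing why the lenient path was needed."""

    def __init__(self, charset, reason):
        self.charset = charset
        self.reason = reason


def decode_strict(raw, charset):
    """The 'normal' API: raises if the bytes do not match the charset."""
    return raw.decode(charset, errors="strict")


def decode_lenient(raw, charset, fallback="latin-1"):
    """The 'special' API: always returns (text, defects).

    Latin-1 maps every byte to a code point, so the fallback never
    loses data even when the declared charset is wrong or unknown.
    """
    try:
        return raw.decode(charset, errors="strict"), []
    except (UnicodeDecodeError, LookupError) as exc:
        return raw.decode(fallback), [DecodeDefect(charset, str(exc))]
```

A client that calls decode_lenient everywhere and throws away the second
element of the result is still possible, but at least it is easy to spot
in a code review.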
> > > It is still raw user input, and should still be checked for proper
> > > syntax by the client,
> >
> > Nonsense. The email module had better know a lot more about syntax
> > than the client. If it doesn't, whack it with a 2x4 until it learns!
>
> I think we are talking at cross purposes here. I find it quite
> difficult to follow where you cross the boundary between talking about
> one sort of email package client, and then switch to another type, or
> switch to the responsibilities of the email package.
Excuse me? The "raw user input" you referred to above is material
that the client software receives from the email package. The email
package should give it to the client in the "normal" (convenient) way
only if it can certify that it conforms to the appropriate standard.
Yes, agreed. And a special way or ways to get various algorithms for
attempting to interpret not-quite-perfect data, when the client thinks
that might be useful. Then the client has "tweaked" user input.
That standard should be specified in the API documentation. Any more
detailed structure, of course, is the responsibility of the client.
Right. And it is the more detailed structure that I was referring to...
Even if the structure of the email is incorrect, if the client can find
its input among the various attempts to obtain data from the
not-quite-perfect email message, and can validate and check its input,
it may choose to process it even if the email message is imperfect... it
should probably note somewhere that the email message from which the
data was obtained was not perfect, but really, that is up to the client
to figure out, based on its requirements.
> An application which is using email as a transport, has specific goals,
> which require specific content. You were mentioning clients.
I've already said that when I speak of an MUA, I write "MUA". In
speaking of the calling program, which might even be a user running
the module via the Python interpreter, I write "client". It's a very
convenient way to describe the user of an API, in contrast to the
provider of the API (the implementation).
Yep, so I think my "application" and your "client" are the same thing.
I'm trying to use your term as I continue responding in these threads;
it is reasonable.
> If such a client doesn't validate the syntax of that content, it
> isn't much of an application.
If that MUA or email application uses RFC 822 addresses, it should be
able to rely on the email module to parse those addresses correctly,
or provide a defect report. One might even go so far as to suggest
that it be able to parse the (non-RFC, but very common) "+" notation
for separating the "mailbox" from "additional data" used for VERP and
challenge-response applications. That would have to be documented,
but if so documented client applications like the MUA should be able
to rely on it (and you can bet many will).
Hmm. This is an interesting digression...
"+", according to the RFCs, is just another of the legal characters that
can be found before the @ in an unquoted email address... the list is
!#$%&'*+-/=?^_`{}|~ in addition to the alphanumerics.
How a particular email server interprets the "stuff before the @" is
pretty much up to it... so as long as it does something appropriate, it
can interpret all or a fraction of it as a mailbox name, or it could
intuit a mailbox name from the body content if it wants, or even from a
special header. So yeah, particular interpretations of the address are
non-RFC stuff.
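As a rough illustration, the character list above can be turned into a
check for an unquoted local part, with the "+" split kept as a separate,
clearly non-RFC step. The function names are hypothetical; the dot
handling (atoms separated by single dots) is my reading of the RFC's
dot-atom form and is not something discussed above.

```python
import re

# Characters allowed in an unquoted local part besides alphanumerics.
ATEXT_SPECIALS = "!#$%&'*+-/=?^_`{}|~"
_ATOM = "[A-Za-z0-9" + re.escape(ATEXT_SPECIALS) + "]+"
# Atoms separated by single dots; no leading, trailing, or doubled dot.
_DOT_ATOM = re.compile(_ATOM + r"(?:\." + _ATOM + ")*")


def is_unquoted_local_part(local):
    """True if `local` uses only the characters legal without quoting."""
    return _DOT_ATOM.fullmatch(local) is not None


def split_plus(local):
    """Non-RFC convention: treat text after the first '+' as extra data."""
    mailbox, sep, extra = local.partition("+")
    return mailbox, (extra if sep else None)
```

Keeping split_plus apart from the syntax check reflects the point that
the RFC only says "+" is legal, not what it means.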
Application domain syntax of course is not the email module's problem
whether it arrives by email or Pony Express, and I'm really confused
why you're going so far afield.
Just to point out that good data can be obtained from bad email
messages, I think, and that that is a use case.
> > No, they cannot just be raised. If you just raise the error, then the
> > next time you try to access unparsed data, you'll hit the error
> > again. If you use the same handler you did before, you're in an
> > infloop. So you need a second handler to do things differently this
> > time or a flag ... but it's unclear to me that that flag can be a
> > boolean. So you may as well store the defect list and information
> > about where to restart.
>
> From the point of view of the email package, the errors can just be
> raised. Then the client can make choices, and use other APIs or other
> parameters to the API to direct the email package to attempt a different
> technique to access the data.
The problem is that by this point some of the state of the parse may
be lost. We can't say "just raise", we need to say "interrupt the
parse, preserve state, and then raise". Python does absolutely
nothing to help with the problem of preserving the state. We also
need to determine just what state to preserve.
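One way to picture "interrupt the parse, preserve state, and then raise"
is an exception that carries the state itself. This is only a sketch: the
ParseInterrupted class and the toy parse_headers function are
hypothetical, and the parser ignores folded continuation lines.

```python
class ParseInterrupted(Exception):
    """Hypothetical exception carrying what is needed to resume."""

    def __init__(self, defect, parsed, resume_at):
        super().__init__(defect)
        self.defect = defect        # description of the problem
        self.parsed = parsed        # headers parsed before the problem
        self.resume_at = resume_at  # index of the offending line


def parse_headers(lines, start=0, parsed=None):
    """Parse 'Name: value' lines; raise with saved state on a bad line."""
    parsed = [] if parsed is None else parsed
    for i in range(start, len(lines)):
        name, sep, value = lines[i].partition(":")
        if not sep or not name.strip():
            raise ParseInterrupted("no field name in line %d" % i, parsed, i)
        parsed.append((name.strip(), value.strip()))
    return parsed


lines = ["From: a@example.com", "garbage", "Subject: hi"]
try:
    headers = parse_headers(lines)
except ParseInterrupted as exc:
    # The client chooses a different technique: skip the bad line and
    # continue from the saved position instead of starting over.
    headers = parse_headers(lines, exc.resume_at + 1, exc.parsed)
```

Here the preserved state is just a line index and the partial result; a
real parser would have to decide what else belongs in it.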
> Yes, I have learned that in my 34 years of programming. I agree.
>
> > So it's OK to write a lazy parser, but it must retain enough state so
> > that it can work forward until the end. [...]
>
> Are you speaking about parsing the message into MIME parts, or parsing a
> particular MIME part contained within the message, or both?
Both. I *believe* (but it needs to be checked) that in a correctly
formed multipart MIME object (message or part), any internal structure
is context-free within the MIME boundaries. If that is so, then
individual parts of the object can be stored in raw form and parsed
lazily.
Similarly, for any MIME or RFC 822 object, the object can be parsed
into header section and body section, and each can be stored and
parsed lazily, subject to the condition that the header section must
be sufficiently parsed to identify all headers that might affect
parsing the body part before the body part is parsed. That
"condition" is the context.
Neither of these context conditions apply to correctly formed MIME
trees, but are the only context I'm aware of that can affect parsing of
MIME parts, AFAIK (and I just reread most of the MIME RFCs in the last
few days).
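The header/body condition might look like this in a lazy design. It is a
sketch under simplifying assumptions: LazyMessage is a hypothetical
class, header folding is ignored, and only base64 is handled as a
transfer encoding.

```python
import base64


class LazyMessage:
    """Hypothetical container: stores raw sections, parses on demand."""

    def __init__(self, raw):
        # The first empty line separates the header section from the body.
        self.raw_header, _, self.raw_body = raw.partition(b"\r\n\r\n")
        self._headers = None  # filled in on first access

    @property
    def headers(self):
        if self._headers is None:
            self._headers = {}
            for line in self.raw_header.split(b"\r\n"):
                name, sep, value = line.partition(b":")
                if sep:
                    self._headers[name.strip().lower()] = value.strip()
        return self._headers

    def body(self):
        # Headers are consulted first: they decide how to read the body.
        encoding = self.headers.get(b"content-transfer-encoding", b"7bit")
        if encoding.lower() == b"base64":
            return base64.b64decode(self.raw_body)
        return self.raw_body
```

Nothing is parsed until body() or headers is touched, and the raw bytes
of both sections stay available either way.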
The only context for parsing MIME parts that I'm aware of is that, when
determining the end of a nested MIME part, the search for the ending
delimiter must include searching for any higher-level delimiter as
well... to handle the case where the inner delimiter got lost. So one
should search for CR LF --, and then examine the stuff after the -- to
match first the innermost delimiter, and then the next outermost, etc.,
and, on finding a match, consider that it is the end of all the parts
nested within the delimiter found, the inner ones being considered
truncated, since their own delimiter was not found.
Unexpected end-of-data should also mark all unterminated nested MIME
parts as incomplete, of course.
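A minimal sketch of that search, assuming the message has already been
split into lines (so "CR LF --" becomes "a line starting with --"). The
function name and return shape are hypothetical, and the boundary is
compared exactly after stripping trailing whitespace.

```python
def find_part_end(lines, start, boundaries):
    """Scan lines[start:] for the delimiter that ends the current part.

    `boundaries` lists the enclosing boundary strings, innermost first.
    Returns (line_index, depth, is_close). depth 0 means the innermost
    delimiter matched; depth > 0 means every part nested inside that
    level is truncated. On unexpected end of data the result is
    (len(lines), None, False) and all open parts are incomplete.
    """
    for i in range(start, len(lines)):
        line = lines[i].rstrip(b"\r\n \t")
        if not line.startswith(b"--"):
            continue
        rest = line[2:]
        # Try the innermost delimiter first, then each outer one.
        for depth, boundary in enumerate(boundaries):
            if rest == boundary:
                return i, depth, False
            if rest == boundary + b"--":
                return i, depth, True
    return len(lines), None, False
```

For example, with boundaries [b"inner", b"outer"], hitting "--outer--"
before any "--inner" line returns depth 1, which tells the caller that
the inner part lost its own delimiter.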
The only other cross-part context that I am aware of is Content-ID
references. That doesn't affect parsing, but rather semantic
interpretation, after parsing, validation, and decoding are complete.
--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking