On Thu, 03 Mar 2011 16:28:32 +0100, Steffen Daode Nurpmeso <sdao...@googlemail.com> wrote:
> On Wed, Mar 02, 2011 at 07:50:20PM -0500, R. David Murray wrote:
> > That is, if the defects list is non-empty,
> > the message is technically malformed. Of course, that information by
> > itself isn't necessarily useful, which is why defects is a list
> > of defects.
> > "is_processable" lies in the eyes of the application.
> > What defects is it capable of dealing with? The email package
> > can't know that. So, again, that's why defects is a list.
> >
> > Let me clarify what I mean by the policy controlling "what, exactly, is
> > a defect". The idea here is that when parsing an email, each deviance
> > from the RFCs counts as a defect (the current email package, by the way,
> > only detects a small number of such defects!). But when parsing, say,
> > an http stream, non-ascii characters in headers are perfectly legal.
> > So it seems to make sense that the HTTP policy would change what counts
> > as a defect during the operation of the parser.
>
> So i would hope for '.all_defects[]' and (policy-adjusted)
> '.defects[]'. I would hope for
> '.had_header_defects(policy_only=True)',
> '.had_payload_defects(policy_only=True)'.

Well, what is a defect for an HTTP parse is not the same as what is a
defect for an email parse, so I don't know what "all defects" would
consist of. The recovery decisions the parser makes can also be
affected by the policy, so there can't, as far as I can see, be a
single list of "all defects" that applies to all parses.

Currently the email package does not report header defects. When it
does, my plan is that each Header will have its own defect list, and
likewise each message body (using a recursive definition). How the
defects list on the Message object interacts with this is an
interesting API question worthy of discussion. Perhaps we do, after
all, have some sort of "has_defects" method that queries the
constituent parts, and perhaps a function that returns a list of parts
with defects, possibly divided between headers and body as you suggest.

> Doing so would fill the huge hole in between 'not len(defects)'
> and the detailed inspection of a defects list which consists of
> a highly differentiated tree of classes.

Yeah, the number of different defect classes involved in this scheme
worries me a little bit.

> The parser has to parse- and does encounter all of these anyway,
> and an application cannot re-collect this (dropped) information
> except with expensive effort, i.e. at least choosing a different,
> stricter policy followed by another parse of the bogus mail.

Why recollect? The list is there (and, as I indicated above, will be
associated with the part that contains the error). The list of defects
will be *all* the defects detected by that policy: all RFC deviance
(well, perhaps not quite all...see below). Defects don't normally raise
errors, so there's no reason not to look for all of the relevant ones
(and indeed, we are probably only detecting the ones that actually
affect the parsing). That is, if you parse an HTTP stream, encountering
a non-ASCII character is *not* a defect.
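
To make that concrete, here is a rough sketch of the idea. Everything
in it (`Policy`, `Header`, `parse_header`, `check_non_ascii`, and the
two policy instances) is a hypothetical name invented for illustration;
none of it is the email package's API or a settled design. It only
assumes that a policy carries the checks that define a defect, and that
each header keeps its own defect list:

```python
# Hypothetical sketch only: these names are NOT part of the email package.
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# A check inspects one header value and describes each defect it finds.
Check = Callable[[str], List[str]]


def check_non_ascii(value: str) -> List[str]:
    """Report raw non-ASCII characters in a header value."""
    if any(ord(ch) > 127 for ch in value):
        return ["non-ASCII character in header value"]
    return []


@dataclass(frozen=True)
class Policy:
    """The policy decides which checks run, i.e. what counts as a defect."""
    name: str
    header_checks: Tuple[Check, ...] = ()


@dataclass
class Header:
    """Each header carries its own defect list."""
    name: str
    value: str
    defects: List[str] = field(default_factory=list)


def parse_header(line: str, policy: Policy) -> Header:
    """Split 'Name: value' and record only the defects the policy looks for."""
    name, _, value = line.partition(":")
    header = Header(name.strip(), value.strip())
    for check in policy.header_checks:
        header.defects.extend(check(header.value))
    return header


# Assumed policies: the email one flags non-ASCII, the HTTP one never looks.
EMAIL_POLICY = Policy("email", header_checks=(check_non_ascii,))
HTTP_POLICY = Policy("http")

line = "Subject: caf\u00e9"
email_header = parse_header(line, EMAIL_POLICY)
http_header = parse_header(line, HTTP_POLICY)
print(email_header.defects)  # one entry: the non-ASCII defect
print(http_header.defects)   # empty: not a defect under this policy
```

Note that the HTTP parse does not produce a longer list that then gets
filtered down; the check is simply never run.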
It doesn't make any sense to me to report an "if this were an email
this would be a defect" defect. And if the header for some strange
reason included an RFC2047 encoded word that was invalidly
formed...well, in an HTTP parse that would *technically* violate the
RFC, but in practice it really means that the data should just be
passed through as is. That is, it's not a defect, and we would be
wasting time even *looking* for RFC2047 encoded words. (Unless someone
finds a browser or server that generates them!)

In other words, in the base package I don't think there are "strict"
and "less strict" parsing policies; rather there are *different*
parsing policies depending on the context. As far as I can see, it
makes no sense to parse an HTTP stream, and then reparse it as if it
were an email stream.

Now, it might be useful to design a "very_strict" policy that did extra
work looking for RFC defects that a normal parse wouldn't detect (I
can't think of any off the top of my head, but the email RFCs are so
complex that I'm sure there are some), but in that case if you parsed
it with the less-strict (normal) policy those defects would *not* be
noticed by the parser. In any case, I think such a validating
parser/policy is out of scope for the current package.

--David
_______________________________________________
Email-SIG mailing list
Email-SIG@python.org