Re: [Email-SIG] rfc822 parser (the elephant has landed)

R. David Murray Wed, 08 Jun 2011 15:47:27 -0700

On Wed, 08 Jun 2011 16:48:50 -0400, Barry Warsaw <ba...@python.org> wrote:
> * Changing the __setitem__ API.  I've always thought about this as a pure
>   convenience, and that appending was the most convenient semantics.  Other
>   methods, e.g. replace_header() should be included to provide the range of
>   semantics that people want.  Then we'd just pick one and alias it to
>   __setitem__.  I'm mixed as to whether appending still is the most convenient
>   alias, since in my own code I often `del msg[header]; msg[header] = foo`.
>   But that also changes the header order so it's not a perfect replacement.


Yeah, it would be really nice if setting (say) 'To' replaced it, but
setting (say) 'Resent-To' appended.  But that way lies chaos :)

One of my ideas is to eventually decouple the header dictionary from the
Message.  That is, you access the headers through msg.headers instead
of directly on msg.  At that point we could get away with changing
the semantics of __setitem__, and have msg.headers[X] be 'replace'.
Having append be spelled 'msg.headers.append(X)' seems slightly more
natural than having replace spelled msg.headers.replace(X), so that's
what I'd be in favor of.
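
A minimal sketch of that decoupled container, to make the two semantics concrete. The `Headers` class and its methods are hypothetical names used for illustration, not an existing email package API. Assignment replaces in place, which preserves header order, and `append` is the explicit way to add a repeated field.

```python
class Headers:
    """Hypothetical ordered header container (illustration only)."""

    def __init__(self):
        # (name, value) pairs kept in message order.
        self._items = []

    def append(self, name, value):
        # Explicit append: always adds another occurrence of the field.
        self._items.append((name, value))

    def __setitem__(self, name, value):
        # 'Replace' semantics: overwrite the first matching field in place,
        # drop any later duplicates, and append only if nothing matched.
        lname = name.lower()
        new_items = []
        replaced = False
        for existing_name, existing_value in self._items:
            if existing_name.lower() == lname:
                if not replaced:
                    new_items.append((name, value))
                    replaced = True
            else:
                new_items.append((existing_name, existing_value))
        if not replaced:
            new_items.append((name, value))
        self._items = new_items

    def get_all(self, name):
        lname = name.lower()
        return [v for n, v in self._items if n.lower() == lname]

    def items(self):
        return list(self._items)
```

With this shape, `headers['To'] = x` keeps 'To' where it was in the header block, unlike the `del msg[header]; msg[header] = foo` idiom.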

> * Unique headers: is this controlled or influenced by a policy?  For example,
>   duplicate Subjects might be disallowed by RFC 5322, but could conceivably be
>   allowed (or at least not prohibited) by other email-like protocols.

Right now it is always applied, but IMO it needs to be a policy setting.
So despite my thought that Messages don't have a policy, it turns out
that they do :(.  I haven't thought through how to handle that yet, though
the obvious way is to set attributes on the Message when it is created.
Perhaps what needs to be controlled on a Message is what Defects are
considered to be errors that should be raised.

An alternative would be to take the uniqueness check out of __setitem__
and do that check only at message generation time, if the policy says to
do so.  I'd prefer that the immediate raise be available as an option,
myself, since it seems like it would catch programming errors sooner
and thus make for a better user experience.
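
A rough sketch of the "attribute on the Message" option, assuming a single hypothetical flag `raise_on_duplicate`. The class name, the flag, and `DuplicateHeaderError` are all invented for illustration. The set of unique fields is the "max 1" group from the RFC 5322 section 3.6 table.

```python
# Fields that RFC 5322 section 3.6 limits to at most one occurrence.
UNIQUE_HEADERS = frozenset({
    'date', 'from', 'sender', 'reply-to', 'to', 'cc', 'bcc',
    'message-id', 'in-reply-to', 'references', 'subject',
})


class DuplicateHeaderError(ValueError):
    """Hypothetical defect: a unique header was set a second time."""


class SketchMessage:
    """Illustration only; not the email6 Message class."""

    def __init__(self, raise_on_duplicate=True):
        # Policy knob: raise immediately, or just record the defect.
        self.raise_on_duplicate = raise_on_duplicate
        self.defects = []
        self._headers = []

    def __setitem__(self, name, value):
        lname = name.lower()
        already_set = any(n.lower() == lname for n, _ in self._headers)
        if lname in UNIQUE_HEADERS and already_set:
            defect = DuplicateHeaderError(
                'duplicate %r header not allowed' % name)
            if self.raise_on_duplicate:
                raise defect
            # Lenient mode: remember the problem; a generator could
            # consult self.defects later if its policy says to.
            self.defects.append(defect)
        self._headers.append((name, value))

    def get_all(self, name):
        lname = name.lower()
        return [v for n, v in self._headers if n.lower() == lname]
```

The strict mode catches the programming error at the assignment; the lenient mode defers the decision to generation time.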

>   Also, while some fields like CC allow only one occurrence, it can contain
>   multiple values in that single field.  Is it totally insane to say that
>   `msg['cc'] = 'address'` would append `address` to the existing value?  It
>   probably is, but having to do that manually also kind of sucks.

Yeah, I think that would be insane :).  But += isn't and I want to support
that, as you note later.

>   Some headers have other constraints (RFC 5322, §3.6).  For example
>   Message-ID can technically appear zero times, but "SHOULD be present".  Part
>   of me thinks it should be out of scope for email6 to enforce this, and I'm
>   not sure where that would get enforced anyway, but I'm just wondering if
>   you've thought about that.

That one I think can only be enforced when the message is known to be
"complete", which would be when it is transmitted.  So the generator
could have a policy setting that controls whether or not a lack of 
a Message-ID is a raisable error.
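
Roughly, that check at generation time could look like the following. This is a sketch under assumptions: the function `generate_checked`, its `require_message_id` parameter, and `MissingMessageIDError` are hypothetical names. Only `email.message.Message` and its `as_string()` method are real stdlib API.

```python
from email.message import Message


class MissingMessageIDError(ValueError):
    """Hypothetical error for a message with no Message-ID."""


def generate_checked(msg, require_message_id=False):
    """Serialize msg, optionally insisting on a Message-ID.

    The check runs here rather than in __setitem__ because only at
    generation time is the message known to be complete.
    """
    if require_message_id and msg['Message-ID'] is None:
        raise MissingMessageIDError('message has no Message-ID header')
    return msg.as_string()


# Example message; the addresses are placeholders.
example = Message()
example['To'] = 'someone@example.com'
example.set_payload('hello\n')
```

By default the message generates without complaint; with `require_message_id=True` the missing header becomes a raised error.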

> * Datetimes: \o/.  It will be awesome when I can `msg['date'] = a_datetime`.
>   While it does seem reasonable that a naive datetime uses -0000, it should
>   also be very easy for folks to add a Date header that references the local
>   timezone, since I suspect that will be a more common use case than UTC.  I
>   don't know what the answer for that is though.

Well, Alexander has an answer (a function that returns an aware localtime
in the datetime module) but hasn't gotten consensus on adding it.
Perhaps I'll add such a function to email6, at least for the field trials.
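
A sketch of what such a helper might look like, assuming a Python version where `datetime.astimezone()` can be called with no argument and `email.utils.format_datetime` is available. The names `aware_localtime` and `date_header_value` are made up for this example.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime


def aware_localtime():
    """Return the current time as an aware datetime in the local zone."""
    # Start from an aware UTC time, then convert to the system zone.
    return datetime.now(timezone.utc).astimezone()


def date_header_value(dt=None):
    """Format dt (default: local now) as an RFC 5322 date string.

    A naive datetime is rendered with a -0000 offset; an aware one
    keeps its own UTC offset.
    """
    if dt is None:
        dt = aware_localtime()
    return format_datetime(dt)


# A fixed aware datetime, used only to show the output format.
EXAMPLE = datetime(2011, 6, 8, 15, 47, 27,
                   tzinfo=timezone(timedelta(hours=-7)))
```

Here `date_header_value(EXAMPLE)` gives `'Wed, 08 Jun 2011 15:47:27 -0700'`, while the same datetime without tzinfo ends in `-0000`.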

> * As for header parsing, have you looked at the pyparsing module?  I don't
>   write many parsers, and have no direct experience with pyparsing, but I keep
>   hearing really good things about it.  OTOH, it's not in the stdlib, so it
>   would present problems if email6 were to adopt it.  Still, I don't envy this
>   part of the job, and I sympathize with the rabbit-hole effect of "just one
>   more little thing..." ;)  Oh, and I'm just blown away impressed by the work
>   you've done on the parser.

I thought about pyparsing (though I haven't tried it out myself), but
I think its scope is much wider than email6 needs, and getting it into
the stdlib should be an independent project if doing so seems worthwhile.
I don't think email6 should depend on anything not already in the stdlib.
In any case, at this point I think the hard part of the parser is done,
and everything else is incremental additions and tweaks.

Something I didn't say in my blog post is that I'm thinking of marking
rfc822_parser as a private module for the 3.3 release, but that a long
term goal would be to expose it, if it proves to be worthwhile and useful
apart from its internal use in email6.  I think there are occasions when
programs need to do non-email rfc822 parsing, where it could come in handy
(perhaps with a few API tweaks to optionally suppress email-specific hacks).

Alternatively, the parser might get replaced by something else that does
the same job, especially if it proves to be a performance bottleneck.

> * Are there operations on Groups and Mailboxes?  E.g. in your example, I see
>   that you added `dinsd...@python.org` to the To header by string
>   concatenation.  What if for example, I had a number of addresses that I
>   wanted to combine into a Reply-To header (which RFC 5322 says I can only
>   have one of).  Would I be able to do something like the following:
> 
>   >>> msg['reply_to'].mailboxes.append('anot...@example.com')
> 
>   and have the printed representation of the message look correct?  Ah, maybe
>   something like your last example in the What's Missing section covers this.

Yes.  Headers are immutable, so 'append' is not the appropriate operation
for this.  + or += is.  What I'm thinking is that the current Mailbox
and Group objects should be enhanced so that there is a nice API for
creating them from various kinds of input data, and an explicit AddressList
object added, and then they can be passed around, summed, and maybe even
subtracted with each other and with AddressList valued header fields.
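
To make the immutable + / - idea concrete, here is a rough sketch. This `AddressList` class is hypothetical and is not the email6 design; it leans on the existing `email.utils.getaddresses` and `email.utils.formataddr` helpers for parsing and formatting, and ignores groups entirely.

```python
from email.utils import formataddr, getaddresses


class AddressList:
    """Hypothetical immutable list of (display_name, addr_spec) pairs."""

    def __init__(self, mailboxes=()):
        self._mailboxes = tuple(mailboxes)

    @classmethod
    def from_string(cls, text):
        # getaddresses takes a list of header values to parse.
        return cls(getaddresses([text]))

    @staticmethod
    def _coerce(other):
        if isinstance(other, AddressList):
            return other
        if isinstance(other, str):
            return AddressList.from_string(other)
        return None

    def __add__(self, other):
        other = self._coerce(other)
        if other is None:
            return NotImplemented
        # Build a new object; self is never modified.
        return AddressList(self._mailboxes + other._mailboxes)

    def __sub__(self, other):
        other = self._coerce(other)
        if other is None:
            return NotImplemented
        # Compare on the address only, case-insensitively (a
        # simplification: local parts are formally case-sensitive).
        drop = {addr.lower() for _, addr in other._mailboxes}
        return AddressList(
            m for m in self._mailboxes if m[1].lower() not in drop)

    def __str__(self):
        return ', '.join(formataddr(m) for m in self._mailboxes)
```

Because there is no `__iadd__`, `x += 'Bob <bob@example.com>'` falls back to `__add__` and rebinds `x` to a new object, which is the behavior an immutable header value would want.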

> * Oooh!  Your example has an `== None` which should probably be `is None` :)

Heh.  Oops :)  At least I ran the doc tests this time before posting.

> Really, *really* fantastic stuff.

Thanks.

--
R. David Murray           http://www.bitdance.com