Re: [Email-SIG] Thoughts on the general API, and the Header API.

Glenn Linderman Mon, 25 Jan 2010 16:55:34 -0800

On approximately 1/25/2010 12:10 PM, came the following characters from the keyboard of R. David Murray:

So, those are my thoughts, and I'm sure I haven't thought of all the
corner cases.  The biggest question is, does it seem like this general
scheme is worth pursuing?

Moving your last question to the front, yes. And of course, we do need to think through most of the corner cases before absolutely committing to this approach. But it sounds viable, and avoids an awful lot of duplicate APIs, and would allow simple email clients to be written primarily or even fully in bytes or primarily or even fully in strings.

A simple email client that is written fully in strings would "simply" reject/bounce messages that cannot be decoded to strings. This is simple; it works for 100% properly encoded messages; in an environment where a client is coded to process messages from some generator, once they are both debugged to the extent of generating messages that can be consumed, then all is well, and no messages would be rejected. This would not be an appropriate model for a general email server. I'd like to see a popular mailing list submission client that would bounce messages that are improperly formed -- forcing contributors to use RFC conformant clients, and thus encouraging the fixing of those clients that are not RFC conformant -- but I'm not going to hold my breath.

I think there can be enough power in an API designed in this manner to allow the full nitty-gritty access as required.

I have some questions and concerns; I haven't thought through all of them; perhaps some of them are corner cases, if so, they are corner cases that are particularly interesting to me.

OK, so we've agreed that we need to handle bytes and text at pretty
much all API levels, and that the "original data" that informs the data
structure can be either bytes or text.  We want to be able to recover
that original data, especially in the bytes case, but arguably also in
the text case.

Then there's also the issue of transforming a message once we have it in
a data structure, and the consequent issue of what it means to serialize
the resulting modified message.  (This last comes up in a very specific
way in issues 968430 and 1670765, which are about preserving the *exact*
byte representation of a multipart/signed message).

We've also agreed that whatever we decide to do with the __str__ and
__bytes__ magic methods, they will be implemented in terms of other
parts of the API.  So I'll ignore those for now.

I think we want to decide on a general API structure that is implemented
at all levels and objects where it makes sense, this being the API
for creating and accessing the following information about any part of
the model:

     * create object from bytes
     * create object from text
     * obtain the defect list resulting from creating the object
     * serialize object to bytes
     * serialize object to text
     * obtain original input data
     * discover the type of the original input data

At the moment I see no reason to change the API for defects (a .defects
attribute on the object holding a list of defects), so I'm going to
ignore that for now as well.

I spent a bunch of time trying to define an API for Headers that provided
methods for all of the above.  As I was writing the descriptions for
the various methods, and especially trying to specify the "correct"
behavior for both the raw-data-is-bytes and raw-data-is-text cases
(especially for the methods that serialize the data), the whole thing
began to give off a bad code smell.

After setting it aside for a bit, I had what I think is a little epiphany:
our need is to deal with messages (and parts of messages) that could be
in either bytes form or text form.  The things we need to do with them
are similar regardless of their form, and so we have been talking about a
"dual API": one method for bytes and a parallel method for text.

What if we recognize that we have two different data types, bytes messages
and text messages?  Then the "dual API" becomes a more uniform, almost
single, API, but with two possible underlying data types.

In the context specifically of the proposed new Header object, I propose
that we have a StringHeader and a BytesHeader, and an API that looks
something like this:

StringHeader

     properties:
         raw_header (None unless from_full_header was used)
         raw_name
         raw_value
         name
         value

     __init__(name, value)
     from_full_header(header)
     serialize(max_line_len=78,
               newline='\n',
               use_raw_data_if_possible=False)
     encode(charset='utf-8')
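
To make the proposed shape concrete, here is a minimal sketch of the text-side object. It is an illustration only, not the proposal's implementation: the attribute names follow the listing above, but the parsing rule (split on the first colon), the defect strings, and the omission of line folding and of encode() are assumptions made to keep the example short.

```python
class StringHeader:
    """Hypothetical sketch of the proposed text-side header object."""

    def __init__(self, name, value):
        self.raw_header = None        # set only by from_full_header
        self.raw_name = name
        self.raw_value = value
        self.name = name.strip()
        self.value = value.strip()
        self.defects = []

    @classmethod
    def from_full_header(cls, header):
        # Assumption: the name ends at the first colon.
        name, sep, value = header.partition(':')
        obj = cls(name, value)
        obj.raw_header = header
        if not sep:
            obj.defects.append('missing colon')
        return obj

    def serialize(self, max_line_len=78, newline='\n',
                  use_raw_data_if_possible=False):
        # Folding to max_line_len is left out of this sketch.
        if use_raw_data_if_possible and self.raw_header is not None:
            return self.raw_header
        return f'{self.name}: {self.value}{newline}'


parsed = StringHeader.from_full_header('Subject:  hello\n')
built = StringHeader('To', 'someone@example.com')
```

Here parsed.raw_header keeps the exact input, while built.raw_header stays None, which is the distinction the listing draws between the two constructors.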

If it was stated, I missed it: is from_full_header a way of producing an object from a raw data value? Whereas __init__ would obviously be used to produce one from string or bytes values. If so, then it would be a requirement that this from_full_header API would never produce an exception? Rather it would produce an object with or without defects?

Are there any other *Header APIs that would be required not to produce exceptions? I don't yet perceive any.
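
The "defects instead of exceptions" requirement could look roughly like the following. This is a sketch under assumptions: parse_full_header is a hypothetical helper, not part of any proposed or existing API, and the defect descriptions are plain strings chosen for illustration.

```python
def parse_full_header(raw):
    """Hypothetical parser: returns (name, value, defects) for any bytes input.

    Malformed input is recorded as a defect rather than raised.
    """
    defects = []
    # surrogateescape maps each byte 0x80-0xFF to U+DC80-U+DCFF, so decoding
    # cannot fail and the original bytes remain recoverable.
    text = raw.decode('ascii', errors='surrogateescape')
    if any('\udc80' <= ch <= '\udcff' for ch in text):
        defects.append('non-ASCII bytes in header')
    name, sep, value = text.partition(':')
    if not sep:
        defects.append('missing colon')
        name, value = '', text
    elif not name.strip():
        defects.append('empty header name')
    return name.strip(), value.strip(), defects


good = parse_full_header(b'Subject: hi\r\n')
bad = parse_full_header(b'no colon here')
eight_bit = parse_full_header(b'Subject: caf\xe9')
```

A caller that wants strict behaviour can still reject any result whose defect list is non-empty; the parser itself never has to decide.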


The "charset" parameter... is that not mostly needed for data parts?
Headers are either ASCII, or contain self-describing charset info.

I guess I could see an intermediate decode from string to some charset, before serialization, as a hint that when generating headers, that all the characters in the header that are not ASCII are in the specified charset... and that that charset is the one to be used in the self-describing serialized ASCII stream? The full generality of the RFCs, however, allows pieces of headers to be encoded using different charsets... with this API, it would seem that that could only be created containing one charset... unless the serialization primitives were made available, so that piecewise construction of a header value could be done with different charsets, and then the from_full_header API used to create the complex value. I don't see this as a severe limitation, I just want to understand your intention, and document the limitation, or my misunderstanding.
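
For comparison, the existing email.header.Header class in the standard library already permits this kind of piecewise construction, with a different charset per piece. The sketch below uses only that existing class; it says nothing about how the proposed StringHeader would expose the same capability, and the sample strings are arbitrary.

```python
from email.header import Header

# Build one header value from two pieces, each with its own charset.
h = Header()
h.append('Grüße', charset='iso-8859-1')
h.append('こんにちは', charset='utf-8')

# encode() returns the RFC 2047 form as an ASCII-only str.
wire = h.encode()
```

Each piece ends up in its own encoded word that names its charset, so the result is a single header value carrying two charsets.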

BytesHeader would be exactly the same, with the exception of the signature
for serialize and the fact that it has a 'decode' method rather than an
'encode' method.  Serialize would be different only in the fact that
it would have an additional keyword parameter, must_be_7bit=True.

I am not clear on why StringHeader's serialize would not need the must_be_7bit parameter... or do I misunderstand that StringHeader.serialize produces wire-format data?

The magic of this approach is in those encode/decode methods.

Encoding a StringHeader would yield a BytesHeader containing the same
data, but encoded per RFC2047 using the specified charset.  Decoding a
BytesHeader would yield a StringHeader with the same data, but decoded to
unicode per RFC2047, with any 8bit parts decoded (in the unicode sense,
not the RFC2047 sense) using the specified charset (which would default to
ASCII, meaning bare 8bit bytes in headers would throw an error).  (What to do
with RFC2047 charsets like unknown-8bit is an open question...probably
throw an error).

Would the encoding to/from StringHeader/BytesHeader preserve the from_full_header state and value?
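
One possible answer, sketched under the assumption that encode/decode go through the regular constructor (so raw_header is not carried across). The class bodies are hypothetical and reduced to the conversion methods; the RFC 2047 work is delegated to the existing email.header module, and mixed bare-8bit plus encoded-word values are not handled.

```python
from email.header import Header, decode_header, make_header


class BytesHeader:
    """Hypothetical bytes-side header, reduced to what decode() needs."""

    def __init__(self, name, value):
        self.name, self.value = name, value   # both bytes
        self.raw_header = None

    def decode(self, charset='ascii'):
        # Bare 8bit bytes raise UnicodeDecodeError under the ASCII default.
        text = self.value.decode(charset)
        value = str(make_header(decode_header(text)))
        return StringHeader(self.name.decode('ascii'), value)


class StringHeader:
    """Hypothetical text-side header, reduced to what encode() needs."""

    def __init__(self, name, value):
        self.name, self.value = name, value   # both str
        self.raw_header = None

    def encode(self, charset='utf-8'):
        wire = Header(self.value, charset=charset).encode()
        return BytesHeader(self.name.encode('ascii'), wire.encode('ascii'))


original = StringHeader('Subject', 'Grüße aus Köln')
as_bytes = original.encode()
round_trip = as_bytes.decode()
```

In this variant the value survives the round trip but the raw input does not; preserving raw_header would need an extra argument or a separate code path in both methods.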

(Encoding or decoding a Message would cause the Message to recursively
encode or decode its subparts.  This means you are making a complete
new copy of the Message in memory.  If you don't want to do that you
can walk the Message and convert it piece by piece (we could provide a
generator that does this).)

Walking it piece by piece would allow the old pieces to be discarded, to save total memory consumption, where that is appropriate.

Perhaps one generator that would be commonly used, would be to convert headers only, and leave MIME data parts alone, accessing and converting them only with the registered methods? This would mean that a "complete copy" wouldn't generally be very big, if the data parts were excluded from implicit conversion. Perhaps the "external storage protocol" might also only be defined for MIME data parts, and walking the tree with this generator would not need to reference the MIME data parts, nor bring them in from "external storage".
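
A headers-only walk of that kind might be sketched as follows. Part and walk_headers are hypothetical names invented for this illustration, and the conversion function is supplied by the caller; nothing here reflects an agreed design.

```python
from dataclasses import dataclass, field


@dataclass
class Part:
    """Hypothetical message node: headers plus a payload or subparts."""
    headers: list
    payload: object = None
    subparts: list = field(default_factory=list)


def walk_headers(part, convert):
    """Yield (part, converted_headers) depth-first.

    Only part.headers and part.subparts are accessed; payloads are never
    read, so they could stay in external storage.
    """
    yield part, [convert(h) for h in part.headers]
    for sub in part.subparts:
        yield from walk_headers(sub, convert)


message = Part(
    headers=[b'Subject: report'],
    subparts=[
        Part(headers=[b'Content-Type: text/plain'], payload=b'body text'),
        Part(headers=[b'Content-Type: image/png'], payload=b'\x89PNG...'),
    ],
)

converted = [hdrs for _, hdrs in
             walk_headers(message, lambda h: h.decode('ascii'))]
```

Because the generator yields one part at a time, a caller can drop each old header list as soon as its converted replacement has been stored.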

raw_header would be the data passed in to the constructor if
from_full_header is used, and None otherwise.  If encode/decode call
the regular constructor, then this attribute would also act as a flag
as to whether or not the header was constructed from raw input data
or via program.

This _implies_ that from_full_header always accepts raw data bytes... even for the StringHeader. And that implies the need for an implicit decode, and therefore, perhaps a charset parameter? No, not a charset parameter, since they are explicitly contained in the header values.


Decode for header values may not need a charset value at all!
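
The existing email.header.decode_header function illustrates the point: for a well-formed RFC 2047 value, each encoded word names its own charset, so the caller supplies none. The sample header value below is made up for the illustration.

```python
from email.header import decode_header

# Two encoded words, each naming its own charset.
wire = '=?iso-8859-1?q?Gr=FC=DFe_aus_?= =?utf-8?q?K=C3=B6ln?='

# decode_header returns (bytes, charset) pairs; the whitespace between
# adjacent encoded words is dropped, as RFC 2047 requires.
parts = decode_header(wire)

# Each piece is decoded with the charset it declared itself.
text = ''.join(raw.decode(charset) for raw, charset in parts)
```

A charset argument only becomes relevant for bare 8bit bytes, which carry no such self-description.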


No comments for the rest.

--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
