On approximately 1/25/2010 12:10 PM, came the following characters from the keyboard of R. David Murray:
So, those are my thoughts, and I'm sure I haven't thought of all the
corner cases.  The biggest question is, does it seem like this general
scheme is worth pursuing?

Moving your last question to the front, yes. And of course, we do need to think through most of the corner cases before absolutely committing to this approach. But it sounds viable, and avoids an awful lot of duplicate APIs, and would allow simple email clients to be written primarily or even fully in bytes or primarily or even fully in strings.

A simple email client that is written fully in strings would "simply" reject/bounce messages that cannot be decoded to strings. This is simple, and it works for 100% properly encoded messages: in an environment where a client is coded to process messages from some generator, once both are debugged to the point that the generated messages can be consumed, all is well and no messages would be rejected. This would not be an appropriate model for a general email server. I'd like to see a popular mailing list submission client that would bounce improperly formed messages, forcing contributors to use RFC-conformant clients and thus encouraging the repair of those clients that are not RFC conformant, but I'm not going to hold my breath.

I think there can be enough power in an API designed in this manner to allow the full nitty-gritty access as required.

I have some questions and concerns; I haven't thought through all of them; perhaps some of them are corner cases, if so, they are corner cases that are particularly interesting to me.

OK, so we've agreed that we need to handle bytes and text at pretty
much all API levels, and that the "original data" that informs the data
structure can be either bytes or text.  We want to be able to recover
that original data, especially in the bytes case, but arguably also in
the text case.

Then there's also the issue of transforming a message once we have it in
a data structure, and the consequent issue of what it means to serialize
the resulting modified message.  (This last comes up in a very specific
way in issues 968430 and 1670765, which are about preserving the *exact*
byte representation of a multipart/signed message).

We've also agreed that whatever we decide to do with the __str__ and
__bytes__ magic methods, they will be implemented in terms of other
parts of the API.  So I'll ignore those for now.

I think we want to decide on a general API structure that is implemented
at all levels and objects where it makes sense, this being the API
for creating and accessing the following information about any part of
the model:

     * create object from bytes
     * create object from text
     * obtain the defect list resulting from creating the object
     * serialize object to bytes
     * serialize object to text
     * obtain original input data
     * discover the type of the original input data

At the moment I see no reason to change the API for defects (a .defects
attribute on the object holding a list of defects), so I'm going to
ignore that for now as well.

I spent a bunch of time trying to define an API for Headers that provided
methods for all of the above.  As I was writing the descriptions for
the various methods, and especially trying to specify the "correct"
behavior for both the raw-data-is-bytes and raw-data-is-text cases
(especially for the methods that serialize the data), the whole thing
began to give off a bad code smell.

After setting it aside for a bit, I had what I think is a little epiphany:
our need is to deal with messages (and parts of messages) that could be
in either bytes form or text form.  The things we need to do with them
are similar regardless of their form, and so we have been talking about a
"dual API": one method for bytes and a parallel method for text.

What if we recognize that we have two different data types, bytes messages
and text messages?  Then the "dual API" becomes a more uniform, almost
single, API, but with two possible underlying data types.

In the context specifically of the proposed new Header object, I propose
that we have a StringHeader and a BytesHeader, and an API that looks
something like this:

StringHeader

     properties:
         raw_header (None unless from_full_header was used)
         raw_name
         raw_value
         name
         value

     __init__(name, value)
     from_full_header(header)
     serialize(max_line_len=78,
               newline='\n',
               use_raw_data_if_possible=False)
     encode(charset='utf-8')

If it was stated, I missed it: is from_full_header a way of producing an object from a raw data value? Whereas __init__ would obviously be used to produce one from string or bytes values. If so, then it would be a requirement that this from_full_header API would never produce an exception? Rather it would produce an object with or without defects?

Are there any other *Header APIs that would be required not to produce exceptions? I don't yet perceive any.
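
To make the "never raises, reports defects instead" reading concrete, here is a minimal sketch. It is only an assumption about how from_full_header might behave, not the proposed implementation: the HeaderDefect class and the parsing rules are hypothetical, and raw_name/raw_value are left out for brevity.

```python
class HeaderDefect(Exception):
    """Hypothetical defect record; stored on the header, never raised."""


class StringHeader:
    def __init__(self, name, value):
        self.name = name
        self.value = value
        self.raw_header = None   # stays None unless from_full_header is used
        self.defects = []

    @classmethod
    def from_full_header(cls, header):
        # Parse "Name: value" leniently: record problems, never raise.
        defects = []
        name, sep, value = header.partition(":")
        if not sep:
            defects.append(HeaderDefect("missing colon separator"))
            name, value = "", header
        if name != name.strip():
            defects.append(HeaderDefect("whitespace around header name"))
        obj = cls(name.strip(), value.strip())
        obj.raw_header = header      # keep the original input recoverable
        obj.defects = defects
        return obj
```

Under this reading a caller checks .defects after construction, and raw_header doubles as the "was this built from raw input?" flag.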

The "charset" parameter... is that not mostly needed for data parts?
Headers are either ASCII, or contain self-describing charset info.
I guess I could see it as a hint for generating headers: an intermediate conversion from string to some charset, before serialization, declaring that all the characters in the header that are not ASCII are in the specified charset, and that that charset is the one to be used in the self-describing serialized ASCII stream? The full generality of the RFCs, however, allows pieces of a header to be encoded using different charsets. With this API, it would seem that a header could only be created containing one charset, unless the serialization primitives were made available, so that piecewise construction of a header value could be done with different charsets, and then the from_full_header API used to create the complex value. I don't see this as a severe limitation; I just want to understand your intention, and document the limitation, or my misunderstanding.
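
For reference, the existing email.header module already allows the piecewise, mixed-charset construction described above. The sketch below uses only the current Header.append, Header.encode, decode_header and make_header calls; the sample strings are illustrative, and nothing here is part of the proposed StringHeader API.

```python
from email.header import Header, decode_header, make_header

# Build one header value from two pieces, each with its own charset.
h = Header()
h.append("Grüße", charset="iso-8859-1")
h.append("Привет", charset="koi8-r")

wire = h.encode()              # ASCII-only str of RFC 2047 encoded words
pieces = decode_header(wire)   # list of (bytes, charset) pairs
text = str(make_header(pieces))
```

Each encoded word in wire names its own charset, which is why decoding needs no charset argument for the encoded parts.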


BytesHeader would be exactly the same, with the exception of the signature
for serialize and the fact that it has a 'decode' method rather than an
'encode' method.  Serialize would be different only in the fact that
it would have an additional keyword parameter, must_be_7bit=True.

I am not clear on why StringHeader's serialize would not need the must_be_7bit parameter... or do I misunderstand that StringHeader.serialize produces wire-format data?

The magic of this approach is in those encode/decode methods.

Encoding a StringHeader would yield a BytesHeader containing the same
data, but encoded per RFC2047 using the specified charset.  Decoding a
BytesHeader would yield a StringHeader with the same data, but decoded to
unicode per RFC2047, with any 8bit parts decoded (in the unicode sense,
not the RFC2047 sense) using the specified charset (which would default to
ASCII, meaning bare 8bit bytes in headers would throw an error).  (What to
do with RFC2047 charsets like unknown-8bit is an open question...probably
throw an error).
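
A rough sketch of the encode/decode pairing follows, leaning on the existing email.header module for the RFC 2047 work. The class bodies are assumptions made for illustration: only name and value are modelled, serialize and must_be_7bit are omitted, and mixing a non-ASCII charset argument with encoded words in the same value is not handled.

```python
from email.header import Header, decode_header, make_header


class StringHeader:
    def __init__(self, name, value):
        self.name = name      # str
        self.value = value    # str

    def encode(self, charset="utf-8"):
        # Header.encode() returns an ASCII-only str of RFC 2047 encoded words.
        wire = Header(self.value, charset=charset).encode()
        return BytesHeader(self.name.encode("ascii"), wire.encode("ascii"))


class BytesHeader:
    def __init__(self, name, value):
        self.name = name      # bytes
        self.value = value    # bytes

    def decode(self, charset="ascii"):
        # `charset` applies only to bare bytes outside encoded words; with
        # the ASCII default, bare 8-bit bytes raise UnicodeDecodeError.
        text = self.value.decode(charset)
        # Encoded words name their own charset, so none is passed here.
        value = str(make_header(decode_header(text)))
        return StringHeader(self.name.decode("ascii"), value)
```

Round-tripping StringHeader("Subject", "Grüße aus Köln").encode().decode() should give back the original value in this sketch.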

Would the encoding to/from StringHeader/BytesHeader preserve the from_full_header state and value?

(Encoding or decoding a Message would cause the Message to recursively
encode or decode its subparts.  This means you are making a complete
new copy of the Message in memory.  If you don't want to do that you
can walk the Message and convert it piece by piece (we could provide a
generator that does this).)

Walking it piece by piece would allow the old pieces to be discarded, to save total memory consumption, where that is appropriate.

Perhaps one generator that would be commonly used, would be to convert headers only, and leave MIME data parts alone, accessing and converting them only with the registered methods? This would mean that a "complete copy" wouldn't generally be very big, if the data parts were excluded from implicit conversion. Perhaps the "external storage protocol" might also only be defined for MIME data parts, and walking the tree with this generator would not need to reference the MIME data parts, nor bring them in from "external storage".
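
A headers-only generator of that kind might look roughly like this. It is a sketch against today's email package (Message.walk, Message.items, decode_header, make_header), not the proposed model, and the function name iter_decoded_headers is made up for illustration.

```python
import email
from email.header import decode_header, make_header


def iter_decoded_headers(msg):
    """Yield (part, name, decoded_value) without touching any payload."""
    for part in msg.walk():                 # the message itself, then subparts
        for name, raw_value in part.items():
            yield part, name, str(make_header(decode_header(raw_value)))


raw = "Subject: =?utf-8?q?caf=C3=A9?=\nContent-Type: text/plain\n\nbody text\n"
msg = email.message_from_string(raw)
found = {name: value for _, name, value in iter_decoded_headers(msg)}
```

Because get_payload() is never called, the data parts are not converted, and in a model with external storage they would not need to be loaded.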

raw_header would be the data passed in to the constructor if
from_full_header is used, and None otherwise.  If encode/decode call
the regular constructor, then this attribute would also act as a flag
as to whether or not the header was constructed from raw input data
or via program.

This _implies_ that from_full_header always accepts raw data bytes... even for the StringHeader. And that implies the need for an implicit decode, and therefore, perhaps a charset parameter? No, not a charset parameter, since they are explicitly contained in the header values.

Decode for header values may not need a charset value at all!


No comments for the rest.

--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
