OK, so we've agreed that we need to handle bytes and text at pretty
much all API levels, and that the "original data" that informs the data
structure can be either bytes or text.  We want to be able to recover
that original data, especially in the bytes case, but arguably also in
the text case.

Then there's also the issue of transforming a message once we have it in
a data structure, and the consequent issue of what it means to serialize
the resulting modified message.  (This last comes up in a very specific
way in issues 968430 and 1670765, which are about preserving the *exact*
byte representation of a multipart/signed message).

We've also agreed that whatever we decide to do with the __str__ and
__bytes__ magic methods, they will be implemented in terms of other
parts of the API.  So I'll ignore those for now.

I think we want to decide on a general API structure that is implemented
at all levels and objects where it makes sense, this being the API
for creating and accessing the following information about any part of
the model:

    * create object from bytes
    * create object from text
    * obtain the defect list resulting from creating the object
    * serialize object to bytes
    * serialize object to text
    * obtain original input data
    * discover the type of the original input data

At the moment I see no reason to change the API for defects (a .defects
attribute on the object holding a list of defects), so I'm going to
ignore that for now as well.
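
Just to pin down the general shape, here's a rough stub of what those
operations might look like on an arbitrary object in the model (all of
the names below are illustrative placeholders, not a proposal):

    # Illustrative placeholders only: one possible shape for the
    # general per-object API enumerated above.
    class AnyModelObject:
        @classmethod
        def from_bytes(cls, data): ...      # create object from bytes
        @classmethod
        def from_string(cls, text): ...     # create object from text
        # .defects       -> list of defects noted while creating the object
        def as_bytes(self): ...             # serialize object to bytes
        def as_string(self): ...            # serialize object to text
        # .raw_data      -> the original input data, if any
        # .raw_data_type -> bytes or str, whichever the original input was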

I spent a bunch of time trying to define an API for Headers that provided
methods for all of the above.  As I was writing the descriptions for
the various methods, and especially trying to specify the "correct"
behavior for both the raw-data-is-bytes and raw-data-is-text cases
(especially for the methods that serialize the data), the whole thing
began to give off a bad code smell.

After setting it aside for a bit, I had what I think is a little epiphany:
our need is to deal with messages (and parts of messages) that could be
in either bytes form or text form.  The things we need to do with them
are similar regardless of their form, and so we have been talking about a
"dual API": one method for bytes and a parallel method for text.

What if we recognize that we have two different data types, bytes messages
and text messages?  Then the "dual API" becomes a more uniform, almost
single, API, but with two possible underlying data types.

In the context specifically of the proposed new Header object, I propose
that we have a StringHeader and a BytesHeader, and an API that looks
something like this:

StringHeader

    properties:
        raw_header (None unless from_full_header was used)
        raw_name
        raw_value
        name
        value

    __init__(name, value)
    from_full_header(header)
    serialize(max_line_len=78,
              newline='\n',
              use_raw_data_if_possible=False)
    encode(charset='utf-8')

BytesHeader would be exactly the same, with the exception of the signature
for serialize and the fact that it has a 'decode' method rather than an
'encode' method.  Serialize would differ only in taking an additional
keyword parameter, must_be_7bit=True.
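
Put side by side as a skeleton (stubs only; the defaults are the ones
given above, and nothing is actually implemented here):

    class StringHeader:
        # properties: raw_header, raw_name, raw_value, name, value
        def __init__(self, name, value): ...
        @classmethod
        def from_full_header(cls, header): ...
        def serialize(self, max_line_len=78, newline='\n',
                      use_raw_data_if_possible=False): ...
        def encode(self, charset='utf-8'): ...        # -> BytesHeader

    class BytesHeader:
        # properties: raw_header, raw_name, raw_value, name, value
        def __init__(self, name, value): ...
        @classmethod
        def from_full_header(cls, header): ...
        def serialize(self, max_line_len=78, newline='\n',
                      use_raw_data_if_possible=False,
                      must_be_7bit=True): ...
        def decode(self, charset='ascii'): ...        # -> StringHeader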

The magic of this approach is in those encode/decode methods.

Encoding a StringHeader would yield a BytesHeader containing the same
data, but encoded per RFC2047 using the specified charset.  Decoding a
BytesHeader would yield a StringHeader with the same data, but decoded to
unicode per RFC2047, with any 8bit parts decoded (in the unicode sense,
not the RFC2047 sense) using the specified charset (which would default to
ASCII, meaning bare 8bit bytes in headers would throw an error).  (What to
do with RFC2047 charsets like unknown-8bit is an open question...probably
throw an error.)
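
To make the decode direction concrete, here's roughly the transformation
it would perform on a header value, with the existing
email.header.decode_header helper standing in for the real thing (this
is an illustration, not the proposed implementation):

    from email.header import decode_header

    # What BytesHeader('Subject', ...).decode() would conceptually do to
    # the value: decode RFC2047 encoded words, decode any remaining raw
    # chunks with the specified charset (defaulting to ascii), and join
    # the pieces into a single unicode string.
    raw_value = '=?utf-8?q?caf=C3=A9_menu?= (updated)'
    charset = 'ascii'
    text = ''.join(
        chunk.decode(cs or charset) if isinstance(chunk, bytes) else chunk
        for chunk, cs in decode_header(raw_value))
    print(text)    # café menu (updated)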

(Encoding or decoding a Message would cause the Message to recursively
encode or decode its subparts.  This means you are making a complete
new copy of the Message in memory.  If you don't want to do that you
can walk the Message and convert it piece by piece (we could provide a
generator that does this).)
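
As a rough sketch of the piece-by-piece idea, written against the
existing Message.walk/get_payload API (Python 3.2+) rather than the
proposed Bytes/String classes, since those don't exist yet:

    from email import message_from_bytes

    def decoded_pieces(raw_bytes, fallback='ascii'):
        # Yield (part, text_payload) pairs one at a time instead of
        # building a fully decoded copy of the whole message in memory.
        msg = message_from_bytes(raw_bytes)
        for part in msg.walk():
            if part.is_multipart():
                continue
            payload = part.get_payload(decode=True) or b''
            charset = part.get_content_charset() or fallback
            yield part, payload.decode(charset, errors='replace')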

raw_header would be the data passed in to the constructor if
from_full_header is used, and None otherwise.  If encode/decode call
the regular constructor, then this attribute would also act as a flag
as to whether or not the header was constructed from raw input data
or programmatically.

raw_name and raw_value would be the fieldname and fieldbody, either
what was passed in to the __init__ constructor, or the result of
splitting what was passed to the from_full_header constructor on
the first ':'.  (These are convenience attributes and are not
essential to the proposed API).

name would be the fieldname stripped of trailing whitespace.

value would be the *unfolded* fieldbody stripped of leading and
trailing whitespace (but with internal whitespace intact).
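
In other words, with plain string operations standing in for the class
(which doesn't exist yet):

    # How name and value would fall out of a header handed to
    # from_full_header: split on the first ':', strip the fieldname's
    # trailing whitespace, unfold the fieldbody and strip its ends.
    full_header = 'Subject : a folded\r\n  subject   line '
    raw_name, _, raw_value = full_header.partition(':')
    name = raw_name.rstrip()                         # 'Subject'
    value = ''.join(raw_value.splitlines()).strip()  # unfolding keeps the
    print(repr(name), repr(value))                   # internal whitespace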

As for serialize, my thought here is that every object in the tree
has a serialize method with the same signature, and serialization
is a matter of recursively passing the specified parameters downward.

max_line_len is obvious, and defaults to the RFC recommended max.  (If you
want the unfolded header, use its .value attribute).  newline resolves
issue 1349106, allowing an email package client to generate completely
wire-format messages if it needs to.  use_raw_data_if_possible would
mean to emit the original raw data if it exists (modulo changing
the flavor of newline if needed, for those object types (such as
headers) where that makes sense).  The serialize method of specific
sub-types can do specialized things (e.g., multipart/signed can make
use_raw_data_if_possible default to True).
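
For a single header, the existing email.header.Header.encode (Python
3.2+) already exposes roughly the same two knobs, which gives a feel for
what the first two serialize parameters control (again, the current API
standing in, not the proposal):

    from email.header import Header

    h = Header('A subject line that is long enough that it will need '
               'to be folded when serialized for the wire',
               header_name='Subject')
    # maxlinelen plays the role of max_line_len, linesep that of newline.
    print(h.encode(maxlinelen=78, linesep='\r\n'))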

For Bytes types, the extra 'must_be_7bit' flag would cause any 8bit
data to be transport encoded to be 7bit clean.  (For headers, this would
mean raw 8bit data would get the charset 'unknown-8bit', and we might
want to provide more control over that in some way: an error and a way to
provide an error handler, or some other way to specify a charset to use
for such encodings.)  use_raw_data_if_possible would cause this flag to
be ignored when raw data was available for the object.

(If you want the text version of the transport-encoded message for some
reason, you can serialize the Bytes form using must_be_7bit and decode
the result as ASCII.)
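
The effect must_be_7bit is after for a header, and the ASCII round trip
just described, can both be illustrated with the existing Header class
(this is the current API, not the proposal):

    from email.header import Header

    # 8bit header data ends up as a 7bit-clean encoded word, so the
    # serialized bytes can be decoded as ASCII to get a text copy.
    wire = Header('café menu', charset='utf-8',
                  header_name='Subject').encode()
    wire_bytes = wire.encode('ascii')   # succeeds because it's 7bit clean
    print(wire_bytes.decode('ascii'))   # e.g. =?utf-8?q?caf=C3=A9_menu?=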

Subclasses of these classes for structured headers would have additional
methods that would return either specialized object types (datetimes,
address objects) or bytes/strings, and these may or may not exist in
both Bytes and String forms (that depends on the use cases, I think).
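
For example, a hypothetical date subclass on the String side, leaning on
the existing email.utils parser (Python 3.3+) and shown standalone here
rather than inheriting from StringHeader, purely for brevity:

    from email.utils import parsedate_to_datetime

    class StringDateHeader:
        def __init__(self, name, value):
            self.name, self.value = name, value

        @property
        def datetime(self):
            # Expose the field body as an aware datetime object.
            return parsedate_to_datetime(self.value)

    hdr = StringDateHeader('Date', 'Mon, 20 Nov 1995 19:12:08 -0500')
    print(hdr.datetime)    # 1995-11-20 19:12:08-05:00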

I also think that the Bytes and String versions of objects that have
them can share large portions of their implementation through a base
class.  I think that makes this approach both easier to code than a
single-type-dual-API approach, and more robust in the face of changes.

So, those are my thoughts, and I'm sure I haven't thought of all the
corner cases.  The biggest question is, does it seem like this general
scheme is worth pursuing? 

--David