On Oct 3, 2009, at 11:41 AM, Stephen J. Turnbull wrote:

> Barry Warsaw writes:
>
> > So the basic model is: accept strings or bytes at the edges,
> > process everything internally as bytes, output strings and bytes at
> > the edges.
>
> In a certain pedantic sense, that can't be right, because bytes alone
> can't represent strings.
>
> Practically, you are going to need to say how a bytes or bytearray is
> to be interpreted as a string, and that is going to be one big mess.
> (MIME?)
>
> Going the other way around you have no such problem, or rather the
> trivial embedding works fine, except that you have to do a range check
> at some point before you convert to bytes.

So, I've made at least two abortive attempts at updating the email package to Python 3, once using bytes internally and once using strings internally. Neither one was completely satisfying (to say the least). I've also heard convincing arguments from folks in the Python community in both camps: "using anything other than strings internally is insane; no, using anything other than bytes internally is insane."

As for the internal representational format, I'll amend my previous statement and say that I'll keep an open mind, but one thing that seems very clear is that we have to be able to accept strings and bytes at the incoming edges, and produce strings and bytes at the outgoing edges. In a future message, Stephen outlines some excellent use cases, to which I'll follow up when I get there. But I think he generally hits the nail on the head and proves that we'll have both types at the edges. That makes for very interesting API design!
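To make "both types at the edges" concrete, here is a rough sketch, not a proposal for the final API. It assumes the parsing and flattening calls found in later Python 3 standard libraries (email.message_from_string, email.message_from_bytes, Message.as_string, Message.as_bytes); the load_message helper is hypothetical and exists only for illustration.

```python
import email


def load_message(source):
    """Hypothetical helper: parse a message supplied as either str or bytes."""
    if isinstance(source, (bytes, bytearray)):
        return email.message_from_bytes(bytes(source))
    if isinstance(source, str):
        return email.message_from_string(source)
    raise TypeError("expected str or bytes, got %s" % type(source).__name__)


text = "Subject: hello\n\nbody\n"

# Incoming edge: the same message arrives once as text, once as bytes.
from_text = load_message(text)
from_bytes = load_message(text.encode("ascii"))

# Outgoing edge: either parsed object can be flattened to either type.
as_text = from_bytes.as_string()
as_bytes = from_text.as_bytes()
```

The caller never sees what the model stores internally; it only picks the type it wants at each edge.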

There's "internal" and then there's the low-level representation that the model exposes. Here I have more confidence that we need to make things much more consistent. The trick is to do that while still making things convenient.

For example, we currently represent header values as 8-bit strings or Header instances. The latter can contain triples of the individual chunks, e.g. (content, language, charset). I think we need to represent header values as instances in all cases because the type checking is error-prone, but even then, it makes for difficult API choices. Still, if the fundamental atom of header values in the model is the Header, and we define both byte and string APIs for headers, then the internal representation matters less since only the email package implementers need to care.

But note that even in this limited case, neither bytes nor strings really works. The internal representation is that triple (and in the current model an implicit triple where charset=us-ascii). So internally the charset is carried along for the ride, as it must be. If the internal representation were just strings or bytes, we wouldn't know how to generate the other format, at least not idempotently (or as close as we can get).
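A small sketch of why the charset has to ride along, assuming the email.header API as it exists in Python 3's standard library (Header, decode_header, make_header). It only tracks (content, charset) pairs, not the language part of the triple, so treat it as an illustration of the idea rather than of the eventual design.

```python
from email.header import Header, decode_header, make_header

# A header value together with the charset it should be encoded in.
original = Header("Grüße", charset="utf-8")

# Wire form: an ASCII-only RFC 2047 encoded word.
wire = original.encode()

# Decoding yields (bytes, charset) chunks; the bytes alone would not
# tell us how to get back to text.
chunks = decode_header(wire)

# With the charset carried along, both directions can be regenerated.
restored = make_header(chunks)
```

Drop the charset from any chunk and neither str(restored) nor restored.encode() can be reproduced reliably.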

Just to ramble a little longer, it's been argued that we should give up on idempotency, but I'm not convinced. I think people want to see an email message they throw into the system come out the other end as closely as possible (well, /exactly/ for well-formed messages).
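The property I have in mind can be written down as a check. This is only a sketch for one trivially well-formed message, assuming the email.message_from_bytes and Message.as_bytes calls of later Python 3 releases; the sample message is made up.

```python
import email

# A made-up, well-formed message using plain ASCII and short headers.
raw = (
    b"From: a@example.com\n"
    b"To: b@example.com\n"
    b"Subject: hello\n"
    b"\n"
    b"Body text\n"
)

# Throw it into the system and take it back out the other end.
round_tripped = email.message_from_bytes(raw).as_bytes()
```

For a well-formed input like this, round_tripped should equal raw byte for byte; the hard cases are the messages that are not so tidy.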

-Barry


_______________________________________________
Email-SIG mailing list