On Thu, 16 Sep 2010 16:51:58 -0400
"R. David Murray" <rdmur...@bitdance.com> wrote:
>
> What do we store in the model?  We could say that the model is always
> text.  But then we lose information about the original bytes message,
> and we can't reproduce it.  For various reasons (mailman being a big one),
> this is not acceptable.  So we could say that the model is always bytes.
> But we want access to (for example) the header values as text, so header
> lookup should take string keys and return string values[2].

Why can't you have both in a single class? If you create the class
from a bytes source (a raw message sent by SMTP, for example), the class
automatically parses and decodes it to unicode strings; if you create it
from a unicode source (the text body of the e-mail message and the list
of recipients, for example), the class automatically creates the bytes
representation.

(Of course, all processing can be done lazily for performance reasons.)

> What about email files on disk?  They could be bytes, or they could be,
> effectively, text (for example, utf-8 encoded).

Such a file can be two things:
- the raw encoding of a whole message (including headers, etc.); then it
  should be fed as a bytes object
- the single text body of a hypothetical message; then it should be fed
  as a unicode object

I don't see any possible middle ground.

> On disk, using utf-8,
> one might store the text representation of the message, rather than
> the wire-format (ASCII encoded) version.  We might want to write such
> messages from scratch.

But then the user knows the encoding (by "user" I mean what/whoever
calls the email API) and mentions it to the email package.

What I'm having an issue with is that you are talking about a bytes
representation and a unicode representation of a message. But they
aren't representations of the same thing:
- if it's a bytes representation, it will be the whole raw message,
  including envelope / headers (also MIME sections, etc.)
- if it's a unicode representation, it will only be a section of the
  message decodable as such (a text/plain MIME section, for example; or
  a decoded header value; or even a single e-mail address that is part
  of a decoded header)

So, there doesn't seem to be any reason for having both a BytesMessage
and a UnicodeMessage at the same abstraction level. They represent
different things at different abstraction levels.

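To make the single-class idea above concrete, here is a minimal sketch.
It is purely illustrative: the `Message` class and its `from_bytes`,
`from_text`, `header`, `body` and `as_bytes` names are hypothetical and
are not the email package API. It also assumes CRLF line endings and
ignores MIME structure, header folding and RFC 2047 encoded words.

```python
class Message:
    """Hypothetical single message class (not the email package API)."""

    def __init__(self, raw=None, headers=None, body=None):
        self._raw = raw          # bytes: whole wire-format message, or None
        self._headers = headers  # dict of str -> str, or None until parsed
        self._body = body        # str, or None until parsed

    @classmethod
    def from_bytes(cls, raw):
        # Whole raw message (headers + body): stored as-is, parsed lazily.
        return cls(raw=raw)

    @classmethod
    def from_text(cls, body, headers):
        # Text body plus text headers: bytes are generated lazily.
        return cls(headers=dict(headers), body=body)

    def _parse(self):
        # Assumption: headers and body are separated by an empty CRLF line.
        head, _, body = self._raw.partition(b"\r\n\r\n")
        self._headers = {}
        for line in head.decode("utf-8", "surrogateescape").split("\r\n"):
            name, _, value = line.partition(":")
            self._headers[name] = value.strip()
        self._body = body.decode("utf-8", "surrogateescape")

    def header(self, name):
        if self._headers is None:
            self._parse()
        return self._headers[name]

    def body(self):
        if self._body is None:
            self._parse()
        return self._body

    def as_bytes(self):
        if self._raw is None:
            lines = ["{0}: {1}".format(k, v) for k, v in self._headers.items()]
            text = "\r\n".join(lines) + "\r\n\r\n" + self._body
            self._raw = text.encode("utf-8", "surrogateescape")
        return self._raw
```

A caller holding raw SMTP data would use `Message.from_bytes(data)` and
read `header("Subject")` as text; a caller composing a new message would
use `Message.from_text(body, headers)` and call `as_bytes()` to get the
wire format. Either way, both sides deal with the same class.
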
I don't see any potential for confusion: raw assembled e-mail message =
bytes; decoded text section of a message = unicode.

As for the problem of potentially "bogus" raw e-mail data (e.g.,
undecodable headers), well, I guess the library has to make a choice
between purity and practicality, or perhaps let users choose for
themselves, for example through a `strict` flag. If `strict` is true,
raise an error as soon as a non-decodable byte appears in a header; if
`strict` is false, decode it through a default (encoding, errors)
convention which can be overridden by the user (a sensible possibility
being "utf-8, surrogateescape", to allow for lossless round-tripping).

> As I said above, we could insist that files on
> disk be in wire-format, and for many applications that would work fine,
> but I think people would get mad at us if we didn't support text files[3].

Again, these simply seem to be two different abstraction levels:
pre-generated raw email messages including headers, or a single text
waiting to be embedded in an actual e-mail.

> Anyway, what polymorphism means in email is that if you put in bytes,
> you get a BytesMessage, if you put in strings you get a StringMessage,
> and if you want the other one you convert.

And then you have two separate worlds, while ultimately the same
concepts underlie both. A library accepting BytesMessage will crash when
a program wants to give it a StringMessage, and vice versa. That doesn't
sound very practical.

> [1] Now that surrogateescape exists, one might suppose that strings
> could be used as an 8bit channel, but that only works if you don't need
> to *parse* the non-ASCII data, just transmit it.

Well, you can parse it, precisely.
Not only that, but it also round-trips if you unparse it again:

  >>> header_bytes = b"From: bogus\xFFname <some...@python.com>"
  >>> name, value = header_bytes.decode("utf-8", "surrogateescape").split(":")
  >>> name
  'From'
  >>> value
  ' bogus\udcffname <some...@python.com>'
  >>> "{0}:{1}".format(name, value).encode("utf-8", "surrogateescape")
  b'From: bogus\xffname <some...@python.com>'

In the end, what I would call a polymorphic best practice is "try to
avoid bytes/str polymorphism if your domain is well-defined enough"
(which I admit URLs aren't necessarily; but there's no question that a
single text/XXX e-mail section is text, and a whole assembled e-mail
message is bytes).

Regards

Antoine.
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev