Barry Warsaw writes:

> On Oct 07, 2010, at 04:40 AM, Stephen J. Turnbull wrote:
> I'm fairly certain that most of the modern causes of [Unicode
> errors in Mailman] are post-parse modifications of the message.
> IOW, in Mailman's architecture, we try to parse the raw data into a
> Message object tree very early in the pipeline, and then a pickled
> version of that gets passed between the queue runners.
>
> Where we've gotten into trouble before has been things like adding
> the Subject prefixes and such.

Not to mention those wonderful unremovable addresses containing TAB
etc.  But I'm pretty sure I've seen reports at least in 2.1.9, and
probably more recently than that, where there was 8-bit content in a
header of the incoming message and Mailman blew up on that.  This is
stuff that should have been shunted explicitly, but instead managed to
get out of the parser and then blow up.  I don't think the errors I'm
thinking about were due to Mailman manipulations, but rather
insufficient paranoia in handling incoming hazmat.

> That seems like application logic that the email package can't
> really get involved with, and indeed Mailman has built up a raft of
> defense for failures of this kind.

But adding Subject prefixes and the like shouldn't be a problem as
long as the internal representation of each message object (bytes vs
str) is fixed and the representation is opaque, so that the module can
do appropriate conversions when necessary.  The problem that you face
in Python 2 is that that separation is not properly made, and the same
values in the message object can often serve as text and as wire
format, and it's hard to tell which is which.  The Unicode handling is
tacked on as an afterthought.  That mess is entirely unnecessary in
Python 3.
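
To make the bytes-versus-str separation concrete, here is a minimal
sketch of adding a Subject prefix with the parser doing all the
conversions.  It assumes the present-day Python 3 email package with
policy.default, which postdates this thread and is not necessarily the
email6 design under discussion.  The helper name add_subject_prefix,
the "[Demo] " prefix, and the sample message are invented for
illustration; this is not Mailman's code.

```python
from email import message_from_bytes, policy


def add_subject_prefix(wire: bytes, prefix: str) -> bytes:
    """Parse wire-format bytes, edit the Subject as text, re-serialize.

    Hypothetical helper for illustration only.
    """
    # bytes -> Message object: the parser owns all decoding.
    msg = message_from_bytes(wire, policy=policy.default)

    # Message -> str: the header comes back as ordinary text, with any
    # RFC 2047 encoded words already decoded.
    subject = str(msg["Subject"] or "")
    if subject.startswith(prefix):
        return msg.as_bytes()  # already prefixed, nothing to do

    # Plain str concatenation; no charset handling in application code.
    new_subject = prefix + subject
    if "Subject" in msg:
        msg.replace_header("Subject", new_subject)
    else:
        msg["Subject"] = new_subject

    # Message object -> bytes: the package re-encodes for the wire.
    return msg.as_bytes()


# Sample incoming message (made up): the Subject carries a non-ASCII
# character as an RFC 2047 encoded word.
RAW = (
    b"From: sender@example.com\r\n"
    b"To: list@example.com\r\n"
    b"Subject: =?utf-8?q?Caf=C3=A9?= menu\r\n"
    b"\r\n"
    b"Lunch at noon.\r\n"
)

if __name__ == "__main__":
    out = add_subject_prefix(RAW, "[Demo] ")
    reparsed = message_from_bytes(out, policy=policy.default)
    print(reparsed["Subject"])
```

The application code only ever touches str for the header text and
bytes for the wire format; which encoding appears on the wire is left
to the package.
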
Text and wire format can be easily distinguished with three different
representations of email: Unicode for the conceptual RFC 822 layer (of
course this is an extension, because RFC 822 itself is strictly
limited to the ASCII subset), bytes for wire format, and Message
objects for modern structured mail (including MIME, etc).  *If* email6
is reengineered with that kind of structure, then you should be able
to dispense with almost all of the raft of defense, because the email
module will give you well-behaved Message objects, whose text
components (including the header) are well-behaved character strings
that mix seamlessly with other character strings.  Maybe even in
email5 ....
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev