Re: [Python-3000] email libraries: use byte or unicode strings?

Glenn Linderman Thu, 06 Nov 2008 11:48:13 -0800

Sorry, this one scrolled off the top, and I didn't read it before sending my other reply.

On approximately 11/6/2008 9:02 AM, came the following characters from the keyboard of Barry Warsaw:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
On Nov 5, 2008, at 6:39 PM, Glenn Linderman wrote:
This is an interesting perspective... "stuff em" does come to mind :)
But I'm not at all clear on what you mean by a round-trip through the email module. Let me see... if you are creating an email, you (1) should encode it properly (2) a round-trip is mostly meaningless, unless you send it to yourself. So you probably mean email that is received, and that you want to send on. In this case, there is already a composed/encoded form of the email in hand; it could simply be sent as is without decoding or re-encoding. That would be quite a clean round-trip!
There are two ways to create an email DOM. One is out of whole cloth (i.e. creating Message objects and their subclasses, then attaching them into a tree). Note that it is a "generator" whose job it is to take the DOM and produce an RFC-compliant flat textual representation.

I grok this one; but think that for the generator, keeping things in Unicode until the last minute could be useful. Maybe not as useful as converting immediately to bytes, though, to reduce the amount of duplicated code.
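As a rough illustration of the "out of whole cloth" path, here is a minimal sketch using the standard library email package as it exists in current Python 3 (not the code under discussion in this thread). The addresses, subject, and the helper name build_and_flatten are made up for the example.

```python
from email.generator import Generator
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from io import StringIO


def build_and_flatten():
    """Build a small message tree by hand, then flatten it to text."""
    # Root of the tree: a multipart container with a few headers.
    root = MIMEMultipart()
    root["Subject"] = "hello"
    root["From"] = "a@example.com"
    root["To"] = "b@example.com"

    # Attach two leaf parts to the root.
    root.attach(MIMEText("plain body", "plain"))
    root.attach(MIMEText("<p>html body</p>", "html"))

    # The generator walks the tree and writes the flat representation.
    out = StringIO()
    Generator(out, mangle_from_=False).flatten(root)
    return root, out.getvalue()


root, flat = build_and_flatten()
```

Here the tree holds str values and the generator emits text; a bytes-oriented design would move the encoding step to the edges instead.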

The other way to get a DOM is to parse some flat textual representation. In this case, it is a core design requirement that the parser never throws an exception, and that there is a way to record and retrieve the defects in a message.
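A small sketch of that requirement, assuming the current Python 3 standard library email package and its default policy: a multipart message that declares no boundary still parses without raising, and the problems are recorded on the message's defects list. The sample message is invented for the example.

```python
import email

# A deliberately broken message: multipart content type, no boundary.
RAW = (
    "From: a@example.com\n"
    "Content-Type: multipart/mixed\n"
    "\n"
    "body text\n"
)

# Parsing does not raise under the default policy; defects are collected.
msg = email.message_from_string(RAW)
defect_names = [type(d).__name__ for d in msg.defects]
```

The caller can inspect defect_names (or msg.defects directly) at whatever level is able to act on the problem.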

Sure, this makes sense. My other message suggested keeping the message flat, and using cached pointers and lengths. Of course, editing with such a technique could be a problem, because the pointers would have to be updated. A MIME-mimicking tree of flat subchunks comes to mind...
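One way to picture the "flat message plus cached pointers and lengths" idea is the sketch below. It is purely hypothetical (index_headers is not part of the email package): it leaves the raw bytes untouched and records an (offset, length) pair for each header, folding continuation lines into the header they belong to.

```python
def index_headers(raw: bytes):
    """Return ([(offset, length), ...] for each header, body offset)."""
    spans = []
    pos = 0
    while pos < len(raw):
        end = raw.find(b"\n", pos)
        end = len(raw) if end == -1 else end + 1
        line = raw[pos:end]
        if line in (b"\n", b"\r\n"):
            # Blank line: headers are over, the body starts after it.
            return spans, end
        if line[:1] in (b" ", b"\t") and spans:
            # Continuation line: extend the previous header's span.
            start, length = spans[-1]
            spans[-1] = (start, length + len(line))
        else:
            spans.append((pos, len(line)))
        pos = end
    return spans, len(raw)


RAW = b"From: a@example.com\nSubject: a long\n subject\n\nbody\n"
spans, body_start = index_headers(RAW)
```

Any edit that changes a header's length invalidates every later offset, which is the updating problem mentioned above.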

The core model objects of Message (and their MIME subclasses) and Header should treat everything internally as bytes. The edges are where you want to be able to accept varying types, but always convert to bytes internally. Edges of this system include the parser, the generator, and various setter and getter methods of Message and Header.
The current code has a strong desire to be idempotent, so that parser->DOM->generator output is exactly the same as input. Small changes to the DOM or content in between should have minimal effect. For example, if you delete a header and then add it back, the header will show up at the end of the RFC 2822 header list, but everything else about the message will be unchanged.
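The behaviour described here can be sketched with the current Python 3 standard library email package (an assumption; the thread predates it). The sample message is invented: a well-formed message is parsed and regenerated, then one header is deleted and re-added.

```python
import email

RAW = (
    "From: a@example.com\n"
    "Subject: test\n"
    "To: b@example.com\n"
    "\n"
    "body\n"
)

# parser -> DOM -> generator with no changes in between.
msg = email.message_from_string(RAW)
round_trip = msg.as_string()

# Delete a header and add it back: it moves to the end of the list.
subject = msg["Subject"]
del msg["Subject"]
msg["Subject"] = subject
header_order = msg.keys()
edited = msg.as_string()
```

For this simple, defect-free input round_trip matches RAW, and after the edit only the position of the Subject header differs.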

Ah, this is your definition of idempotent! Which is what I expected, but wasn't sure.

This is reasonable. One _could_ even convince the header to show up in the original spot, if you keep a NULL header placeholder around for deleted headers.... that would vanish only when regenerating.
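A toy sketch of the placeholder idea follows. Everything in it (the PlaceholderHeaders class and its methods) is hypothetical and not part of the email package: deleting a header leaves an empty slot, re-adding the same name fills that slot, and leftover empty slots are dropped when the headers are flattened.

```python
class PlaceholderHeaders:
    """Ordered header list that keeps a slot for each deleted header."""

    def __init__(self, pairs):
        # Each slot is [name, value]; value None marks a deleted header.
        self._slots = [[name, value] for name, value in pairs]

    def delete(self, name):
        for slot in self._slots:
            if slot[0].lower() == name.lower():
                slot[1] = None

    def add(self, name, value):
        # Reuse the first empty slot with this name, if there is one.
        for slot in self._slots:
            if slot[0].lower() == name.lower() and slot[1] is None:
                slot[1] = value
                return
        self._slots.append([name, value])

    def flatten(self):
        # Placeholders vanish only here, when regenerating.
        return "".join(
            f"{name}: {value}\n"
            for name, value in self._slots
            if value is not None
        )


headers = PlaceholderHeaders(
    [("From", "a@example.com"), ("Subject", "old"), ("To", "b@example.com")]
)
headers.delete("Subject")
without_subject = headers.flatten()
headers.add("Subject", "new")
with_subject = headers.flatten()
```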

Currently idempotency is broken for defective messages. The generator is guaranteed to produce RFC-compliant output, repairing defects like missing boundaries and such.



So it seems you are happy with this level of "fixing" things?

I guess I'm not terribly concerned about the readability of improperly encoded email messages, whether they are spam or ham. For the purposes of SpamBayes (which I assume is similar to spamassassin, only written in Python), it doesn't matter if the data is readable, only that it is recognizably similar. So a consistent mis-transliteration is as good as a correct decoding.
The key thing is that parse should never ever raise an exception. We've learned the hard way that this is the most practical thing because at the level most parsing happens, you really cannot handle any errors.

So you don't have a goal to make mangled, multi-character encodings suddenly be readable via the email lib? Only to provide the data in raw form, so that Mr. Turnbull can implement that on top, in emacs?

For ham, the correspondent should be informed that there are problems with their software, so that they can upgrade or reconfigure it.
That's a practical impossibility in real-world applications, as is simply discarding malformed messages. Email sucks.

I agree it is impossible to do that automatically. But if a correspondent suddenly gets broken software, I attempt to inform them of that... and as long as their email address comes through, I can...

And I don't think I've ever proposed discarding malformed messages; just transliterating them in some way that (drum roll) doesn't cause exceptions...

Sorry I wrote a bit before looking at the API, which is more robust than I expected, from Mr. Turnbull's writings. I am curious what API deficiencies have been identified... is there a list somewhere?

My summary tried to be a start on that, or an augmentation. Seems I tried to get to bug# last night, but the 'net wasn't responsive. Can't find the number now, in a quick look through the messages in this thread.


--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking