On Jan 05, 2012, at 09:55 AM, Bill Janssen wrote:

>Folks, I'm working on an implementation of RFC 5256 email threading,
>designed so that it could fit as a submodule in the "email" package, if
>such a thing was ever seen to be useful.

I really like the idea of threading support being included in the email
package.  (I admit that I don't have time right now to read the RFC.)

My general thoughts are that the actual messages needn't be included in the
thread collection, but perhaps just Message-IDs.  That would allow an
application to store the actual message objects anywhere they want, and would
reduce space requirements of the thread collection.

>I'd like to ask "the wisdom of the crowd" what they think an appropriate
>interface to such a thing would be?  The basic operation is that you
>create a collection (type C) of email threads (type T) by passing a set
>of messages (type M) to the constructor.
>
>* Should M be required to be "email.message.Message", or perhaps some
>  less restrictive type, say "ThreadableMessageAPI"?  All that's
>  strictly required is the ability to retrieve the Message-ID, Subject,
>  Date, References, and In-Reply-To fields.

I think it would be fine then to allow duck-typing of the input objects.  I
don't have a sense of whether it needs a formal (as in Python's ABCs)
interface type.

>* What operations should be possible on C?  Some that come to mind:
>
>  * retrieve_thread (M or message-id) => T

Message-ID as input.

>  * add_message (M) => T

Duck-typed message.

>  * add_messages (set of M) => None
>  * remove_message (M or message-id) => T (or None) ?

Probably Message-ID as the input.  I guess the rule would be that if you need
all the headers you mention above, a duck-typed message would be required.
For operations that only need the Message-ID, just accept that.  And you
probably want the full Message-ID header value, e.g. it would include the
angle brackets.

>* What's the interface for T?  It's a tree with possible dummy nodes, so
>  a tuple of messages plus nested tuples would do it.  What should the
>  nodes in the tree be?  Normalized (see RFC 5256) Message-IDs?
>  email.message.Message instances?

Will the tree get mutated when a message is added in the middle of a thread,
or will you generate a new tree?  That would make a difference for
tuple-of-tuples or list-of-lists.

I think the nodes would be Message-IDs, but you'd need a public API for
normalizing them, and my application would have to make sure that my messages
are normalized (or at least the lookup keys are) or I might not be able to
find a message given its normalized id.  OTOH, maybe the message parser or
message object itself should provide an API for normalizing ids?

Let's think about some use cases.

- given any message, find the entire thread it's a part of
- given a message, find all children
- given a message, find a path to the root of the thread
- find the parts of the thread that fall within a date range
- find the parts of a thread with a matching subject

>* For large sets of threads (millions of messages) a persistence
>  mechanism would be useful.  Should there be a standard interface to
>  such a mechanism, perhaps as class methods on C?  If so, what should
>  it look like?  Should the implementation contain a default persistent
>  subclass of C, based on sqlite3?  What side-effects would persistence
>  requirements have on the other design considerations?  For instance,
>  would you have to save the entire text of a message for each node?
>  Just the headers?  Just some of the headers?  Just the Message-ID?

Great questions.  We've long talked about a persistence mechanism for message
parts (e.g. store the big binary parts on disk instead of in memory).  Some
consistency of design would be good here.  But I agree that persistence
should definitely be part of the story, and it needs to be pluggable.

Have to think more about this, but a big +1 for the idea.  It would serve as a
very good component for the ideas I have about a next generation email
archiver.
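
To make the first three use cases above a bit more concrete, here is a rough
sketch of a collection that stores only Message-IDs and accepts duck-typed
messages.  Everything in it is an assumption made for illustration, not a
proposed API: the names ThreadIndex, normalize_id, add_message, find_children,
path_to_root and thread_of are hypothetical, normalize_id only strips
whitespace (a stand-in, not the normalization RFC 5256 describes), and the
parent is simply taken from the last References entry (or In-Reply-To), which
is much simpler than the real threading algorithm.  Date-range and subject
lookups are left out.

```python
from email.message import Message


def normalize_id(msgid):
    """Hypothetical stand-in for real normalization: strips whitespace only."""
    return msgid.strip()


class ThreadIndex:
    """Toy thread collection holding Message-IDs only, never the messages."""

    def __init__(self):
        self.parent = {}    # message-id -> parent message-id, or None for a root
        self.children = {}  # message-id -> list of child message-ids

    def add_message(self, msg):
        # Duck-typed: msg needs msg["Message-ID"] and a dict-style .get().
        msgid = normalize_id(msg["Message-ID"])
        refs = (msg.get("References") or msg.get("In-Reply-To") or "").split()
        parent = normalize_id(refs[-1]) if refs else None
        if parent == msgid:  # ignore a message that references itself
            parent = None
        self.parent[msgid] = parent
        self.children.setdefault(msgid, [])
        if parent is not None:
            # An unseen parent becomes a dummy node until it is added itself.
            self.parent.setdefault(parent, None)
            siblings = self.children.setdefault(parent, [])
            if msgid not in siblings:
                siblings.append(msgid)
        return msgid

    def find_children(self, msgid):
        return list(self.children.get(normalize_id(msgid), []))

    def path_to_root(self, msgid):
        path = [normalize_id(msgid)]
        while True:
            parent = self.parent.get(path[-1])
            if parent is None or parent in path:  # root reached, or a loop
                return path
            path.append(parent)

    def thread_of(self, msgid):
        # Walk up to the root, then collect the whole tree depth-first.
        root = self.path_to_root(msgid)[-1]
        found, stack = [], [root]
        while stack:
            node = stack.pop()
            if node not in found:
                found.append(node)
                stack.extend(reversed(self.children.get(node, [])))
        return found


def make(msgid, references=None):
    """Build a minimal email.message.Message for the example."""
    msg = Message()
    msg["Message-ID"] = msgid
    if references:
        msg["References"] = references
    return msg


index = ThreadIndex()
for m in (make("<a@x>"), make("<b@x>", "<a@x>"),
          make("<c@x>", "<a@x> <b@x>"), make("<d@x>", "<a@x>")):
    index.add_message(m)
```

Because only Message-IDs are stored, the application is free to keep the
message objects wherever it likes; the cost is that every lookup key has to go
through the same normalization as the stored ids.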

-Barry
_______________________________________________
Email-SIG mailing list
Email-SIG@python.org
Your options: http://mail.python.org/mailman/options/email-sig/archive%40mail-archive.com