On Fri, 11 Jul 2003, Cyrus Daboo wrote: > Unfortunately I have of late been running more and more into the problem of > users wanting to THREAD very large mailboxes and complaining when it takes > a long time. They are not keen on restricting the input message set to e.g. > just unseen or messages sent within a recent period of time.
Restricting the input message set isn't very useful with threading, since it changes the thread output even if you end up filtering the output. You need to consider all the messages in order to get the relationships right. > I would > strongly urge server implementers to look beyond merely caching references > headers, and to come up with smart caching of the actual thread tree > structure itself. I'm surprised that you're having a performance problem if you cache the references. At least in my server, the cost is in loading the sortcache (which is a once-only hit) rather than in actually doing the threading which seems to be more or less instantaneous. Do you cache the references as a single string, or as a linked list of message-ids? I do the latter. I also cache the extracted subject, date, and message-id. You may also want to check into your hashing algorithm, and make sure that your hash table is large enough. If you use a general hashing algorithm, be sure that it allows you to make the hash buckets any size that you want. Each bucket needs the sortcache pointer, a parent pointer, a next-sibling pointer, and a child pointer, for a total of four data words per bucket. Since none of the hash table packages worked the way I wanted, I wrote my own. Pretty basic, doing a polynomial of each byte of the key, with a prime-sized hash table. But it seems fast enough, and since it was my own algorithm I was able to make sure that it had all the data I need. Remember: "any algorithm is trivial if you have the right data structures"! -- Mark -- http://staff.washington.edu/mrc Science does not emerge from voting, party politics, or public debate. Si vis pacem, para bellum.
