On Feb 17, 2009, at 12:49 PM, Greg Landrum wrote:
I don't know Judy trees. Do you have a reference/pointer? They aren't by any chance connected to the thing presented in Andrew Smellie's recent paper (haven't read it yet)? http://pubs.acs.org/doi/abs/10.1021/ci800325v
Thanks for the link. I've read the paper now. It's very nicely written, clear, with good background, description, and results, and it explained something that not only hadn't I thought of but in retrospect can't understand why I didn't think of it. I'll summarize it in my word. Take the fingerprints, store them in a binary tree. 0 bit means left, 1 bit means right. The depth in the simple case is the number of bits in the tree. To find matches for a given query fp, descend the tree. For each 0 bit in the query, descend both left and right branches. For each 1 bit, descent only the right bit. The terminal leaves are the matching fingerprints. Andrew Smellie didn't take what I thought would be the obvious next step and use a trie (probably a Patricia trie) to store things. He uses a binary tree, and each node takes 16 bytes (4 bytes for left child, 4 bytes for right child; where is the rest?) He even comments they they ran out of memory for their nearly 1,000,000 compound data set and outlines reasons why. They switched to using run-length encoding, but trie structures I've read about should do better. For a test set they selected N compounds randomly (N=10,000; 100,000; 450,477 and 967,749) and chose the first 2000 compounds as the queries. The times were about 10x faster than brute force. (Though I would like to see how they implemented the brute force code. There's plenty of ways to make it faster than the first-pass solution.) The paper also points out how to use the tree for Tanimoto searches. He uses a "Tanimoto lookahead condition." I think that could be combined with a priority queue to to make the search even faster, but that's just a reflex statement without backing. Now I want to try it out! :)
Andrew da...@dalkescientific.com