On Feb 17, 2009, at 12:49 PM, Greg Landrum wrote:
I don't know Judy trees. Do you have a reference/pointer?
They aren't by any chance connected to the thing presented in Andrew
Smellie's recent paper (haven't read it yet)?
http://pubs.acs.org/doi/abs/10.1021/ci800325v

Thanks for the link. I've read the paper now. It's very
nicely written, clear, with good background, description,
and results, and it explained something that not only
hadn't I thought of but in retrospect can't understand
why I didn't think of it.

I'll summarize it in my word. Take the fingerprints,
store them in a binary tree. 0 bit means left, 1 bit
means right. The depth in the simple case is the number
of bits in the tree. To find matches for a given query
fp, descend the tree. For each 0 bit in the query,
descend both left and right branches. For each 1 bit,
descent only the right bit. The terminal leaves are
the matching fingerprints.

Andrew Smellie didn't take what I thought would be
the obvious next step and use a trie (probably a
Patricia trie) to store things. He uses a binary
tree, and each node takes 16 bytes (4 bytes for
left child, 4 bytes for right child; where is the
rest?) He even comments they they ran out of memory
for their nearly 1,000,000 compound data set and
outlines reasons why. They switched to using
run-length encoding, but trie structures I've read
about should do better.

For a test set they selected N compounds randomly
(N=10,000; 100,000; 450,477 and 967,749) and chose
the first 2000 compounds as the queries. The times
were about 10x faster than brute force.

(Though I would like to see how they implemented
the brute force code. There's plenty of ways to
make it faster than the first-pass solution.)

The paper also points out how to use the tree for
Tanimoto searches. He uses a "Tanimoto lookahead
condition." I think that could be combined with a
priority queue to to make the search even faster,
but that's just a reflex statement without backing.

Now I want to try it out! :)



                                Andrew
                                da...@dalkescientific.com



Reply via email to