On Feb 17, 2009, at 8:30 PM, Andrew Dalke wrote:
Thanks for the link. I've read [Andrew Smellie's] paper now. It's very
nicely written, clear, with good background, description,
and results,
...
Andrew Smellie didn't take what I thought would be
the obvious next step and use a trie (probably a
Patricia trie) to store things.
...
Now I want to try it out! :)
I made a first pass implementation of my trie idea using
Python. It's attached. My quick summary is:
fingerprint size of time for size of time for time for
size bits bits trie trie trie (no hits)
=========== ======= ======== ======= ======== ========
98837 14 MB 0.05s 59 MB 0.6s 0.04s
198819 28 MB 0.09s 105 MB 0.7s 0.07s
Based on the paper, it seems his C code requires about
160 MB for 200,000 compounds so there's a clear savings
due to at least one of the two memory savings techniques
I tried.
I expect faster performance for the regular "&" based
screens if I were to do the test directly in C. I'm
guessing a factor of 3.
I expect an even better performance increase and reduced
memory use were I to rewrite the trie in C. There I
expect about 10x-30x performance, which brings them
to the same times. But that's a scientific wild-ass guess.
There's no real conclusion you can draw from these numbers
other than that I've managed to save a lot of memory and
that this is an avenue worth considering.
Andrew
[email protected]