On Feb 17, 2009, at 8:30 PM, Andrew Dalke wrote:
Thanks for the link. I've read [Andrew Smellie's] paper now. It's very nicely written, clear, with good background, description, and results,
...
Andrew Smellie didn't take what I thought would be the obvious next step and use a trie (probably a Patricia trie) to store things.
...
Now I want to try it out! :)
I made a first pass implementation of my trie idea using Python. It's attached. My quick summary is: fingerprint size of time for size of time for time for size bits bits trie trie trie (no hits) =========== ======= ======== ======= ======== ======== 98837 14 MB 0.05s 59 MB 0.6s 0.04s 198819 28 MB 0.09s 105 MB 0.7s 0.07s Based on the paper, it seems his C code requires about 160 MB for 200,000 compounds so there's a clear savings due to at least one of the two memory savings techniques I tried. I expect faster performance for the regular "&" based screens if I were to do the test directly in C. I'm guessing a factor of 3. I expect an even better performance increase and reduced memory use were I to rewrite the trie in C. There I expect about 10x-30x performance, which brings them to the same times. But that's a scientific wild-ass guess. There's no real conclusion you can draw from these numbers other than that I've managed to save a lot of memory and that this is an avenue worth considering. Andrew da...@dalkescientific.com