On Feb 17, 2009, at 12:40 PM, Greg Landrum wrote:
Well, now I'm incredibly behind in all this. I will try to slowly catch up.
That'll teach you not to take a vacation. ;) Seriously though, I was writing as I worked, which means there's a lot of verbiage and places where I wasn't clear on things. The last email puts everything together.
I've generated a new, larger, testing dataset using the pubchem HTS compounds. I will also post the details on those (hopefully this morning).
Cool. I've asked a few people/lists for data sets but no response yet. There's a few I'll try.
I don't know Judy trees. Do you have a reference/pointer?
Oops, "judy array" http://judy.sourceforge.net/ http://en.wikipedia.org/wiki/Judy_array and I did a (buggy as it turns out) wrapper at http://www.dalkescientific.com/Python/PyJudy.html when I last looked into substructure fp filters. My idea then and now was to store a mapping from: unique path identifier -> sorted list of matching compounds Substructure filtering is the same as generating all paths and finding the intersection of the sorted lists. I think this is called an inverted index. Most paths are rare, so storing all those paths doesn't take much space. I was thinking that a sorted list works better than a hash or normal trie because I could do an N-way merge to find the intersection, rather than a lot of membership tests. But in reflection, the latter may be faster. Looks like more testing will occur.
They aren't by any chance connected to the thing presented in Andrew Smellie's recent paper (haven't read it yet)? http://pubs.acs.org/doi/abs/10.1021/ci800325v
Not at all. I really need to visit the library soon. Or pay $30 for 24 hour access to ACS, plus unknown price for access to Ullmann's paper.
I think it's worth looking into branched paths as well for real substructure searches. People don't query with linear fragments all that often, so it seems like it would be a win.
While people don't query with liner fragments, more complex structures contain linear subparts, including crossing paths. My thought was that linear paths are easy to generate and canonicalize, and would give a baseline limit to more sophisticated schemes. Andrew da...@dalkescientific.com