On Feb 17, 2009, at 12:40 PM, Greg Landrum wrote:
Well, now I'm incredibly behind in all this. I will try to slowly catch up.

That'll teach you not to take a vacation.  ;)

Seriously though, I was writing as I worked, which means there's
a lot of verbiage and places where I wasn't clear on things.  The
last email puts everything together.

I've generated a new, larger, testing dataset using the pubchem HTS
compounds. I will also post the details on those (hopefully this
morning).

Cool. I've asked a few people/lists for data sets but no response
yet. There's a few I'll try.

I don't know Judy trees. Do you have a reference/pointer?


Oops, "judy array"
  http://judy.sourceforge.net/
  http://en.wikipedia.org/wiki/Judy_array
and I did a (buggy as it turns out) wrapper at
  http://www.dalkescientific.com/Python/PyJudy.html
when I last looked into substructure fp filters.

My idea then and now was to store a mapping from:
   unique path identifier -> sorted list of matching compounds

Substructure filtering is the same as generating all paths
and finding the intersection of the sorted lists.

I think this is called an inverted index. Most paths are
rare, so storing all those paths doesn't take much space.

I was thinking that a sorted list works better than a
hash or normal trie because I could do an N-way merge
to find the intersection, rather than a lot of membership
tests. But in reflection, the latter may be faster.
Looks like more testing will occur.

They aren't by any chance connected to the thing presented in Andrew
Smellie's recent paper (haven't read it yet)?
http://pubs.acs.org/doi/abs/10.1021/ci800325v


Not at all. I really need to visit the library soon.
Or pay $30 for 24 hour access to ACS, plus unknown
price for access to Ullmann's paper.

I think it's worth looking into branched paths as well for real
substructure searches. People don't query with linear fragments all
that often, so it seems like it would be a win.


While people don't query with liner fragments, more complex
structures contain linear subparts, including crossing paths.

My thought was that linear paths are easy to generate and
canonicalize, and would give a baseline limit to more
sophisticated schemes.




                                Andrew
                                da...@dalkescientific.com



Reply via email to