On Feb 10, 2009, at 3:11 PM, Greg Landrum wrote:
> It would be quite interesting to use the new Ullmann code as a framework and do an implementation of the VF or VF2 algorithms used in vflib.
I was surprised to see that there's a public BGL implementation of Ullmann. I've seen mentions that people had in-house versions, but that was it. I admit, though, that I haven't tracked CDL. I see that this Ullmann implementation is faster than the VFLib-based one; the CDK people have had a different experience with their implementations. But you wrote that you had to convert between the BGL and VFLib graphs, which explains at least part of the advantage.
To get a set of "molecule-like" queries, I fragmented those 1000 molecules using RECAP and kept the 823 unique fragments I got.
I've wondered if it would be possible to get a set of representative structure queries from PubChem or ChemSpider. But I figured there would be some privacy issues because some people might have inappropriately submitted internal proprietary structures. I know when I did bioinformatics work people did that with their sequences. Bioinformatics isn't as secretive as cheminformatics though.
The layered fps are clearly more robust w.r.t. fingerprint size, which makes sense: I only set one bit per path there, as opposed to 4 bits per path for the RDKit fp. A good experiment would be to try the RDKit fps with one bit per path.
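To make the one-bit-versus-four-bits distinction concrete, here's a minimal sketch of the idea. This is not the actual RDKit code; the function name and parameters are mine. It just assumes, as described below, that each path's characteristic hash value seeds a PRNG which then emits the bit positions:

```python
import random

def path_bits(path_hash, fp_size=2048, bits_per_path=4):
    """Turn one path's characteristic hash into fingerprint bit positions.

    Sketch only: the path's hash seeds a PRNG, which then yields
    bits_per_path positions.  bits_per_path=1 mimics the layered-fp
    behavior; bits_per_path=4 mimics the RDKit-fp behavior.
    """
    rng = random.Random(path_hash)  # same path -> same seed -> same bits
    return [rng.randrange(fp_size) for _ in range(bits_per_path)]
```

With one bit per path, each path consumes a quarter of the bit positions it would otherwise, which is why the one-bit variant should degrade more gracefully as the fingerprint shrinks.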
There ought to be some math that can help out. I tried working it out but came to the conclusion that there must be a large set of well-correlated bits. Why? Because if p and q are the probabilities that a given bit is set in the query and in the target, then

  P(a given bit screens) = P(query bit is on) * P(target bit is off) = p * (1-q)

and

  P(query is a subset of the target) = (1 - P(a given bit screens))**(number of on-bits)
                                     = (1 - p*(1-q))**(N*p)

If p and q were, say, 0.2, then the odds of a random query fingerprint being a subset of a target fingerprint would be about 1E-32, while Greg measured about 7%.

That leads to one thought: if you were to filter out the top few hundred or so subgraph seeds (the characteristic values you use to seed the PRNG), would that make the filters more selective? Handwaving: "C" and "CC" occur in nearly everything, so those bits aren't very selective, and since there are several hundred subgraphs that can map to a given bit, these common bits obscure more selective tests. Hmmm, what does a correlation matrix plot between the fingerprint bits look like? I haven't gotten my hands dirty, as it were, with this sort of data, so I don't have a good intuition for what works well and where the limitations are.
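Plugging in the numbers makes the gap obvious. This is just the back-of-the-envelope estimate above; N=2048 is my assumption for the fingerprint length (the email doesn't give one), and the bits are treated as independent, which the 7% measurement suggests they are not:

```python
p = 0.2            # P(a given bit is on in the query fingerprint)
q = 0.2            # P(a given bit is on in the target fingerprint)
N = 2048           # fingerprint length -- assumed, not from the email

p_screen = p * (1 - q)            # chance one query bit eliminates the target
n_on = N * p                      # expected number of on-bits in the query
p_subset = (1 - p_screen) ** n_on
print(f"P(random query is a subset) ~ {p_subset:.0e}")
# tiny (order 1e-31 for these assumptions) vs. the measured ~7%
```

The dozens-of-orders-of-magnitude mismatch between this independence estimate and the observed rate is exactly what points at heavily correlated bits.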
So what does "the community" think? Interesting results? Arguments about my testing/benchmarking methodology? Obvious next steps? Suggestions for improving the layered fps so that they're even more discriminating?
You compared the two different hashed fingerprint types to see if their combination was more effective than one alone. What about doing the same with the MACCS keys? The combination of two different types of filters may prove better. Otherwise, I don't have any good suggestions.

Andrew
da...@dalkescientific.com
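P.S. To make the "combination of filters" idea concrete: screening is a bit-subset test per fingerprint type, and combining types is just requiring every test to pass, so a second filter (MACCS keys, say) can only prune more candidates. A sketch with integer bitsets (the names are mine, not from any library):

```python
def is_subset(query_fp: int, target_fp: int) -> bool:
    # Every bit set in the query must also be set in the target;
    # otherwise the target can be screened out without a match attempt.
    return query_fp & ~target_fp == 0

def passes_combined_screen(query_fps, target_fps):
    # One fingerprint of each type for the query and the target,
    # in matching order.  A target survives only if it passes every
    # filter, so adding a filter type never admits new candidates.
    return all(is_subset(q, t) for q, t in zip(query_fps, target_fps))
```

The interesting empirical question is whether the second filter's bits are sufficiently uncorrelated with the first's to be worth the extra cost.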