On Feb 10, 2009, at 3:11 PM, Greg Landrum wrote:
> It would be quite interesting to use the new Ullmann code as a framework and do an implementation of the VF or VF2 algorithms used in vflib.
I was surprised to see that there's a public BGL implementation of Ullmann. I've seen mentions that people had in-house versions, but that was it. I admit, though, that I haven't tracked CDL. I see that this Ullmann implementation is faster than the VFLib-based one; the CDK people have had a different experience with their implementations. But you wrote that you had to convert between the BGL and VFLib graphs, which explains at least part of the advantage.
To get a set of "molecule-like" queries, I fragmented those 1000 molecules using RECAP and kept the 823 unique fragments I got.
I've wondered if it would be possible to get a set of representative structure queries from PubChem or ChemSpider. But I figured there would be some privacy issues because some people might have inappropriately submitted internal proprietary structures. I know when I did bioinformatics work people did that with their sequences. Bioinformatics isn't as secretive as cheminformatics though.
The layered fps are clearly more robust w.r.t. fingerprint size, which makes sense: I only set one bit per path there, as opposed to 4 bits per path for the RDKit fp. A good experiment would be to try the RDKit fps with one bit per path.
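To make the one-bit-versus-four-bits distinction concrete, here's a minimal sketch of the idea. This is not the actual RDKit code; the function name and parameters are mine. It just assumes, as described below, that each path's characteristic hash value seeds a PRNG which then emits the bit positions:

```python
import random

def path_bits(path_hash, fp_size=2048, bits_per_path=4):
    """Turn one path's characteristic hash into fingerprint bit positions.

    Sketch only: the path's hash seeds a PRNG, which then yields
    bits_per_path positions.  bits_per_path=1 mimics the layered-fp
    behavior; bits_per_path=4 mimics the RDKit-fp behavior.
    """
    rng = random.Random(path_hash)  # same path -> same seed -> same bits
    return [rng.randrange(fp_size) for _ in range(bits_per_path)]
```

With one bit per path, each path consumes a quarter of the bit positions it would otherwise, which is why the one-bit variant should degrade more gracefully as the fingerprint shrinks.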
There ought to be some math that can help out. I tried working it out but came to the conclusion that there must be a large set of well-correlated bits. Why? Because if p and q are the probabilities that a given bit is set in the query and in the target, then

  P(a given bit screens) = P(query bit is on) * P(target bit is off) = p * (1-q)

and

  P(query is a subset of the target) = (1 - P(a given bit screens))**(number of on-bits)
                                     = (1 - p*(1-q))**(N*p)

If p and q were, say, 0.2, then the odds of a random query fingerprint being a subset of a target fingerprint would be about 1E-32, while Greg measured about 7%.

That leads to one thought: if you were to filter out the top few hundred or so subgraph seeds (the characteristic values you use to seed the PRNG), would that make the filters more selective? Handwaving: "C" and "CC" occur in nearly everything, so those bits aren't very selective, and since there are several hundred subgraphs that can map to a given bit, these common bits obscure more selective tests. Hmmm, what does a correlation matrix plot between the fingerprint bits look like? I haven't gotten my hands dirty, as it were, with this sort of data, so I don't have a good intuition for what works well and where the limitations are.
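Plugging in the numbers makes the gap obvious. This is just the back-of-the-envelope estimate above; N=2048 is my assumption for the fingerprint length (the email doesn't give one), and the bits are treated as independent, which the 7% measurement suggests they are not:

```python
p = 0.2            # P(a given bit is on in the query fingerprint)
q = 0.2            # P(a given bit is on in the target fingerprint)
N = 2048           # fingerprint length -- assumed, not from the email

p_screen = p * (1 - q)            # chance one query bit eliminates the target
n_on = N * p                      # expected number of on-bits in the query
p_subset = (1 - p_screen) ** n_on
print(f"P(random query is a subset) ~ {p_subset:.0e}")
# tiny (order 1e-31 for these assumptions) vs. the measured ~7%
```

The dozens-of-orders-of-magnitude mismatch between this independence estimate and the observed rate is exactly what points at heavily correlated bits.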
So what does "the community" think? Interesting results? Arguments about my testing/benchmarking methodology? Obvious next steps? Suggestions for improving the layered fps so that they're even more discriminating?
You compared the two different hashed fingerprint types to see if their combination was more effective than one alone. What about doing the same with the MACCS keys? The combination of two different types of filters may prove better. Otherwise, I don't have any good suggestions.

Andrew
da...@dalkescientific.com
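P.S. To make the "combination of filters" idea concrete: screening is a bit-subset test per fingerprint type, and combining types is just requiring every test to pass, so a second filter (MACCS keys, say) can only prune more candidates. A sketch with integer bitsets (the names are mine, not from any library):

```python
def is_subset(query_fp: int, target_fp: int) -> bool:
    # Every bit set in the query must also be set in the target;
    # otherwise the target can be screened out without a match attempt.
    return query_fp & ~target_fp == 0

def passes_combined_screen(query_fps, target_fps):
    # One fingerprint of each type for the query and the target,
    # in matching order.  A target survives only if it passes every
    # filter, so adding a filter type never admits new candidates.
    return all(is_subset(q, t) for q, t in zip(query_fps, target_fps))
```

The interesting empirical question is whether the second filter's bits are sufficiently uncorrelated with the first's to be worth the extra cost.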