On Thu, Feb 12, 2009 at 4:38 AM, Andrew Dalke <da...@dalkescientific.com> wrote:
> On Feb 10, 2009, at 3:11 PM, Greg Landrum wrote:
>
>> It would be quite interesting to use the new Ullmann code as a
>> framework and do an implementation of the VF or VF2 algorithms used in
>> vflib.
>>
>
> I was surprised to see that there's a public BGL implementation
> of Ullman. I've seen mentions that people had in-house versions
> but that was it. I admit though that I haven't tracked CDL.

Yeah, I was also surprised to see that. I had seen the CDL paper but
neither read it carefully nor looked at the code. It was a pleasant
surprise.
A minor nit: it is the "Ullmann" algorithm, not "Ullman". I also
thought it was Ullman until I got a copy of the original paper.

> I see that this Ullman implementation is faster than the VFLib-based
> one. The CDK people have different experience with their
> implementations. But you wrote that you had to convert between
> the BGL and VFLib graphs, which explains at least part of the
> advantage.

I think the conversion is all of the time difference. The VF2
algorithm is almost certainly superior to Ullmann for subgraph
isomorphism in molecules. If we assume that VFLib's Ullmann
implementation is roughly as good as its VF2 implementation, you can
see the difference in those timings. Once I have time to do a BGL
implementation of VF2 (or find one online), we'll have another 1:1
comparison.
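
For anyone who wants to reproduce that kind of comparison, the timing
harness itself is trivial; here's a rough Python sketch (illustrative
only -- which matcher actually runs underneath depends on how the C++
side is built):

import time
from rdkit import Chem

def time_queries(mols, queries):
    # run every query against every molecule; report hit count + wall time
    t0 = time.time()
    nHits = 0
    for q in queries:
        for m in mols:
            if m.HasSubstructMatch(q):
                nHits += 1
    return nHits, time.time() - t0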

>> To get a set of "molecule-like" queries, I fragmented
>> those 1000 molecules using RECAP and kept the 823 unique fragments I
>> got.
>>
>
> I've wondered if it would be possible to get a set of
> representative structure queries from PubChem or ChemSpider.
> But I figured there would be some privacy issues because some
> people might have inappropriately submitted internal proprietary
> structures. I know when I did bioinformatics work people did that
> with their sequences. Bioinformatics isn't as secretive as
> cheminformatics though.

With PubChem it doesn't seem like people should have the slightest
expectation of privacy, but you're right that the waters are murky.
It's probably better either to use synthetic queries (as I did) or to
find a public source of better ones.
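
For reference, regenerating a synthetic query set along those lines is
straightforward from Python; here's a sketch using the RECAP module
(illustrative, not the exact script I used, and the leaf fragments
carry [*] dummy atoms that would need handling before they're used as
queries):

from rdkit import Chem
from rdkit.Chem import Recap

def recap_queries(mols):
    # fragment each molecule with RECAP and keep the unique leaf fragments
    frags = set()
    for mol in mols:
        tree = Recap.RecapDecompose(mol)
        frags.update(tree.GetLeaves().keys())
    # note: the leaf SMILES contain [*] attachment points
    return [Chem.MolFromSmiles(smi) for smi in sorted(frags)]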

>> The layered fps are clearly more robust w.r.t. fingerprint size (which
>> makes sense: I only set one bit per path there as opposed to 4 per
>> path for the RDKit fp; a good experiment would be to try the RDKit fps
>> with one bit per path).
>>
>
> There ought to be some math that can help out. I tried
> working it out but came to the conclusion that there must
> be a large set of well correlated bits. Why? Because if p
> and q are the probabilities that bits are set in the query
> and the target, then
>
> P(a given bit screens)
>       = P(query bit is on) * P(target bit is off)
>       = p * (1-q)
>
> P(query is a subset of the target)
>       = (1-P(a given bit screens))**(number of on-bits)
>       = (1-p*(1-q))**(N*p)
>
> and if p and q were, say, 0.2 then the odds of a random
> query fingerprint being a subset of a target fingerprint
> would be about 1E-32, while Greg measured about 7%.

I'm either not understanding completely or I disagree. The queries
were constructed by fragmenting the molecules I searched through, so
I'd expect lots of substructure hits (and a lower screen-out rate than
arbitrary queries against arbitrary molecules would give). What I ought
to do (and will, later today) is check how many "false positives" I get
from the screening: i.e., how many molecule-query pairs pass the
fingerprint screen but don't actually contain the query as a
substructure.
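
The check itself is only a few lines; a sketch (the fingerprint
parameters here are placeholders, not the ones I used in the
benchmark):

from rdkit import Chem

def screen_false_positives(mols, queries, fpSize=2048):
    # a pair passes the screen if every on-bit of the query fingerprint
    # is also on in the molecule fingerprint; it's a false positive if
    # the molecule then fails the real substructure match
    molFps = [Chem.RDKFingerprint(m, fpSize=fpSize) for m in mols]
    nPass = nFalse = 0
    for q in queries:
        qfp = Chem.RDKFingerprint(q, fpSize=fpSize)
        nQBits = qfp.GetNumOnBits()
        for m, mfp in zip(mols, molFps):
            if (qfp & mfp).GetNumOnBits() == nQBits:
                nPass += 1
                if not m.HasSubstructMatch(q):
                    nFalse += 1
    return nPass, nFalse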

> That leads to one thought - if you were to filter out
> the top few hundred or so subgraph seeds (the characteristic
> values you use to seed the PRNG), would that make the
> filters more selective?
>
> Handwaving: if "C" and "CC" occur in nearly everything,
> then those bits aren't very selective. Since there are
> several hundred subgraphs that can map to a given bit,
> these common bits obscure more selective tests.

That's a good idea to add to the list of things to look into. It's
also relatively easy to do because it probably just involves
increasing the minimum path length included in fingerprints (at least
as a first step).
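
Concretely, that would just mean bumping the minPath argument when
building the fingerprints; something like the following, with the usual
caveat that I'm writing the call from memory and the details may differ
in your build:

from rdkit import Chem

mol = Chem.MolFromSmiles('c1ccccc1CC(=O)N')      # just an example molecule
fpDefault = Chem.RDKFingerprint(mol)             # minPath=1 by default
fpLonger = Chem.RDKFingerprint(mol, minPath=3)   # drop 1- and 2-bond paths
print(fpDefault.GetNumOnBits(), fpLonger.GetNumOnBits())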

> Hmmm, what does a correlation matrix plot between
> the fingerprint bits look like?

No clue, but it's a good question.
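
Off the top of my head, something like this would give a quick look
(numpy sketch, untested; it drops bits that are never or always set so
corrcoef stays defined):

import numpy as np
from rdkit import Chem

def bit_correlation(mols, fpSize=2048):
    # rows = molecules, columns = fingerprint bits
    X = np.array([list(Chem.RDKFingerprint(m, fpSize=fpSize)) for m in mols],
                 dtype=float)
    keep = X.std(axis=0) > 0      # constant bits have undefined correlation
    return np.corrcoef(X[:, keep], rowvar=False)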

> You compared the two different hash fingerprint types
> to see if their combination was more effective than one
> alone. What about doing the same with the MACCS keys?
> The combination of two different types of filters may
> prove better.

Looking at MACCS is a good idea. I'll also put that on the list.
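
The two-stage screen is easy to prototype too; a sketch (again with the
details from memory):

from rdkit import Chem
from rdkit.Chem import MACCSkeys

def passes_combined_screen(query, mol):
    # stage 1: MACCS keys
    qm, mm = MACCSkeys.GenMACCSKeys(query), MACCSkeys.GenMACCSKeys(mol)
    if (qm & mm).GetNumOnBits() != qm.GetNumOnBits():
        return False
    # stage 2: path fingerprint
    qp, mp = Chem.RDKFingerprint(query), Chem.RDKFingerprint(mol)
    return (qp & mp).GetNumOnBits() == qp.GetNumOnBits()

(In practice the fingerprints would be precomputed once per molecule,
of course.)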

-greg
