Hi Greg,

just a curiosity ...

765534 vs 765224

is one a subset of the other? If not, would it make sense to test on both?
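
One way to check this is to record which (query, molecule) pairs each screen rejects and compare the two sets. The sketch below is only an illustration: the function name `compare_screens` and the toy pairs are hypothetical, not taken from the actual benchmark.

```python
def compare_screens(rejected_a: set, rejected_b: set) -> dict:
    """Compare the (query_idx, mol_idx) pairs rejected by two fingerprint screens."""
    return {
        "a_subset_of_b": rejected_a <= rejected_b,
        "b_subset_of_a": rejected_b <= rejected_a,
        # A pair can be skipped if either screen rejects it, so the union
        # is what a combined screen would filter out.
        "rejected_by_either": len(rejected_a | rejected_b),
        "rejected_by_both": len(rejected_a & rejected_b),
    }


# Hypothetical toy data: pairs rejected by each screen.
rdk_rejected = {(0, 1), (0, 2), (1, 2), (2, 0)}
layered_rejected = {(0, 1), (1, 2), (2, 1)}
summary = compare_screens(rdk_rejected, layered_rejected)
```

If neither set contains the other, applying both screens rejects more pairs than either one alone, at the cost of storing and testing two fingerprints per molecule.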

Just a thought. Apart from that, I think the setup is reasonable for most
applications we will have ...

Nik




From: Greg Landrum <greg.land...@gmail.com>
Date: 10.02.2009 15:11
To: RDKit Discuss <rdkit-discuss@lists.sourceforge.net>
Subject: [Rdkit-discuss] Optimizing SSS in the RDKit






Dear all

Andrew's question about fingerprints hit me at the right time: I had
just finished doing some optimization work on the RDKit substructure
search machinery (removing the vflib dependency). The details are
here:
http://code.google.com/p/rdkit/wiki/SubgraphIsomorphismOptimization

It would be quite interesting to use the new Ullmann code as a
framework and do an implementation of the VF or VF2 algorithms used in
vflib.

Of course there's no better way to optimize subgraph isomorphism than
to avoid it altogether, which is where the fingerprints mentioned
come in. I'm spending a couple of days home from work (with a cold),
so I have some room to explore here a little bit.

I put together a sandbox using my 1000 pubchem molecules (they're from
the HTS set, so they are all either drug-like or lead-like, whatever
that means). To get a set of "molecule-like" queries, I fragmented
those 1000 molecules using RECAP and kept the 823 unique fragments I
got.

I've been using those 823 molecules to query the full set of 1000
molecules and looking at how many calls to the isomorphism code I can
avoid using either the RDKit (daylight-like) fingerprints or the
layered fingerprints (out to layer 0x4, beyond that these aren't
suitable for SSS).

The results look pretty encouraging: I can easily filter out more than
90% of the comparisons via fingerprints without losing anything. There
are 823000 (823x1000) possible comparisons with my dataset; using the
RDKit fingerprints as a screen I filter out 765534 of them (93%), and using
the layered fingerprints I filter out 765224 (also 93%). The screening
[not even remotely optimized, I'm calculating (A&B)==A instead of
doing it on the fly and short circuiting when something mismatches]
takes about 10 seconds in each case.
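
The two screening styles mentioned above can be sketched in plain Python, treating each fingerprint as an integer bit set. This is a minimal illustration under assumptions, not the RDKit implementation: the function names and the 64-bit word size are hypothetical choices.

```python
def passes_screen(query_fp: int, mol_fp: int) -> bool:
    """Whole-fingerprint test: every bit set in the query must be set in the molecule."""
    return (query_fp & mol_fp) == query_fp


def to_words(fp: int, n_bits: int = 2048, word_size: int = 64) -> list:
    """Split an integer fingerprint into fixed-size words, lowest bits first."""
    mask = (1 << word_size) - 1
    return [(fp >> shift) & mask for shift in range(0, n_bits, word_size)]


def passes_screen_short_circuit(query_words, mol_words) -> bool:
    """Word-by-word test that stops at the first word with a query bit missing."""
    for q, m in zip(query_words, mol_words):
        if q & ~m:  # some query bit in this word is not set in the molecule
            return False
    return True


# A pair that fails the screen cannot be a substructure match, so the
# isomorphism call can be skipped; a pair that passes still has to be verified.
query = (1 << 3) | (1 << 100)
mol = query | (1 << 7)
```

Both functions give the same answer; the second can return early on a mismatch instead of computing the full AND first.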

By default each fingerprint uses 2048 bits. I can shrink this by
folding the fingerprints (or generating them shorter in the first
place... the end result is the same). That potentially gains speed and
certainly saves storage space, but there may be a cost in how
discriminating the fingerprints are.
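
Folding can be sketched as OR-ing equal-sized segments of the fingerprint together. The helper below is a hypothetical pure-Python illustration on integer bit sets, not the RDKit routine.

```python
def fold(fp: int, n_bits: int, factor: int = 2) -> int:
    """Fold an n_bits fingerprint down to n_bits // factor bits by OR-ing segments."""
    if n_bits % factor:
        raise ValueError("n_bits must be divisible by factor")
    new_size = n_bits // factor
    mask = (1 << new_size) - 1
    folded = 0
    for i in range(factor):
        # Bit b of the original ends up at position b % new_size.
        folded |= (fp >> (i * new_size)) & mask
    return folded


fp = (1 << 0) | (1 << 1030) | (1 << 2047)
folded_1024 = fold(fp, 2048, factor=2)
folded_512 = fold(fp, 2048, factor=4)
```

Folding cannot make the screen lose a true match: if every query bit is set in the molecule, the same holds after folding both. What it can do is let more non-matching pairs through, which is the loss of discrimination described above.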

Experiment 1: reduce fps to 1024 bits
RDK fingerprints: filter out 717356 (87%)
Layered: filter out 752948 (91%)
No obvious speed improvement

Experiment 2: reduce fps to 512 bits
RDK fingerprints: filter out 441529 (54%)
Layered: filter out 710647 (86%)
10-15% faster

The layered fps are clearly more robust w.r.t. fingerprint size (which
makes sense: I only set one bit per path there as opposed to 4 per
path for the RDKit fp; a good experiment would be to try the RDKit fps
with one bit per path). They're also faster to generate (they no
longer require a PRNG).
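
The effect of bits per path on how full a fingerprint gets can be illustrated with a toy model. Everything here is an assumption made for illustration: paths are stand-in strings, and bit positions come from a SHA-256 hash rather than from the hashing or PRNG scheme the RDKit actually uses.

```python
import hashlib


def path_bits(path: str, n_bits: int, bits_per_path: int) -> list:
    """Derive bits_per_path deterministic bit positions for one path."""
    positions = []
    for i in range(bits_per_path):
        digest = hashlib.sha256(f"{path}:{i}".encode()).digest()
        positions.append(int.from_bytes(digest[:8], "big") % n_bits)
    return positions


def fingerprint(paths, n_bits: int, bits_per_path: int) -> int:
    """Build an integer fingerprint by setting the bits of every path."""
    fp = 0
    for p in paths:
        for pos in path_bits(p, n_bits, bits_per_path):
            fp |= 1 << pos
    return fp


def density(fp: int, n_bits: int) -> float:
    """Fraction of bits that are set."""
    return bin(fp).count("1") / n_bits


# Hypothetical molecule with 200 distinct paths, hashed into 512 bits.
paths = [f"path{i}" for i in range(200)]
one_bit = fingerprint(paths, 512, bits_per_path=1)
four_bits = fingerprint(paths, 512, bits_per_path=4)
```

With four bits per path a short fingerprint fills up faster than with one, and a nearly full molecule fingerprint passes almost every query, which is consistent with the larger drop seen for the RDKit fps at 512 bits.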

I think the screening speed thing is a bit of a red herring at the
moment since I'm not doing a smart screen, but there is a real impact
on storage space.

So what does "the community" think? Interesting results? Arguments
about my testing/benchmarking methodology? Obvious next steps?
Suggestions for improving the layered fps so that they're even more
discriminating?

-greg



