Re: [Rdkit-discuss] substructure search with fingerprints

Gonzalo Colmenarejo-Sanchez Wed, 29 May 2013 00:56:44 -0700

Sorry, I should have said that I use C++. I search a set of substructures 
(sometimes SMILES, sometimes SMARTS) against a large number of molecules.

Thanks a lot,

Gonzalo

From: Greg Landrum [mailto:[email protected]]
Sent: 29 May 2013 05:41
To: Gonzalo Colmenarejo-Sanchez
Cc: [email protected]
Subject: Re: [Rdkit-discuss] substructure search with fingerprints

Hi Gonzalo,

On Tue, May 28, 2013 at 5:00 PM, Gonzalo Colmenarejo-Sanchez 
<[email protected]> wrote:

What's the best way of doing fast (approximate) substructure searches in RDKit 
using fingerprints? I'm a bit confused about this topic. Any advice would be 
really appreciated.

The answer depends on what you want to do.

If you have one or more molecules and a single query and you want to know if 
the query matches any of the molecules, the fastest approach is just to do the 
substructure search directly (the time required to generate the fingerprints is 
larger than the time to do the individual search).
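
A minimal sketch of that single-query case, with no fingerprints involved. The 
molecules, the query, and the helper name matches_any are illustrative 
assumptions, not something from the original question:

```python
from rdkit import Chem


def matches_any(mols, query):
    """Return True as soon as one molecule contains the query substructure."""
    return any(m.HasSubstructMatch(query) for m in mols)


# Illustrative molecules (assumed for this sketch).
smiles = ['CCO', 'c1ccccc1O', 'c1ccncc1C']
mols = [Chem.MolFromSmiles(s) for s in smiles]

# The query can be built from SMILES or SMARTS; SMARTS is used here.
query = Chem.MolFromSmarts('c1ccncc1')

found = matches_any(mols, query)
```

Because any() stops at the first hit, this does no more substructure matches 
than necessary.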

If you have a set of molecules you would like to search through using multiple 
queries or a set that is relatively static that you'd be searching through more 
than once, you have a variety of options. I'm going to run through some of the 
options from Python. If you want to do the same thing in C++ or Java, I can 
provide a separate answer for that.

-----------------------------
1) Install postgresql and the RDKit postgresql cartridge and use that to do the 
searches. This is heavyweight, but gets you something that's flexible, 
relatively easy to use, and quite suited for dealing with millions of molecules.

-----------------------------
2) Give Riccardo's Chemicallite a try: 
http://www.mail-archive.com/[email protected]/msg03077.html
This "cartridge" for sqlite is still in development, but the early results 
that Riccardo shows look quite promising.

-----------------------------
3) Using the pandas integration in the new version of the RDKit, you can easily 
work with sets of molecules and do efficient substructure searches:
In [46]: from rdkit import Chem

In [47]: from rdkit.Chem import PandasTools

In [48]: df = PandasTools.LoadSDF('lopac_pubchem_28March07.sdf',includeFingerprints=True)

In [49]: len(df)
Out[49]: 1232

In [50]: q = Chem.MolFromSmiles('c1nnccc1')

In [51]: subset = df[df['ROMol']>=q]

In [52]: len(subset)
Out[52]: 6

If you want to use this set of molecules in later python sessions, you can save 
the dataframe using python's pickle module.

Needless to say, you'll need to have pandas installed (but it's great to have 
installed anyway).

-----------------------------
4) If you want to avoid installing anything extra, you can do the book-keeping 
and fingerprint tracking yourself with something like this:

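In [62]: from rdkit import Chem, DataStructs

In [63]: ms = [x for x in Chem.SDMolSupplier('lopac_pubchem_28March07.sdf') if x is not None]

In [64]: fps = [Chem.PatternFingerprint(x) for x in ms]

In [65]: def sss(ms,fps,q):
    res=[]
    qfp = Chem.PatternFingerprint(q)
    for i,fp in enumerate(fps):
        # cheap screen: every bit set in the query must be set in the molecule
        if DataStructs.AllProbeBitsMatch(qfp,fp):
            # the screen can give false positives, so confirm with a real match
            if ms[i].HasSubstructMatch(q):
                res.append(ms[i])
    return res
   ....: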

In [66]: subset=sss(ms,fps,Chem.MolFromSmiles('c1nnccc1'))

In [67]: len(subset)
Out[67]: 6
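
To make the logic of the screen in sss() explicit, here is a toy sketch that 
uses no RDKit at all. It assumes strings stand in for molecules, substring 
tests stand in for substructure matches, and a made-up character-based 
fingerprint (toy_fingerprint) stands in for the pattern fingerprint. The point 
is only that a bitwise "all query bits are set in the target" test can never 
reject a true match, but can let false positives through, which is why the 
full match is still needed afterwards:

```python
def toy_fingerprint(s, n_bits=64):
    """Set one bit per distinct character (a stand-in, not an RDKit fingerprint)."""
    fp = 0
    for ch in s:
        fp |= 1 << (ord(ch) % n_bits)
    return fp


def passes_screen(query_fp, target_fp):
    """True if every bit set in the query is also set in the target."""
    return (query_fp & target_fp) == query_fp


def screened_search(targets, fps, query):
    """Screen with fingerprints first, then verify the survivors exactly."""
    qfp = toy_fingerprint(query)
    hits = []
    n_verified = 0
    for target, fp in zip(targets, fps):
        if not passes_screen(qfp, fp):
            continue  # cannot match, so skip the expensive test
        n_verified += 1
        if query in target:  # the "real" match in this toy setting
            hits.append(target)
    return hits, n_verified


targets = ["CCO", "c1ccccc1", "OCC", "CCN"]
fps = [toy_fingerprint(t) for t in targets]
hits, n_verified = screened_search(targets, fps, "CO")
```

Here "OCC" passes the screen because it contains both characters of the query, 
but it fails the exact test, so only two of the four targets are ever verified 
and one of those is kept.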

You can pickle the lists ms and fps together to use them in later python 
sessions.
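
A minimal sketch of that, using only Python's pickle module. The helper names 
and the file path are assumptions made for illustration; ms and fps are the 
lists built above, stored as one tuple so they cannot get out of step:

```python
import pickle


def save_search_data(path, ms, fps):
    """Write the molecule and fingerprint lists to one file as a single tuple."""
    if len(ms) != len(fps):
        raise ValueError("ms and fps must have the same length")
    with open(path, "wb") as fh:
        pickle.dump((ms, fps), fh, protocol=pickle.HIGHEST_PROTOCOL)


def load_search_data(path):
    """Read back the (ms, fps) tuple written by save_search_data."""
    with open(path, "rb") as fh:
        ms, fps = pickle.load(fh)
    return ms, fps
```

In a later session, something like ms, fps = load_search_data('lopac.pkl') 
(file name hypothetical) gives back both lists, ready to pass to sss().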

Note that solutions 3) and 4) need to have all the molecules and fingerprints 
in memory at the same time, so dealing with large numbers of molecules this way 
will not be particularly efficient unless you have a *lot* of memory.

Does that help?
-greg


