Sorry, I should have said that I use C++. I search a bunch of substructures
(sometimes SMILES, sometimes SMARTS) against a lot of molecules.
Thanks a lot,
Gonzalo
From: Greg Landrum [mailto:[email protected]]
Sent: 29 May 2013 05:41
To: Gonzalo Colmenarejo-Sanchez
Cc: [email protected]
Subject: Re: [Rdkit-discuss] substructure search with fingerprints
Hi Gonzalo,
On Tue, May 28, 2013 at 5:00 PM, Gonzalo Colmenarejo-Sanchez
<[email protected]> wrote:
What's the best way of doing fast (approximate) substructure searches in RDKit
using fingerprints? I'm a bit confused about this topic. Any advice would be
really appreciated.
The answer depends on what you want to do.
If you have one or more molecules and a single query and you want to know if
the query matches any of the molecules, the fastest approach is just to do the
substructure search (the time required to generate the fingerprints is larger
than the time to do the individual search).
If you have a set of molecules you would like to search through using multiple
queries or a set that is relatively static that you'd be searching through more
than once, you have a variety of options. I'm going to run through some of the
options from Python. If you want to do the same thing in C++ or Java, I can
provide a separate answer for that.
-----------------------------
1) Install postgresql and the RDKit postgresql cartridge and use that to do the
searches. This is heavyweight, but gets you something that's flexible,
relatively easy to use, and well suited to dealing with millions of molecules.
-----------------------------
2) Give Riccardo's Chemicallite a try:
http://www.mail-archive.com/[email protected]/msg03077.html
This "cartridge" for sqlite is still in development, but the early results
that Riccardo shows look quite promising.
-----------------------------
3) Using the pandas integration in the new version of the RDKit, you can easily
work with sets of molecules and do efficient substructure searches:
In [47]: from rdkit.Chem import PandasTools
In [48]: df = PandasTools.LoadSDF('lopac_pubchem_28March07.sdf',includeFingerprints=True)
In [49]: len(df)
Out[49]: 1232
In [50]: q = Chem.MolFromSmiles('c1nnccc1')
In [51]: subset = df[df['ROMol']>=q]
In [52]: len(subset)
Out[52]: 6
If you want to use this set of molecules in later python sessions, you can save
the dataframe using python's pickle module.
Needless to say, you'll need to have pandas installed (but it's great to have
installed anyway).
-----------------------------
4) If you want to avoid installing anything extra, you can do the book-keeping
and fingerprint tracking yourself with something like this:
In [63]: ms = [x for x in Chem.SDMolSupplier('lopac_pubchem_28March07.sdf') if x is not None]
In [64]: fps = [Chem.PatternFingerprint(x) for x in ms]
In [65]: def sss(ms,fps,q):
   ....:     res=[]
   ....:     qfp = Chem.PatternFingerprint(q)
   ....:     for i,fp in enumerate(fps):
   ....:         if DataStructs.AllProbeBitsMatch(qfp,fp):
   ....:             if ms[i].HasSubstructMatch(q):
   ....:                 res.append(ms[i])
   ....:     return res
   ....:
In [66]: subset=sss(ms,fps,Chem.MolFromSmiles('c1nnccc1'))
In [67]: len(subset)
Out[67]: 6
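The two-step check in sss() works because the fingerprint test is a cheap screen: if any bit set in the query fingerprint is missing from a molecule's fingerprint, the molecule cannot match, and only the survivors need the exact search. The following is a toy sketch of that logic in plain Python, not RDKit code. The fingerprints are ordinary ints used as bitsets, the "molecules" are strings, a substring test stands in for HasSubstructMatch, and the names toy_fingerprint, all_probe_bits_match and screened_search are all made up for illustration.

```python
# Toy illustration of screen-then-verify; NOT RDKit code.
# Assumption: a substring test plays the role of the exact substructure match.

def toy_fingerprint(text, n_bits=64):
    """Hypothetical scheme: set one bit per adjacent character pair."""
    fp = 0
    for a, b in zip(text, text[1:]):
        fp |= 1 << ((ord(a) * 31 + ord(b)) % n_bits)
    return fp

def all_probe_bits_match(probe, target):
    """True if every bit set in probe is also set in target."""
    return (probe & target) == probe

def screened_search(items, fps, query):
    """Screen with fingerprints, then confirm survivors with the exact test."""
    qfp = toy_fingerprint(query)
    hits = []
    n_verified = 0  # how many items needed the exact (expensive) check
    for item, fp in zip(items, fps):
        if all_probe_bits_match(qfp, fp):   # cheap; may pass false positives
            n_verified += 1
            if query in item:               # exact check removes them
                hits.append(item)
    return hits, n_verified

items = ["c1ccccc1O", "CCO", "c1nnccc1C", "CCN"]
fps = [toy_fingerprint(x) for x in items]   # computed once, reused per query
hits, n_verified = screened_search(items, fps, "c1nn")
print(hits, n_verified)
```

The screen can never lose a true hit here, because every character pair in a matching query also occurs in the item. It can let false positives through (bit collisions), which is why the exact check stays in the loop.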
You can pickle the lists ms and fps together to use them in later python
sessions.
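A minimal sketch of that pickling step, using only the standard library. The helper names save_search_set and load_search_set and the file name in the usage note are hypothetical; the only assumption is that the two lists hold picklable objects.

```python
import pickle

def save_search_set(ms, fps, path):
    """Write the molecule list and fingerprint list to one file as a tuple."""
    with open(path, "wb") as fh:
        pickle.dump((ms, fps), fh, protocol=pickle.HIGHEST_PROTOCOL)

def load_search_set(path):
    """Read back the (ms, fps) tuple written by save_search_set."""
    with open(path, "rb") as fh:
        ms, fps = pickle.load(fh)
    return ms, fps
```

In a later session, something like ms, fps = load_search_set('lopac.pkl') would restore both lists in the same order, so the index i still pairs ms[i] with fps[i]. Storing them in one tuple keeps the two lists from drifting out of sync.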
Note that solutions 3) and 4) need to have all the molecules and fingerprints
in memory at the same time, so dealing with large numbers of molecules this way
will not be particularly efficient unless you have a *lot* of memory.
Does that help?
-greg