That's a good question.
I'm not a master of pandas indexing, but this seems to work:
In [5]: sdf['mfp2'] = [rdMolDescriptors.GetMorganFingerprintAsBitVect(x,2) for x in sdf['ROMol']]
In [8]: sims = DataStructs.BulkTanimotoSimilarity(qry,sdf['mfp2'])
In [13]: ids = [x for x,y in enumerate(sims) if y>0.5]
In [18]: ndf = sdf.iloc[ids]
In [19]: len(ndf)
Out[19]: 3
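For anyone who wants to try this outside of an IPython session, here is a minimal self-contained sketch of the same idea. The four toy molecules, the "name" column, the query (ethanol), and the helper name filter_bulk are all made up for illustration; they are not from the session above. I pass list(df['mfp2']) rather than the Series itself just to be explicit about handing RDKit a plain Python list.

```python
import pandas as pd
from rdkit import Chem, DataStructs
from rdkit.Chem import rdMolDescriptors

# Toy data (assumption: invented for this sketch, not the data set from the thread)
smiles = {"ethanol": "CCO", "propanol": "CCCO",
          "ethylamine": "CCN", "benzene": "c1ccccc1"}
df = pd.DataFrame({"name": list(smiles),
                   "ROMol": [Chem.MolFromSmiles(s) for s in smiles.values()]})

# Add a fingerprint column: Morgan fingerprint, radius 2, as a bit vector
df["mfp2"] = [rdMolDescriptors.GetMorganFingerprintAsBitVect(m, 2)
              for m in df["ROMol"]]

# Query fingerprint (assumption: ethanol chosen arbitrarily as the query)
qry = rdMolDescriptors.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles("CCO"), 2)

def filter_bulk(df, qry, threshold=0.5):
    """Keep the rows whose fingerprint has Tanimoto similarity > threshold to qry."""
    # One call computes the similarity of qry to every fingerprint in the list
    sims = DataStructs.BulkTanimotoSimilarity(qry, list(df["mfp2"]))
    # Positions (not index labels) of the rows that pass, hence .iloc below
    ids = [i for i, s in enumerate(sims) if s > threshold]
    return df.iloc[ids]

hits = filter_bulk(df, qry)
print(hits["name"].tolist())
```

Because the row positions come from enumerate(), .iloc is the right accessor here; .loc would only give the same answer if the index happened to be 0..n-1.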
The question is whether or not that's actually faster.
In [21]: def filt1(sdf,qry):
    ...:     sims = DataStructs.BulkTanimotoSimilarity(qry,sdf['mfp2'])
    ...:     ids = [x for x,y in enumerate(sims) if y>0.5]
    ...:     return sdf.iloc[ids]
    ...:
In [22]: def filt2(sdf,qry):
    ...:     return sdf[sdf.apply(lambda x:DataStructs.TanimotoSimilarity(x['mfp2'],qry)>0.5,axis=1)]
    ...:
In [25]: %timeit filt1(sdf,qry)
1 loop, best of 3: 458 ms per loop
In [28]: %timeit filt2(sdf,qry)
1 loop, best of 3: 798 ms per loop
And it certainly is: 458 ms for the bulk version against 798 ms for the row-wise apply, so roughly 1.7x faster.
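In case it helps to see what is being thresholded in both functions: the Tanimoto similarity of two bit-vector fingerprints is the number of bits set in both divided by the number of bits set in either. Here is a rough pure-Python sketch of that, with fingerprints represented as sets of on-bit positions. The function names, the toy bit sets, and the choice to return 0.0 when both sets are empty are assumptions made for this sketch only; this is not how RDKit implements it.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit positions."""
    a, b = set(a), set(b)
    union = len(a | b)
    # Convention for this sketch only: two empty fingerprints score 0.0
    return len(a & b) / union if union else 0.0

def bulk_tanimoto(query, fingerprints):
    """Similarity of one query against a whole list, in list order."""
    return [tanimoto(query, fp) for fp in fingerprints]

# Toy fingerprints (hypothetical on-bit positions)
query = {1, 4, 7, 9}
library = [{1, 4, 7, 9}, {1, 4, 8}, {2, 3}, {1, 4, 7}]

sims = bulk_tanimoto(query, library)
# Same filtering step as in filt1: keep positions with similarity > 0.5
hits = [i for i, s in enumerate(sims) if s > 0.5]
print(sims)   # [1.0, 0.4, 0.0, 0.75]
print(hits)   # [0, 3]
```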
-greg
On Wed, Nov 23, 2016 at 4:06 PM, Peter Gedeck <[email protected]> wrote:
> Is it possible to use the bulk similarity searching functionality for
> better performance instead of the list comprehension?
>
> Best,
>
> Peter
>
>
> On Wed, Nov 23, 2016 at 9:11 AM Greg Landrum <[email protected]>
> wrote:
>
> No worries.
> This, and Anna's question about similarity searching and clustering
> illustrate a great opportunity for a tutorial on fingerprints and
> similarity searching.
>
> -greg
>
>
>
>
>
> On Wed, Nov 23, 2016 at 3:00 PM +0100, "Chris Swain" <[email protected]>
> wrote:
>
> Thanks for this,
>
> As a chemist who comes from the “cut and paste” school of scripting I’m
> always concerned I’m asking something blindingly obvious
>
> ;-)
>
> Chris
>
> On 23 Nov 2016, at 12:36, Greg Landrum <[email protected]> wrote:
>
> [including rdkit-discuss, because it's relevant there and I'm pretty sure
> Chris won't mind and the real Pandas experts may have a better answer than
> me.]
>
> On Wed, Nov 23, 2016 at 9:51 AM, Chris Swain <[email protected]> wrote:
>
>
> I quite like storing molecules and associated data in a data frame, and
> I’ve seen that it is possible to use rdkit for substructure searching. Is it
> also possible to do similarity searching?
>
>
> It's not built in since there are many possible fingerprints that could be
> used.
>
> It's not quite as convenient as the substructure search, but here's a
> little demo of what you can do to filter based on similarity:
>
> # Start by adding a fingerprint column:
> In [18]: df['mfp2'] = [rdMolDescriptors.GetMorganFingerprintAsBitVect(x,2) for x in df['ROMol']]
>
> # and now filter:
> In [21]: ndf = df[df.apply(lambda x: DataStructs.TanimotoSimilarity(x['mfp2'],qry)>=0.7, axis=1)]
>
> In [23]: len(df)
> Out[23]: 1000
> In [24]: len(ndf)
> Out[24]: 2
>
> -greg
>
>
> ------------------------------------------------------------
> ------------------
> _______________________________________________
> Rdkit-discuss mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
>