Re: [Rdkit-discuss] Pandas

Greg Landrum Sat, 26 Nov 2016 23:57:08 -0800

On Sun, Nov 27, 2016 at 8:45 AM, Chris Swain <[email protected]> wrote:

> I added the similarity scores
> by adding an extra line,
>
> sdf['sim']=DataStructs.BulkTanimotoSimilarity(ionised_fps,sdf['mfp2'])
>


Yes, that's what I would have done.
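
To make explicit what ends up in that column, here is a minimal sketch in
plain Python. It is a stand-in only, not RDKit's implementation: each
"fingerprint" is modelled as the set of its on-bit indices, and the names
tanimoto, bulk_tanimoto and table are hypothetical. The point is that a
bulk call scores one query against every stored fingerprint, in order, so
the result lines up with the rows and can be assigned as a new column.

```python
# Pure-Python stand-in: a fingerprint is the set of its on-bit indices.
# This only illustrates the arithmetic; it is not how RDKit does it.

def tanimoto(a, b):
    """Tanimoto coefficient |a & b| / |a | b| for two sets of on-bits."""
    union = len(a | b)
    if union == 0:
        return 0.0  # assumption of this sketch: two empty sets score 0
    return len(a & b) / union


def bulk_tanimoto(query, fps):
    """Score one query against many fingerprints, preserving order."""
    return [tanimoto(query, fp) for fp in fps]


# Hypothetical toy table: a dict of equal-length columns.
table = {
    "name": ["a", "b", "c"],
    "fp": [frozenset({1, 2, 3, 4}), frozenset({1, 2}), frozenset({7, 8})],
}
query = frozenset({1, 2, 3, 4})

# One score per row, stored as an extra column next to the fingerprints.
table["sim"] = bulk_tanimoto(query, table["fp"])
print(table["sim"])
```

With a real data frame the assignment has the same shape: one query
fingerprint, one list of scores, one new column.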


> I don’t know if it could be done in a single line?
>

You don't know if what could be done as a single line?

-greg



> Chris
>
> On 26 Nov 2016, at 04:48, Greg Landrum <[email protected]> wrote:
>
> That's a good question.
>
> I'm not a master of pandas indexing, but this seems to work:
> In [5]: sdf['mfp2'] = [rdMolDescriptors.GetMorganFingerprintAsBitVect(x,2)
> for x in sdf['ROMol']]
> In [8]: sims = DataStructs.BulkTanimotoSimilarity(qry,sdf['mfp2'])
> In [13]: ids = [x for x,y in enumerate(sims) if y>0.5]
> In [18]: ndf = sdf.iloc[ids]
> In [19]: len(ndf)
> Out[19]: 3
>
> The question is whether or not that's actually faster.
>
> In [21]: def filt1(sdf,qry):
>     ...:     sims = DataStructs.BulkTanimotoSimilarity(qry,sdf['mfp2'])
>     ...:     ids = [x for x,y in enumerate(sims) if y>0.5]
>     ...:     return sdf.iloc[ids]
>     ...:
>
> In [22]: def filt2(sdf,qry):
>     ...:     return sdf[sdf.apply(
>     ...:         lambda x: DataStructs.TanimotoSimilarity(x['mfp2'],qry)>0.5,
>     ...:         axis=1)]
>     ...:
>
> In [25]: %timeit filt1(sdf,qry)
> 1 loop, best of 3: 458 ms per loop
> In [28]: %timeit filt2(sdf,qry)
> 1 loop, best of 3: 798 ms per loop
>
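> And it certainly is: the bulk version (filt1) takes 458 ms per loop,
> against 798 ms for the row-wise apply version (filt2).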
>
> -greg
>
>
>
> On Wed, Nov 23, 2016 at 4:06 PM, Peter Gedeck <[email protected]>
> wrote:
>
>> Is it possible to use the bulk similarity searching functionality for
>> better performance instead of the list comprehension?
>>
>> Best,
>>
>> Peter
>>
>>
>> On Wed, Nov 23, 2016 at 9:11 AM Greg Landrum <[email protected]>
>> wrote:
>>
>> No worries.
>> This, and Anna's question about similarity searching and clustering,
>> illustrate a great opportunity for a tutorial on fingerprints and
>> similarity searching.
>>
>> -greg
>>
>>
>>
>>
>>
>> On Wed, Nov 23, 2016 at 3:00 PM +0100, "Chris Swain" <[email protected]>
>> wrote:
>>
>> Thanks for this,
>>
>> As a chemist who comes from the “cut and paste” school of scripting I’m
>> always concerned I’m asking something blindingly obvious
>>
>> ;-)
>>
>> Chris
>>
>> On 23 Nov 2016, at 12:36, Greg Landrum <[email protected]> wrote:
>>
>> [including rdkit-discuss, because it's relevant there and I'm pretty sure
>> Chris won't mind and the real Pandas experts may have a better answer than
>> me.]
>>
>> On Wed, Nov 23, 2016 at 9:51 AM, Chris Swain <[email protected]> wrote:
>>
>>
>> I quite like storing molecules and associated data in a data frame and
>> I’ve seen that it is possible to use rdkit for substructure searching. Is
>> it also possible to do similarity searching?
>>
>>
>> It's not built in since there are many possible fingerprints that could
>> be used.
>>
>> It's not quite as convenient as the substructure search, but here's a
>> little demo of what you can do to filter based on similarity:
>>
>> # Start by adding a fingerprint column:
>> In [18]: df['mfp2'] = [rdMolDescriptors.GetMorganFingerprintAsBitVect(x,2)
>> for x in df['ROMol']]
>>
>> # and now filter:
>> In [21]: ndf = df[df.apply(lambda x:
>> DataStructs.TanimotoSimilarity(x['mfp2'],qry)>=0.7,
>> axis=1)]
>>
>> In [23]: len(df)
>> Out[23]: 1000
>> In [24]: len(ndf)
>> Out[24]: 2
>>
>> -greg
>>
>>
>> ------------------------------------------------------------------------------
>> _______________________________________________
>> Rdkit-discuss mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>
>>
>
>
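
The two filtering strategies compared in the quoted thread (score
everything in bulk and select rows by position, versus testing each row
with a predicate) can be sketched outside IPython roughly as follows. This
is a plain-Python stand-in under stated assumptions: fingerprints are sets
of on-bit indices, and the names tanimoto, filt_bulk, filt_rowwise and rows
are hypothetical. It shows that both strategies select the same rows and
how to time them with the standard timeit module in place of %timeit. The
timings it prints say nothing about the RDKit/pandas numbers quoted above.

```python
import timeit


def tanimoto(a, b):
    """Tanimoto coefficient for two sets of on-bit indices (stand-in)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def filt_bulk(rows, query, cutoff=0.5):
    """Score every row first, then keep rows by position (the filt1 shape)."""
    sims = [tanimoto(query, row["fp"]) for row in rows]
    ids = [i for i, s in enumerate(sims) if s > cutoff]
    return [rows[i] for i in ids]


def filt_rowwise(rows, query, cutoff=0.5):
    """Test each row with a predicate (the filt2 / apply shape)."""
    return [row for row in rows if tanimoto(row["fp"], query) > cutoff]


# Hypothetical toy data: row i has on-bits i .. i+7.
rows = [{"id": i, "fp": frozenset(range(i, i + 8))} for i in range(200)]
query = frozenset(range(8))

hits_bulk = filt_bulk(rows, query)
hits_rowwise = filt_rowwise(rows, query)

# Script equivalent of %timeit: total seconds for 20 calls of each filter.
t_bulk = timeit.timeit(lambda: filt_bulk(rows, query), number=20)
t_rowwise = timeit.timeit(lambda: filt_rowwise(rows, query), number=20)
print(f"bulk: {t_bulk:.4f} s, row-wise: {t_rowwise:.4f} s")
```

Checking that the two results agree before timing them is worthwhile,
since a speed comparison only means something if both filters return the
same rows.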
