On Sun, Nov 27, 2016 at 8:45 AM, Chris Swain <sw...@mac.com> wrote:

> I added the similarity scores by adding an extra line,
>
> sdf['sim']=DataStructs.BulkTanimotoSimilarity(ionised_fps,sdf['mfp2'])
>
Yes, that's what I would have done.

> I don’t know if it could be done in a single line?
>
You don't know if what could be done as a single line?

-greg

> Chris
>
> On 26 Nov 2016, at 04:48, Greg Landrum <greg.land...@gmail.com> wrote:
>
> That's a good question.
>
> I'm not a master of pandas indexing, but this seems to work:
> In [5]: sdf['mfp2'] = [rdMolDescriptors.GetMorganFingerprintAsBitVect(x,2) for x in sdf['ROMol']]
> In [8]: sims = DataStructs.BulkTanimotoSimilarity(qry,sdf['mfp2'])
> In [13]: ids = [x for x,y in enumerate(sims) if y>0.5]
> In [18]: ndf = sdf.iloc[ids]
> In [19]: len(ndf)
> Out[19]: 3
>
> The question is whether or not that's actually faster.
>
> In [21]: def filt1(sdf,qry):
>     ...:     sims = DataStructs.BulkTanimotoSimilarity(qry,sdf['mfp2'])
>     ...:     ids = [x for x,y in enumerate(sims) if y>0.5]
>     ...:     return sdf.iloc[ids]
>     ...:
>
> In [22]: def filt2(sdf,qry):
>     ...:     return sdf[sdf.apply(lambda x:DataStructs.TanimotoSimilarity(x['mfp2'],qry)>0.5,axis=1)]
>     ...:
>
> In [25]: %timeit filt1(sdf,qry)
> 1 loop, best of 3: 458 ms per loop
> In [28]: %timeit filt2(sdf,qry)
> 1 loop, best of 3: 798 ms per loop
>
> And it certainly is.
>
> -greg
>
> On Wed, Nov 23, 2016 at 4:06 PM, Peter Gedeck <peter.ged...@gmail.com> wrote:
>
>> Is it possible to use the bulk similarity searching functionality for
>> better performance instead of the list comprehension?
>>
>> Best,
>>
>> Peter
>>
>> On Wed, Nov 23, 2016 at 9:11 AM Greg Landrum <greg.land...@gmail.com> wrote:
>>
>> No worries.
>> This, and Anna's question about similarity searching and clustering,
>> illustrate a great opportunity for a tutorial on fingerprints and
>> similarity searching.
>>
>> -greg
>>
>> On Wed, Nov 23, 2016 at 3:00 PM +0100, "Chris Swain" <sw...@mac.com> wrote:
>>
>> Thanks for this,
>>
>> As a chemist who comes from the “cut and paste” school of scripting I’m
>> always concerned I’m asking something blindingly obvious
>>
>> ;-)
>>
>> Chris
>>
>> On 23 Nov 2016, at 12:36, Greg Landrum <greg.land...@gmail.com> wrote:
>>
>> [including rdkit-discuss, because it's relevant there and I'm pretty sure
>> Chris won't mind and the real Pandas experts may have a better answer than
>> me.]
>>
>> On Wed, Nov 23, 2016 at 9:51 AM, Chris Swain <sw...@mac.com> wrote:
>>
>> I quite like storing molecules and associated data in a data frame and
>> I’ve seen that it is possible to use rdkit for substructure searching; is it
>> also possible to do similarity searching?
>>
>> It's not built in since there are many possible fingerprints that could
>> be used.
>>
>> It's not quite as convenient as the substructure search, but here's a
>> little demo of what you can do to filter based on similarity:
>>
>> # Start by adding a fingerprint column:
>> In [18]: df['mfp2'] = [rdMolDescriptors.GetMorganFingerprintAsBitVect(x,2) for x in df['ROMol']]
>>
>> # and now filter:
>> In [21]: ndf = df[df.apply(lambda x: DataStructs.TanimotoSimilarity(x['mfp2'],qry)>=0.7, axis=1)]
>>
>> In [23]: len(df)
>> Out[23]: 1000
>> In [24]: len(ndf)
>> Out[24]: 2
>>
>> -greg
>>
>> ------------------------------------------------------------------------------
>> _______________________________________________
>> Rdkit-discuss mailing list
>> Rdkit-discuss@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
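
For anyone reading the thread without RDKit to hand, here is a minimal pure-Python sketch of what the "bulk similarity, then filter by index" approach computes. It is an illustration only, not RDKit's implementation: fingerprints are modelled as frozensets of on-bit indices, and the names tanimoto, bulk_tanimoto and filter_by_similarity are hypothetical helpers made up for this example. Returning 0.0 when both fingerprints are empty is also an assumption of the sketch.

```python
# Illustration only: fingerprints are frozensets of "on" bit indices,
# standing in for RDKit bit vectors. All function names are hypothetical.

def tanimoto(a, b):
    """Tanimoto similarity: shared on-bits divided by total distinct on-bits."""
    union = len(a | b)
    if union == 0:
        return 0.0  # assumption of this sketch for two empty fingerprints
    return len(a & b) / union


def bulk_tanimoto(query, fps):
    """Score one query against many fingerprints in a single pass."""
    return [tanimoto(query, fp) for fp in fps]


def filter_by_similarity(records, query, threshold=0.5, key="fp"):
    """Compute all scores first, then keep the rows above the threshold.

    Each kept row is copied with an extra 'sim' entry, analogous to adding
    a similarity column to a data frame.
    """
    sims = bulk_tanimoto(query, [r[key] for r in records])
    return [dict(r, sim=s) for r, s in zip(records, sims) if s > threshold]


records = [
    {"name": "a", "fp": frozenset({1, 2, 3, 4})},
    {"name": "b", "fp": frozenset({1, 2, 9})},
    {"name": "c", "fp": frozenset({7, 8})},
]
query = frozenset({1, 2, 3})

hits = filter_by_similarity(records, query, threshold=0.5)
for hit in hits:
    print(hit["name"], hit["sim"])
```

With a strict "greater than 0.5" cutoff, as in filt1 above, a row scoring exactly 0.5 is dropped; use >= if the boundary should be included, as in the original >=0.7 filter.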