Peter,

If you have chemfp and can build a chemfp arena, RDKit now supports reading and searching these structures. This is by far the fastest approach to similarity searching that I know of. I believe that Greg's implementation is compatible with chemfp 1.0, which is available on PyPI:

https://pypi.python.org/pypi/chemfp/1.0

In my copious spare time, I've been trying to think of ways to embed this directly in a pandas DataFrame; however, using the two side by side is certainly doable.

Cheers,
Brian

On Wed, Nov 23, 2016 at 10:06 AM, Peter Gedeck <peter.ged...@gmail.com> wrote:

> Is it possible to use the bulk similarity searching functionality for
> better performance instead of the list comprehension?
>
> Best,
>
> Peter
>
> On Wed, Nov 23, 2016 at 9:11 AM Greg Landrum <greg.land...@gmail.com> wrote:
>
> No worries.
> This, and Anna's question about similarity searching and clustering,
> illustrate a great opportunity for a tutorial on fingerprints and
> similarity searching.
>
> -greg
>
> On Wed, Nov 23, 2016 at 3:00 PM +0100, "Chris Swain" <sw...@mac.com> wrote:
>
> Thanks for this,
>
> As a chemist who comes from the “cut and paste” school of scripting I’m
> always concerned I’m asking something blindingly obvious
>
> ;-)
>
> Chris
>
> On 23 Nov 2016, at 12:36, Greg Landrum <greg.land...@gmail.com> wrote:
>
> [including rdkit-discuss, because it's relevant there and I'm pretty sure
> Chris won't mind and the real Pandas experts may have a better answer than
> me.]
>
> On Wed, Nov 23, 2016 at 9:51 AM, Chris Swain <sw...@mac.com> wrote:
>
> I quite like storing molecules and associated data in a data frame, and
> I’ve seen that it is possible to use rdkit for substructure searching. Is
> it also possible to do similarity searching?
>
> It's not built in, since there are many possible fingerprints that could be
> used.
>
> It's not quite as convenient as the substructure search, but here's a
> little demo of what you can do to filter based on similarity:
>
> # Start by adding a fingerprint column:
> In [18]: df['mfp2'] = [rdMolDescriptors.GetMorganFingerprintAsBitVect(x,2) for x in df['ROMol']]
>
> # and now filter:
> In [21]: ndf = df[df.apply(lambda x: DataStructs.TanimotoSimilarity(x['mfp2'],qry)>=0.7, axis=1)]
>
> In [23]: len(df)
> Out[23]: 1000
> In [24]: len(ndf)
> Out[24]: 2
>
> -greg
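
For reference, here is a minimal sketch of the bulk approach Peter asks about, using RDKit's DataStructs.BulkTanimotoSimilarity in place of the per-row lambda. The five SMILES, the phenol query, and the column name 'sim' are made up for illustration and are not from the thread; the 0.7 cutoff and the radius-2 Morgan fingerprint follow Greg's demo. The fingerprints are kept in a plain list next to the DataFrame (the "side by side" idea), so only the float similarities go into a column.

```python
import pandas as pd
from rdkit import Chem, DataStructs
from rdkit.Chem import rdMolDescriptors

# Hypothetical toy data; a real frame would come from PandasTools or similar.
df = pd.DataFrame({"smiles": ["CCO", "CCN", "c1ccccc1", "c1ccccc1O", "CC(=O)O"]})
mols = [Chem.MolFromSmiles(s) for s in df["smiles"]]

# One Morgan (radius 2) bit-vector fingerprint per row, in row order.
fps = [rdMolDescriptors.GetMorganFingerprintAsBitVect(m, 2) for m in mols]

# Hypothetical query molecule: phenol, written differently from the row above.
qry = rdMolDescriptors.GetMorganFingerprintAsBitVect(
    Chem.MolFromSmiles("Oc1ccccc1"), 2
)

# A single call scores the query against every fingerprint in the list.
sims = DataStructs.BulkTanimotoSimilarity(qry, fps)

# Store the scores and filter with an ordinary boolean mask.
df["sim"] = sims
hits = df[df["sim"] >= 0.7]
print(hits[["smiles", "sim"]])
```

Whether this is noticeably faster than the df.apply version will depend on the size of the frame; it avoids one Python-level function call per row, but it is not a substitute for a chemfp arena on large collections.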
------------------------------------------------------------------------------
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss