Hi Anna,

On Tue, Nov 22, 2016 at 10:34 AM, Anna Lena Wölke <annalenawoe...@googlemail.com> wrote:
> I want to cluster a set of molecules and then search for similar molecules
> in a different set. What commands do I need?

The RDKit cookbook includes some information about one clustering strategy (Butina clustering), which is pretty good at dealing with large groups of molecules:
http://rdkit.org/docs/Cookbook.html#clustering-molecules

Another approach would be to generate the distance matrix yourself (the code sample at that link shows how to calculate distances) and then use one of the many methods available in scikit-learn (http://scikit-learn.org/stable/modules/clustering.html). Note that scikit-learn probably expects the distance matrix in a different form than the RDKit's Butina clustering code does.

Both of these assume that you have fingerprints to calculate similarity from. Here's the section of the documentation on the fingerprinting functions:
http://rdkit.org/docs/GettingStartedInPython.html#fingerprinting-and-molecular-similarity

To find similar molecules in a different set, you could use the same fingerprinting functions and then use the BulkTanimotoSimilarity function used in the cookbook example. That's assuming that there aren't too many molecules in the other set. If you have more than 100K or so fingerprints to search through, then you'd probably want to use a different approach, which I can explain if it's relevant.

-greg
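
P.S. In case it helps, here is a rough sketch of how the pieces above might fit together. It is only an illustration under a few assumptions: the molecules are given as valid SMILES strings (the example molecules are made up), Chem.RDKFingerprint is used as the fingerprint (any of the fingerprinting functions from the documentation should work the same way), and the 0.4 clustering cutoff and 0.5 similarity threshold are arbitrary values you would need to tune. The helper names (fingerprints, lower_triangle_distances, to_square, search) are mine, not part of the RDKit.

```python
from rdkit import Chem, DataStructs
from rdkit.ML.Cluster import Butina


def fingerprints(smiles_list):
    """Build one RDKit topological fingerprint per SMILES (assumes all SMILES parse)."""
    return [Chem.RDKFingerprint(Chem.MolFromSmiles(smi)) for smi in smiles_list]


def lower_triangle_distances(fps):
    """Tanimoto distances as the flat lower-triangle list that Butina.ClusterData takes."""
    dists = []
    for i in range(1, len(fps)):
        # Similarity of fingerprint i to every earlier fingerprint 0..i-1
        sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i])
        dists.extend(1.0 - s for s in sims)
    return dists


def to_square(dists, n):
    """Expand the flat lower triangle into a full symmetric n x n matrix (list of lists)."""
    square = [[0.0] * n for _ in range(n)]
    k = 0
    for i in range(1, n):
        for j in range(i):
            square[i][j] = square[j][i] = dists[k]
            k += 1
    return square


def search(query_fp, fps, threshold=0.5):
    """Return (index, similarity) pairs at or above threshold, most similar first."""
    sims = DataStructs.BulkTanimotoSimilarity(query_fp, fps)
    hits = [(i, s) for i, s in enumerate(sims) if s >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Made-up example data: the set to cluster and the "other" set to search
library_smiles = ["CCO", "CCCO", "CCCCO", "c1ccccc1", "Cc1ccccc1", "CCc1ccccc1"]
other_smiles = ["CCCCCO", "CCCc1ccccc1", "CC(=O)O"]

library_fps = fingerprints(library_smiles)
other_fps = fingerprints(other_smiles)

# Butina clustering on the flat distance list; cutoff is a distance threshold
dists = lower_triangle_distances(library_fps)
clusters = Butina.ClusterData(dists, len(library_fps), 0.4, isDistData=True)
print("clusters (tuples of indices into library_smiles):", clusters)

# For each cluster, use its first member (the centroid) to query the other set
for cluster in clusters:
    centroid = cluster[0]
    hits = search(library_fps[centroid], other_fps)
    print(library_smiles[centroid], "->", [(other_smiles[i], round(s, 2)) for i, s in hits])
```

The to_square helper is there for the scikit-learn route: the clustering methods that accept a precomputed distance matrix generally want the full square form, for example via metric="precomputed", so check the documentation of the particular method you pick.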