Hi Anna,

On Tue, Nov 22, 2016 at 10:34 AM, Anna Lena Wölke <
annalenawoe...@googlemail.com> wrote:

>
> I want to cluster a set of molecules and then search for similar molecules
> in a different set. What commands do I need?
>

The RDKit cookbook includes some information about one clustering strategy
(Butina clustering) which is pretty good at dealing with large groups of
molecules:
http://rdkit.org/docs/Cookbook.html#clustering-molecules
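
As a rough sketch of what that Cookbook recipe looks like in practice (the SMILES, the choice of RDKit topological fingerprints, and the 0.4 distance cutoff are illustrative assumptions, not recommendations):

```python
from rdkit import Chem, DataStructs
from rdkit.ML.Cluster import Butina


def butina_cluster(fps, cutoff=0.4):
    """Cluster fingerprints with Butina; cutoff is a Tanimoto *distance*."""
    # Butina.ClusterData wants the lower triangle of the distance
    # matrix as one flat list: d(1,0), d(2,0), d(2,1), d(3,0), ...
    dists = []
    for i in range(1, len(fps)):
        sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i])
        dists.extend(1.0 - s for s in sims)
    return Butina.ClusterData(dists, len(fps), cutoff, isDistData=True)


# Made-up example molecules; replace with your own set.
smiles = ["CCO", "CCCO", "CCCCO", "c1ccccc1", "c1ccccc1C", "c1ccccc1CC"]
mols = [Chem.MolFromSmiles(s) for s in smiles]
fps = [Chem.RDKFingerprint(m) for m in mols]

# Each cluster is a tuple of indices into `fps`; the first index
# in each tuple is the cluster centroid.
clusters = butina_cluster(fps, cutoff=0.4)
for c in clusters:
    print([smiles[i] for i in c])
```

A smaller cutoff gives tighter (and more numerous) clusters; what value is sensible depends on the fingerprint you pick.
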
Another approach would be to generate the distance matrix yourself (the
code sample in the Cookbook section linked above shows how to calculate
distances) and then use one of the many clustering methods available in
scikit-learn:
http://scikit-learn.org/stable/modules/clustering.html
Note that scikit-learn probably expects the distance matrix in a different
form than the RDKit's Butina clustering code does.

Both of these approaches assume that you have fingerprints to calculate
similarity from. Here's the section of the documentation on the
fingerprinting functions:
http://rdkit.org/docs/GettingStartedInPython.html#fingerprinting-and-molecular-similarity
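
To illustrate the difference in matrix form: the Butina code takes a flat lower-triangle list, while scikit-learn estimators that accept metric="precomputed" take a full square n x n array. Here is a minimal sketch, assuming NumPy and scikit-learn are installed; the choice of DBSCAN and the eps/min_samples values are arbitrary and only for illustration:

```python
import numpy as np
from rdkit import Chem, DataStructs
from sklearn.cluster import DBSCAN


def square_distance_matrix(fps):
    """Full symmetric Tanimoto distance matrix as an (n, n) NumPy array."""
    n = len(fps)
    dm = np.zeros((n, n))
    for i in range(1, n):
        sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i])
        dm[i, :i] = 1.0 - np.array(sims)
        dm[:i, i] = dm[i, :i]  # mirror into the upper triangle
    return dm


# Made-up example molecules; replace with your own set.
smiles = ["CCO", "CCCO", "CCCCO", "c1ccccc1", "c1ccccc1C", "c1ccccc1CC"]
fps = [Chem.RDKFingerprint(Chem.MolFromSmiles(s)) for s in smiles]

dm = square_distance_matrix(fps)

# metric="precomputed" tells scikit-learn that dm already holds distances.
# A label of -1 means DBSCAN treated that molecule as noise.
labels = DBSCAN(eps=0.4, min_samples=2, metric="precomputed").fit_predict(dm)
print(labels)
```

The square matrix needs memory proportional to n squared, so this is only practical for moderately sized sets.
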

To find similar molecules in a different set you could use the same
fingerprinting functions and then the BulkTanimotoSimilarity function
used in the Cookbook example linked above. That's assuming that there aren't too many
molecules in the other set. If you have more than 100K or so fingerprints
to search through, then you'd probably want to use a different approach,
which I can explain if it's relevant.
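
For a small target set, that search could look something like the sketch below. The helper name find_similar, the SMILES, and the 0.3 threshold are hypothetical and chosen only for illustration; the query could just as well be a cluster centroid from the first set:

```python
from rdkit import Chem, DataStructs


def find_similar(query_fp, target_fps, threshold=0.7):
    """Return (index, similarity) pairs at or above threshold, best first."""
    sims = DataStructs.BulkTanimotoSimilarity(query_fp, target_fps)
    hits = [(i, s) for i, s in enumerate(sims) if s >= threshold]
    return sorted(hits, key=lambda h: h[1], reverse=True)


# Made-up query and target molecules; use the same fingerprint type
# for both sets so the similarities are meaningful.
query_fp = Chem.RDKFingerprint(Chem.MolFromSmiles("c1ccccc1O"))
target_smiles = ["c1ccccc1O", "c1ccccc1N", "CCCCCC"]
target_fps = [Chem.RDKFingerprint(Chem.MolFromSmiles(s)) for s in target_smiles]

hits = find_similar(query_fp, target_fps, threshold=0.3)
for idx, sim in hits:
    print(target_smiles[idx], round(sim, 3))
```

This is a linear scan over the target fingerprints, so its cost grows with the size of the other set.
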

-greg
------------------------------------------------------------------------------
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss