Thanks for this, Greg. In my experience, Butina (sphere exclusion)
clustering produces more coherent clusters but adapts worse to small
variations within a given chemotype. Complete linkage is more flexible and
robust, and it is the one I prefer, although it is slower. By the way, what
is the difference between ML.Clustering and Chem.Fingerprints.ClusterMols?

Thanks

Gonzalo

On Tue, Jul 19, 2016 at 1:44 PM, Greg Landrum <greg.land...@gmail.com>
wrote:

> Hi Gonzalo,
>
>
> On Mon, Jul 18, 2016 at 9:54 AM, Gonzalo Colmenarejo <
> colmenarejo.gonz...@gmail.com> wrote:
>
>>
>> I have succeeded in running a clustering of a set of molecules with the
>> Complete Link Hierarchical clustering algorithm in RDKit. However, what I
>> obtain is a cluster hierarchy object. I'd like to figure out now how to
>> assign molecules to clusters for a particular similarity cutoff in the
>> Complete Link algorithm (rather than provide the system with the number of
>> clusters). Does anyone know how to do it?
>>
>
> That's a good question, and one I had to think about for a bit in order to
> come up with an answer.
>
> Here's a notebook showing how I solved the problem:
> https://gist.github.com/greglandrum/6ff63e602b33d3c90d5b41325a4791ce
>
> The key is to know that the Cluster object's GetMetric() method returns
> whatever the merge metric was for that particular cluster. For Complete
> Linkage this corresponds to the largest distance (lowest similarity)
> between points in the cluster. You can recurse through the cluster tree
> using GetMetric() to pick out the sub-trees that are within your desired
> cutoff value (this is the look() function in my notebook). Recursing
> through those trees to get the leaves (the get_leaves() function in my
> notebook) allows you to get the indices of the molecules.
>
> This is likely to turn into an RDKit blog post (probably comparing the
> sk-learn clustering with the RDKit clustering); it's an interesting little
> problem and the solution could be pretty useful for comparing the output of
> hierarchical methods with things like Butina clustering.
>
> Best,
> -greg
>
>
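For reference, here is a minimal sketch of the recursion described in the quoted message. It is not the code from the notebook. The Node class is a hypothetical stand-in for a cluster-tree node so that the example runs without RDKit, and the function names subtrees_within(), leaf_indices() and clusters_at_cutoff() are made up for illustration. Only GetMetric() is named in the message; GetChildren(), IsTerminal() and GetIndex() are assumed to behave as their names suggest. The cutoff is a distance, so a similarity cutoff s would correspond to 1 - s if the tree was built from 1 - similarity distances (an assumption).

```python
class Node:
    """Hypothetical stand-in for a cluster-tree node (not RDKit's class)."""

    def __init__(self, metric=0.0, children=(), index=None):
        self._metric = metric            # distance at which this node was merged
        self._children = list(children)  # empty for a leaf
        self._index = index              # point index, only meaningful for a leaf

    def GetMetric(self):
        return self._metric

    def GetChildren(self):
        return self._children

    def IsTerminal(self):
        return not self._children

    def GetIndex(self):
        return self._index


def subtrees_within(node, cutoff):
    """Return the topmost sub-trees whose merge metric is <= cutoff."""
    # A leaf is always a valid (singleton) cluster.
    if node.IsTerminal() or node.GetMetric() <= cutoff:
        return [node]
    found = []
    for child in node.GetChildren():
        found.extend(subtrees_within(child, cutoff))
    return found


def leaf_indices(node):
    """Collect the point indices stored on the leaves below a node."""
    if node.IsTerminal():
        return [node.GetIndex()]
    indices = []
    for child in node.GetChildren():
        indices.extend(leaf_indices(child))
    return indices


def clusters_at_cutoff(root, cutoff):
    """Cut the tree at a distance cutoff and return lists of point indices."""
    return [sorted(leaf_indices(n)) for n in subtrees_within(root, cutoff)]


# Toy tree: points 0 and 1 merge at distance 0.2, point 2 joins at 0.7.
root = Node(0.7, [Node(0.2, [Node(index=0), Node(index=1)]), Node(index=2)])
clusters = clusters_at_cutoff(root, 0.35)
print(clusters)
```

With a real RDKit tree the same two functions should apply to the root cluster, but the indexing convention of the leaf nodes (0-based or 1-based) needs to be checked before mapping the indices back to molecules.
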
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss