Re: [Rdkit-discuss] Butina clustering with additional output

Peter S. Shenkin Wed, 26 Sep 2018 11:27:40 -0700

Ah, David, but how do you define a "real" singleton?

-P.


On Wed, Sep 26, 2018 at 1:30 PM David Cosgrove <davidacosgrov...@gmail.com>
wrote:

> Slightly off topic, but a minor issue with the Taylor-Butina algorithm is
> that it generates “false singletons”. These are molecules just outside the
> clustering cutoff that are stranded when their neighbours are put in a
> different, larger cluster. We used to find it convenient to have a sweep of
> these, at a slightly looser cutoff, and drop them into the cluster whose
> centroid/seed they were nearest to. This could be added to Andrew’s code
> quite easily. At the very least, it’s worth keeping track of the initial
> number of neighbours within the cluster cutoff that each fingerprint had so
> as to distinguish real singletons from these artefactual ones.
> Dave
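
The sweep Dave describes might look roughly like the sketch below. It is only an illustration under stated assumptions, not Andrew's code: the function name, the `loose_cutoff` parameter, the "seed index first" cluster layout and the full distance matrix are all made up for the example. It also assumes that only singletons with at least one neighbour at the original cutoff are swept, so real singletons stay put.

```python
def sweep_false_singletons(clusters, dist, cutoff, loose_cutoff):
    """Reassign "false singletons" to the nearest cluster seed.

    clusters     -- list of lists of indices, seed/centroid first (assumed layout)
    dist         -- full symmetric distance matrix as a list of lists
    cutoff       -- distance cutoff used for the original clustering
    loose_cutoff -- slightly looser cutoff used only for this sweep (hypothetical)
    """
    n = len(dist)
    # Initial neighbour count of every fingerprint at the clustering cutoff,
    # i.e. before any fingerprint was claimed by a cluster.
    neighbour_count = [
        sum(1 for j in range(n) if j != i and dist[i][j] <= cutoff)
        for i in range(n)
    ]
    grown = [list(c) for c in clusters if len(c) > 1]
    seeds = [c[0] for c in grown]  # fixed before the sweep starts
    leftover = []
    for c in clusters:
        if len(c) > 1:
            continue
        i = c[0]
        # A singleton that had neighbours initially is a candidate false singleton.
        if neighbour_count[i] > 0 and seeds:
            best = min(range(len(seeds)), key=lambda k: dist[i][seeds[k]])
            if dist[i][seeds[best]] <= loose_cutoff:
                grown[best].append(i)
                continue
        leftover.append([i])  # real singleton, or nothing close enough
    return grown + leftover, neighbour_count


# Toy data: points on a line stand in for fingerprints.
coords = [20, 16, 17, 18, 22, 24, 25, 100]
dist = [[abs(a - b) for b in coords] for a in coords]
new_clusters, counts = sweep_false_singletons(
    [[0, 1, 2, 3, 4, 5], [6], [7]], dist, cutoff=4, loose_cutoff=5
)
print(new_clusters)
print(counts)
```

Returning the initial neighbour counts alongside the clusters covers the "at the very least" suggestion: a leftover singleton with a count of zero never had a neighbour, while one with a nonzero count was stranded.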
>
>
> On Tue, 25 Sep 2018 at 19:56, Peter S. Shenkin <shen...@gmail.com> wrote:
>
>> Well, I'm not really familiar with the Taylor-Butina clustering method,
>> so I'm proposing a methodology based on generalizing something that I found
>> to be useful in a somewhat different clustering context.
>>
>> Presuming that what you are clustering is the fingerprints of structures,
>> and that you know which structures are in each cluster, you'd compute the
>> average of all the fingerprints. That is, each bit position would be given
>> a floating point number that is the average of the 0s and 1s at that
>> position computed over the structures in the cluster.  Then you'd compute
>> the distance (say, Manhattan or Euclidean) between the fingerprint of each
>> structure in the cluster and the average so computed. The "most
>> representative structure" would be the cluster member whose fingerprint is
>> closest to the cluster's average fingerprint. (Some additional mileage
>> could be gained by seeing just how far away from the average the "most
>> representative structures" are. A representative might be closer to the
>> average for some clusters than for others.)
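
A minimal sketch of this "closest to the average fingerprint" idea, assuming each fingerprint is available as a plain list of 0/1 values of equal length and using the Manhattan distance. The function name and the toy data are made up for illustration; nothing here is RDKit API.

```python
def most_representative(fps):
    """Return (index, distances) for one cluster.

    fps -- list of equal-length lists of 0/1 bits, one per cluster member.
    The index points at the member closest to the per-bit average;
    distances holds every member's Manhattan distance to that average.
    """
    n = len(fps)
    nbits = len(fps[0])
    # Per-bit average over the cluster: a float between 0 and 1 for each bit.
    mean = [sum(fp[b] for fp in fps) / n for b in range(nbits)]
    # Manhattan distance between each member and the average fingerprint.
    dists = [sum(abs(fp[b] - mean[b]) for b in range(nbits)) for fp in fps]
    best = min(range(n), key=dists.__getitem__)
    return best, dists


cluster_fps = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 0, 1, 1],
]
best, dists = most_representative(cluster_fps)
print(best, dists)
```

The returned distances also show how far each representative sits from its cluster average, which allows the comparison across clusters mentioned in the parenthetical.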
>>
>> It would make sense to try this (since it's easy enough) and see whether
>> the resulting "most representative structures" from the clusters really are
>> at least roughly representative, by comparing them with viewable random
>> subsets of structures from the clusters.
>>
>> -P.
>>
>> On Tue, Sep 25, 2018 at 2:36 PM, Andrew Dalke <da...@dalkescientific.com>
>> wrote:
>>
>>> On Sep 25, 2018, at 17:13, Peter S. Shenkin <shen...@gmail.com> wrote:
>>> > FWIW, in work on conformational clustering, I used the “most
>>> representative” molecule; that is, the real molecule closest to the
>>> mathematical centroid. This would probably be the best way of displaying a
>>> single molecule that typifies what is in the cluster.
>>>
>>> In some sense I'm rephrasing Chris Earnshaw's earlier question - how
>>> does one do that with Taylor-Butina clustering? And does it make sense?
>>>
>>> The algorithm starts by picking a centroid based on the fingerprints
>>> with the highest number of neighbors, so none of the other cluster members
>>> should have more neighbors within that cutoff.
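
For readers who have not seen the algorithm, the centroid selection described above can be sketched as follows. This is a simplified illustration that assumes a full distance matrix and a single sort by neighbour count. The function name is made up, and this is neither Andrew's code nor RDKit's own implementation (which lives in `rdkit.ML.Cluster.Butina`).

```python
def butina_sketch(dist, cutoff):
    """Simplified Taylor-Butina clustering.

    dist   -- full symmetric distance matrix as a list of lists
    cutoff -- two items are neighbours when their distance is <= cutoff
    Returns a list of clusters; each cluster lists its centroid index first.
    """
    n = len(dist)
    neighbours = [
        [j for j in range(n) if j != i and dist[i][j] <= cutoff]
        for i in range(n)
    ]
    # Visit items with the most neighbours first (ties keep index order,
    # because sorted() is stable).
    order = sorted(range(n), key=lambda i: -len(neighbours[i]))
    assigned = [False] * n
    clusters = []
    for i in order:
        if assigned[i]:
            continue
        # i becomes a centroid and claims its still-unassigned neighbours.
        members = [i] + [j for j in neighbours[i] if not assigned[j]]
        for m in members:
            assigned[m] = True
        clusters.append(members)
    return clusters


# Toy data: points on a line stand in for fingerprints.
coords = [20, 16, 17, 18, 22, 24, 25, 100]
dist = [[abs(a - b) for b in coords] for a in coords]
print(butina_sketch(dist, cutoff=4))
```

Because the first centroid is the item with the largest neighbour list, no other member of its cluster can have more neighbours within the cutoff.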
>>>
>>> I am far from an expert on this topic, but any alternative I can
>>> think of makes me think I should have started with something other than
>>> Taylor-Butina.
>>>
>>>
>>>
>>>                                 Andrew
>>>                                 da...@dalkescientific.com
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> Rdkit-discuss mailing list
>>> Rdkit-discuss@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>>
>>
> --
> David Cosgrove
> Freelance computational chemistry and chemoinformatics developer
> http://cozchemix.co.uk
>
>
