Re: [Rdkit-discuss] Butina clustering with additional output

Peter S. Shenkin Tue, 25 Sep 2018 08:15:02 -0700

(I see that I accidentally responded to Andrew, only, earlier; I'm copying
to the group this time.)


FWIW, in work on conformational clustering, I used the “most
representative” molecule; that is, the real molecule closest to the
mathematical centroid. This would probably be the best way of displaying a
single molecule that typifies what is in the cluster.
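
A minimal sketch of that selection step, assuming each cluster member is described by a fixed-length numeric vector (for conformers, for example, flattened coordinates after superposition) and that plain Euclidean distance is an acceptable measure. The function name `most_representative` and the toy data are illustrative only and do not come from RDKit or chemfp.

```python
from math import fsum


def most_representative(vectors):
    """Return the index of the real member closest to the mathematical centroid."""
    if not vectors:
        raise ValueError("cluster is empty")
    n = len(vectors)
    dim = len(vectors[0])
    # Mathematical centroid: the component-wise mean of all members.
    centroid = [fsum(v[k] for v in vectors) / n for k in range(dim)]

    def sq_dist(v):
        # Squared Euclidean distance to the centroid (ordering is the same
        # as for the true distance, so the square root is not needed).
        return fsum((a - b) ** 2 for a, b in zip(v, centroid))

    return min(range(n), key=lambda i: sq_dist(vectors[i]))


# Toy cluster of four 2-D points (hypothetical data).
cluster = [
    [0.0, 0.0],
    [1.0, 0.0],
    [0.0, 1.0],
    [0.4, 0.4],
]
print(most_representative(cluster))
```

Unlike the cluster seed, the member picked this way depends on every member of the cluster, so it has to be computed after clustering has finished.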

-P.

On Tue, Sep 25, 2018 at 8:09 AM, Andrew Dalke <[email protected]>
wrote:

> On Sep 21, 2018, at 14:53, Philipp Thiel <[email protected]
> tuebingen.de> wrote:
> > you probably read about the Tanimoto being a proper metric in case of
> > having binary data in Leach and Gillet 'Introduction to Chemoinformatics'
> > chapter 5.3.1 in the revised edition.
>
> What we call Tanimoto is more broadly known as the Jaccard. Various sites
> demonstrate that the Jaccard distance = 1-Jaccard = 1-Tanimoto is a metric,
> such as
> https://mathoverflow.net/questions/18084/is-the-jaccard-distance-a-distance
> and https://arxiv.org/abs/1612.02696 .
>
> Going back to James T. Metz's original question, one alternative might be
> to use chemfp and the Taylor-Butina clustering implementation available at:
>
>   http://dalkescientific.com/writings/taylor_butina.py
>
> Following Dave Cosgrove's advice:
>
> > I expect James means what we used to call the cluster seed, i.e. the
> > molecule the cluster was based on, rather than the mathematical centroid.
> > Calculating distances from each cluster member to that would be quite
> > straightforward as a post-processing step although that would roughly
> > double the time taken.
>
> it's possible to change the reporting code from:
>
>     for centroid_idx, members in clusters:
>         print(arena.ids[centroid_idx], "has", len(members),
>               "other members", file=outfile)
>         print("=>", " ".join(arena.ids[idx] for idx in members),
>               file=outfile)
>
> so it does the post-processing:
>
>     print(len(clusters), "clusters", file=outfile)
>     for centroid_idx, members in clusters:
>         print(arena.ids[centroid_idx], "has", len(members),
>               "other members", file=outfile)
>         # Search only this cluster's members, using the centroid as the query
>         subarena = arena.copy(indices=members)
>         centroid_fp = arena.get_fingerprint(centroid_idx)
>         result = subarena.threshold_tanimoto_search_fp(
>             centroid_fp, threshold=0.0)
>         result.reorder()  # sort so the highest scores come first
>         for member_id, score in result.get_ids_and_scores():
>             print("=>", member_id, "score:", score, file=outfile)
>
>
> Cheers,
>
>                                 Andrew
>                                 [email protected]
>
>
>
>
> _______________________________________________
> Rdkit-discuss mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>

