Re: [Rdkit-discuss] Butina clustering with additional output

Andrew Dalke Tue, 25 Sep 2018 05:11:07 -0700

On Sep 21, 2018, at 14:53, Philipp Thiel <[email protected]> wrote:
> you probably read about the Tanimoto being a proper metric in case of
> having binary data in Leach and Gillet 'Introduction to Chemoinformatics'
> chapter 5.3.1 in the revised edition.


What we call the Tanimoto is more broadly known as the Jaccard index. Strictly 
speaking, it is the distance form, Jaccard distance = 1 - Jaccard = 1 - Tanimoto, 
that is a metric. Various sites demonstrate this, such as 
https://mathoverflow.net/questions/18084/is-the-jaccard-distance-a-distance and 
https://arxiv.org/abs/1612.02696 .
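
As a quick sanity check (not a proof; the links above give those), here is a minimal sketch that spot-checks the triangle inequality for 1 - Tanimoto on random binary fingerprints. It is plain Python with no chemfp or RDKit involved. The fingerprints are modeled as sets of "on" bit positions, and the function names and the convention that two empty fingerprints have similarity 1.0 are my assumptions for this illustration.

```python
import itertools
import random


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity of two sets of 'on' bit positions."""
    if not a and not b:
        # Assumed convention for this sketch: two empty fingerprints are identical.
        return 1.0
    return len(a & b) / len(a | b)


def jaccard_distance(a, b):
    """Jaccard distance, i.e. 1 - Tanimoto."""
    return 1.0 - tanimoto(a, b)


def triangle_inequality_holds(fps, eps=1e-12):
    """Check d(x, z) <= d(x, y) + d(y, z) for every ordered triple in fps."""
    return all(
        jaccard_distance(x, z)
        <= jaccard_distance(x, y) + jaccard_distance(y, z) + eps
        for x, y, z in itertools.permutations(fps, 3)
    )


# 25 random 32-bit fingerprints, each with between 1 and 12 bits set.
rng = random.Random(0)
fingerprints = [
    frozenset(rng.sample(range(32), rng.randint(1, 12))) for _ in range(25)
]
print(triangle_inequality_holds(fingerprints))
```

The small eps only guards against floating-point rounding in the sum; it is not part of the definition.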

Going back to James T. Metz's original question, one alternative might be to 
use chemfp and the Taylor-Butina clustering implementation available at: 

  http://dalkescientific.com/writings/taylor_butina.py

Following Dave Cosgrove's advice: 

> I expect James means what we used to call the cluster seed, i.e. the molecule 
> the cluster was based on, rather than the mathematical centroid. Calculating 
> distances from each cluster member to that would be quite straightforward as 
> a post-processing step although that would roughly double the time taken. 

it's possible to change the reporting code from:

    for centroid_idx, members in clusters:
        print(arena.ids[centroid_idx], "has", len(members), "other members",
              file=outfile)
        print("=>", " ".join(arena.ids[idx] for idx in members), file=outfile)

so it does the post-processing:

    print(len(clusters), "clusters", file=outfile)
    for centroid_idx, members in clusters:
        print(arena.ids[centroid_idx], "has", len(members), "other members",
              file=outfile)
        # Build an arena containing only this cluster's members
        subarena = arena.copy(indices=members)
        centroid_fp = arena.get_fingerprint(centroid_idx)
        # threshold=0.0 so every member is returned along with its score
        result = subarena.threshold_tanimoto_search_fp(centroid_fp,
                                                       threshold=0.0)
        result.reorder()  # sort so the highest scores come first
        # The score is the Tanimoto similarity; the distance is 1 - score
        for member_id, score in result.get_ids_and_scores():
            print("=>", member_id, "score:", score, file=outfile)
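
For anyone who wants to see the post-processing idea without chemfp installed, here is a minimal plain-Python sketch of the same step. It assumes fingerprints stored as sets of "on" bit positions and clusters given as (seed index, member indices) pairs, mirroring the shape used in the loop above. The function name rank_members_by_seed_similarity and the toy data are hypothetical, made up for this illustration; this is not how chemfp does it internally.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two sets of 'on' bit positions."""
    if not a and not b:
        # Assumed convention for this sketch: two empty fingerprints are identical.
        return 1.0
    return len(a & b) / len(a | b)


def rank_members_by_seed_similarity(fps, ids, clusters):
    """For each (seed index, member indices) pair, score every member
    against the cluster seed and sort with the highest score first."""
    report = []
    for seed_idx, members in clusters:
        seed_fp = fps[seed_idx]
        scored = [(ids[i], tanimoto(seed_fp, fps[i])) for i in members]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        report.append((ids[seed_idx], scored))
    return report


# Toy data: four made-up fingerprints and a made-up clustering result.
ids = ["mol_a", "mol_b", "mol_c", "mol_d"]
fps = [{1, 2, 3, 4}, {1, 2, 3}, {1, 2, 7, 8}, {9}]
clusters = [(0, [2, 1]), (3, [])]

report = rank_members_by_seed_similarity(fps, ids, clusters)
for seed_id, scored in report:
    print(seed_id, "has", len(scored), "other members")
    for member_id, score in scored:
        print("=>", member_id, "score:", round(score, 3))
```

This costs one extra comparison per cluster member, which is small next to the all-by-all similarity computation that the clustering itself needs.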


Cheers,

                                Andrew
                                [email protected]



