On Sep 21, 2018, at 14:53, Philipp Thiel <[email protected]> wrote:
> you probably read about the Tanimoto being a proper metric in case of having
> binary data
> in Leach and Gillet 'Introduction to Chemoinformatics' chapter 5.3.1 in the
> revised edition.

What we call Tanimoto is more broadly known as the Jaccard index. Various
sites demonstrate that the Jaccard distance = 1-Jaccard = 1-Tanimoto is a
metric, such as
https://mathoverflow.net/questions/18084/is-the-jaccard-distance-a-distance
and https://arxiv.org/abs/1612.02696 .

Going back to James T. Metz's original question, one alternative might be
to use chemfp and the Taylor-Butina clustering implementation available at:

  http://dalkescientific.com/writings/taylor_butina.py

Following Dave Cosgrove's advice:

> I expect James means what we used to call the cluster seed, i.e. the molecule
> the cluster was based on, rather than the mathematical centroid. Calculating
> distances from each cluster member to that would be quite straightforward as
> a post-processing step although that would roughly double the time taken.

it's possible to change the reporting code from:

    for centroid_idx, members in clusters:
        print(arena.ids[centroid_idx], "has", len(members), "other members", file=outfile)
        print("=>", " ".join(arena.ids[idx] for idx in members), file=outfile)

so it does the post-processing:

    print(len(clusters), "clusters", file=outfile)
    for centroid_idx, members in clusters:
        print(arena.ids[centroid_idx], "has", len(members), "other members", file=outfile)
        # Search only the cluster members, using the cluster seed as the query
        subarena = arena.copy(indices=members)
        centroid_fp = arena.get_fingerprint(centroid_idx)
        result = subarena.threshold_tanimoto_search_fp(centroid_fp, threshold=0.0)
        result.reorder()  # sort so the highest scores come first
        for id, score in result.get_ids_and_scores():
            print("=>", id, "score:", score, file=outfile)

Cheers,

                                Andrew
                                [email protected]
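
For readers without chemfp installed, here is a minimal pure-Python sketch of what the post-processing step computes: the Tanimoto score of each cluster member against the cluster seed, sorted with the highest scores first. It also spot-checks the triangle inequality for the Jaccard distance on the same toy data. This is an illustration only and does not use the chemfp API. The fingerprints are represented as sets of on-bit positions, and the function names (`tanimoto`, `jaccard_distance`, `rank_members`, `triangle_inequality_holds`) and the molecule ids are all hypothetical.

```python
from itertools import permutations


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity of two sets of on-bit positions."""
    if not a and not b:
        # Convention assumed for this sketch: two empty fingerprints score 0.0.
        return 0.0
    return len(a & b) / len(a | b)


def jaccard_distance(a, b):
    """Jaccard distance = 1 - Tanimoto."""
    return 1.0 - tanimoto(a, b)


def rank_members(seed_fp, members):
    """Return (id, score) pairs sorted so the highest scores come first."""
    scored = [(name, tanimoto(seed_fp, fp)) for name, fp in members.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def triangle_inequality_holds(fps, tol=1e-12):
    """Check d(x, z) <= d(x, y) + d(y, z) for every ordered triple."""
    return all(
        jaccard_distance(x, z)
        <= jaccard_distance(x, y) + jaccard_distance(y, z) + tol
        for x, y, z in permutations(fps, 3)
    )


# Hypothetical toy data: one cluster seed and three cluster members.
seed_fp = {1, 2, 3, 4}
members = {
    "mol_a": {1, 2, 3},
    "mol_b": {1, 2, 5, 6},
    "mol_c": {1, 2, 3, 4, 7},
}

ranked = rank_members(seed_fp, members)
for name, score in ranked:
    print("=>", name, "score:", round(score, 3))

all_fps = [seed_fp] + list(members.values())
print("triangle inequality holds:", triangle_inequality_holds(all_fps))
```

A check on a handful of fingerprints is of course not a proof that the distance is a metric; the links above cover that. The toy fingerprints are all non-empty, so the empty-fingerprint convention does not affect the distance check.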

