You may want to look at chemfp for this sort of clustering.
Last year Chris Swain reviewed a few different ways to do clustering, at
https://www.macinchem.org/reviews/clustering/clustering.php . His data set had
4.4M fingerprints, and it took 10 hours to cluster at a 0.8 similarity threshold.
Chemfp doesn't include the Taylor-Butina algorithm as part of the distribution.
That will likely be included in the next release.
Instead, I worked with Chris to develop a version he could use for testing. It
looks like the copy from his web page is not available (the download URL
redirects to itself, producing an infinite loop).
I have put a copy at http://dalkescientific.com/writings/taylor_butina.py , if
you want to try it out.
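
For a rough idea of why the full distance matrix is the problem: 1 million
fingerprints give about 5 x 10^11 pairs, which is about 4 TB at 8 bytes per
distance, far more than 64GB. The Taylor-Butina algorithm only needs to know,
for each fingerprint, which other fingerprints are within the threshold, so
everything else can be thrown away. Below is a minimal pure-Python sketch of
that idea. It is not the code in taylor_butina.py, nor anything from chemfp or
RDKit: the function names are made up for illustration, the fingerprints are
plain sets of on-bits, and the O(N^2) neighbor search shown here is far too
slow for a data set of this size. Note that it takes a similarity threshold,
which should correspond to 1 - cutoff for a distance cutoff like the one
passed to Butina.ClusterData.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 0.0
    common = len(a & b)
    return common / float(len(a) + len(b) - common)


def threshold_neighbors(fps, threshold):
    """For each fingerprint, list the indices of the others at or above threshold.

    Only the pairs that pass the threshold are kept, not the full matrix.
    """
    neighbors = [[] for _ in fps]
    for i in range(1, len(fps)):
        for j in range(i):
            if tanimoto(fps[i], fps[j]) >= threshold:
                neighbors[i].append(j)
                neighbors[j].append(i)
    return neighbors


def taylor_butina(fps, threshold):
    """Return a list of (centroid index, [member indices]) pairs.

    Candidates are visited from most neighbors to fewest (ties broken by
    index). The counts are not recomputed after each cluster is picked.
    """
    neighbors = threshold_neighbors(fps, threshold)
    order = sorted(range(len(fps)), key=lambda i: (-len(neighbors[i]), i))
    assigned = set()
    clusters = []
    for centroid in order:
        if centroid in assigned:
            continue
        # Only neighbors not already claimed by an earlier cluster join this one.
        members = [j for j in neighbors[centroid] if j not in assigned]
        assigned.add(centroid)
        assigned.update(members)
        clusters.append((centroid, members))
    return clusters


# Toy data: two small groups of similar fingerprints and one singleton.
fps = [
    {1, 2, 3, 4},
    {1, 2, 3, 5},
    {1, 2, 3, 4, 5},
    {10, 11, 12},
    {10, 11, 12, 13},
    {20, 21},
]
clusters = taylor_butina(fps, threshold=0.55)
for centroid, members in clusters:
    print(centroid, members)
```

With a high similarity threshold most pairs are not neighbors, so the neighbor
lists stay small compared to the all-against-all matrix. The expensive part is
then the threshold search itself, which is where a dedicated similarity search
tool helps.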
> On Jan 11, 2018, at 09:27, Wandré <wandrevel...@gmail.com> wrote:
> (First of all, sorry for my poor English...)
> I'm trying to cluster a large dataset of molecules but, on a server with
> 64GB of RAM and 32 cores, all of the RAM and cache are occupied and, after 10
> hours, the clustering has still not finished.
> My set of molecules has more than 1 million hits. I'm using the atompair
> fingerprint and the clusterFPS Butina algorithm for the clustering.
> What can I do?
> I thought about calculating all the fingerprints, storing them in my relational
> PostgreSQL database (not the cartridge), and storing the result of
> BulkTanimotoSimilarity (the all-against-all distance matrix), to use less RAM
> and to allow new clusterings to run in less time (I spend 30 minutes just
> calculating all the fingerprints).
> How should I store these values (fingerprints and BulkTanimotoSimilarity)?
> Here is a part of my code:
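> fps = []
> for i in range(0, len(tb_hit_data)):
>     # This step I want to save to use less CPU time (just run once)
>     mol = Chem.MolFromInchi(tb_hit_data[i])
>     try:
>         # Assumed call (rdkit.Chem.AtomPairs.Pairs); the excerpt omits this line
>         fps.append(Pairs.GetAtomPairFingerprint(mol))
>     except Exception:
>         print("in mol %s: AtomPair cannot be generated" % tb_hit_data[i])
> clusters = self.clusterfps(fps, cutoff_value)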
> def clusterfps(cls, fps, cutoff=0.99):
>     """Cluster all of the data passed in fps, using a specific cutoff."""
>     from rdkit.ML.Cluster import Butina
>     # first generate the distance matrix:
>     dists = []
>     nfps = len(fps)
>     for i in range(1, nfps):
>         # This is the other step that I want to store in the database (just run once)
>         sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i])
>         dists.extend([1 - x for x in sims])
>     # now cluster the data:
>     cluster_data = Butina.ClusterData(dists, nfps, cutoff, isDistData=True)
>     return cluster_data
> # End def clusterfps
> Wandré Nunes de Pinho Veloso
> Professor Assistente - Unifei - Campus Avançado de Itabira-MG
> Doutorando em Bioinformática - Universidade Federal de Minas Gerais - UFMG
> Pesquisador do INSILICO - Grupo Interdisciplinar em Simulação e Inteligência
> Computacional - UNIFEI
> Membro do Grupo de Pesquisa Assinaturas Biológicas da FIOCRUZ
> Membro do Grupo de Pesquisa Bioinformática Estrutural da UFMG
> Laboratório de Bioinformática e Sistemas - LBS, DCC, UFMG
Rdkit-discuss mailing list