Hi Wandré,

You may want to look at chemfp for this sort of clustering.
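
A rough estimate shows why the all-against-all distance list in the quoted code cannot fit in 64GB. The sketch below assumes exactly 1,000,000 molecules (the question only says "more than 1 million") and 8 bytes per stored value. `condensed_pair_count` is a hypothetical helper name, and a real Python list of float objects needs more memory per entry than assumed here.

```python
def condensed_pair_count(n):
    """Number of unordered pairs (i, j) with i < j among n items."""
    return n * (n - 1) // 2


# Assumption: exactly one million molecules; the real count is larger.
n_molecules = 1000000
pairs = condensed_pair_count(n_molecules)

# Assumption: 8 bytes per value, as in a packed array of C doubles.
# A Python list of float objects uses considerably more per entry.
bytes_per_value = 8
total_bytes = pairs * bytes_per_value

print("pairs: %d" % pairs)
print("approx. size: %.1f TB" % (total_bytes / 1e12))
```

That is about 5 x 10^11 values, or roughly 4 TB at 8 bytes each. A method that keeps only the pairs above the similarity threshold avoids storing most of them.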

Last year Chris Swain reviewed a few different ways to do clustering, at
https://www.macinchem.org/reviews/clustering/clustering.php . His data set
had 4.4M fingerprints and it took 10 hours to cluster at a 0.8 similarity
threshold.

Chemfp doesn't include the Taylor-Butina algorithm as part of the
distribution. That will likely be included in the next release. Instead,
I worked with Chris to develop a version he could use for testing.

It looks like the copy from his web page is not available (the download
URL redirects to itself, producing an infinite loop). I have put a copy at
http://dalkescientific.com/writings/taylor_butina.py , if you want to try
it out.

Best regards,

Andrew
da...@dalkescientific.com

> On Jan 11, 2018, at 09:27, Wandré <wandrevel...@gmail.com> wrote:
>
> Hi,
> (first of all, sorry by my poor english...)
> I'm trying to clustering a large dataset of molecules, but, in a server with
> 64GB of RAM and 32 cores, all RAM memory and cache are occuped and, after 10
> hours, the clustering is not calculated yet.
> My set of molecules have more than 1 million of hits, I'm using the atompair
> fingerprint and clusterFPS Butina algorithm to clustering.
> What can I do?
> I thought about calculating all the fingerprints, store them in my relational
> PostgreSQL database (not cartridge), store the result of
> BulkTanimotoSimilarity (distance matrix, all against all) to use less RAM and
> allow to run new clustering in minor time (I spend 30 minutes just to
> calculate all fingerprints).
> How to store this values (fingerprint and BulkTanimotoSimilarity)?
> Here is a part of my code:
>
> for i in range(0, len(tb_hit_data)):
>     try:
>         #This step I want to save to use less CPU time (just run once)
>         mol = Chem.MolFromInchi(tb_hit_data[i][1])
>         fps.append(Pairs.GetAtomPairFingerprint(mol))
>         ids.append(tb_hit_data[i][0])
>     except:
>         print "in mol", tb_hit_data[i][0], "AtomPair cannot be generated"
> clusters = self.clusterfps(fps, cutoff_value)
>
>
> def clusterfps(cls, fps, cutoff=0.99):
>     """Method that clustering all data, passed in fps, with an specific cutoff
>     """
>     from rdkit.ML.Cluster import Butina
>
>     # first generate the distance matrix:
>     dists = []
>     nfps = len(fps)
>     for i in range(1, nfps):
>         #This is other step that I want to store in database (just run once)
>         sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i])
>         dists.extend([1 - x for x in sims])
>
>     # now cluster the data:
>     cluster_data = Butina.ClusterData(dists, nfps, cutoff, isDistData=True)
>     return cluster_data
> # End def clusterfps
>
> Thanks!
> --
> Wandré Nunes de Pinho Veloso
> Professor Assistente - Unifei - Campus Avançado de Itabira-MG
> Doutorando em Bioinformática - Universidade Federal de Minas Gerais - UFMG
> Pesquisador do INSILICO - Grupo Interdisciplinar em Simulação e Inteligência
> Computacional - UNIFEI
> Membro do Grupo de Pesquisa Assinaturas Biológicas da FIOCRUZ
> Membro do Grupo de Pesquisa Bioinformática Estrutural da UFMG
> Laboratório de Bioinformática e Sistemas - LBS, DCC, UFMG

_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
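
The quoted clusterfps function stores every pairwise distance before clustering. Taylor-Butina only needs, for each fingerprint, the list of neighbors at or above the similarity threshold. The following is a minimal pure-Python sketch of that idea. It is not chemfp's code and not the taylor_butina.py script mentioned above. Fingerprints are modelled as plain Python sets of "on" bits, and the names tanimoto, neighbor_lists and taylor_butina are hypothetical. The neighbor search here is still O(N^2) in time, so the sketch shows only the data flow, not a fast implementation.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two sets of 'on' bits."""
    if not a and not b:
        return 0.0
    common = len(a & b)
    return common / float(len(a) + len(b) - common)


def neighbor_lists(fps, threshold):
    """For each fingerprint, list the indices of the others whose
    similarity is >= threshold. Only the hits are kept in memory."""
    n = len(fps)
    neighbors = [[] for _ in range(n)]
    for i in range(1, n):
        for j in range(i):
            if tanimoto(fps[i], fps[j]) >= threshold:
                neighbors[i].append(j)
                neighbors[j].append(i)
    return neighbors


def taylor_butina(neighbors):
    """Greedy Taylor-Butina clustering from precomputed neighbor lists.

    Returns a list of tuples: (centroid, member, member, ...).
    """
    # Candidate centroids, most neighbors first; ties broken by index.
    order = sorted(range(len(neighbors)),
                   key=lambda i: (-len(neighbors[i]), i))
    assigned = set()
    clusters = []
    for centroid in order:
        if centroid in assigned:
            continue
        # Unassigned neighbors of the centroid become cluster members.
        members = [j for j in neighbors[centroid] if j not in assigned]
        assigned.add(centroid)
        assigned.update(members)
        clusters.append(tuple([centroid] + members))
    return clusters


# Toy data (made up for illustration): two groups and one singleton.
fps = [
    {1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 4, 5},
    {10, 11, 12, 13}, {10, 11, 12, 14},
    {20, 21},
]
clusters = taylor_butina(neighbor_lists(fps, threshold=0.5))
print(clusters)
```

The threshold in this sketch is a similarity. The cutoff passed to Butina.ClusterData in the quoted code is a distance, so a distance cutoff c corresponds to a similarity threshold of 1 - c.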