Hi Wandré,

  You may want to look at chemfp for this sort of clustering.

Last year Chris Swain reviewed a few different ways to do clustering, at 
https://www.macinchem.org/reviews/clustering/clustering.php . His data set had 
4.4M fingerprints and it took 10 hours to cluster at a 0.8 similarity threshold.

Chemfp doesn't include the Taylor-Butina algorithm as part of the distribution. 
That will likely be included in the next release.

Instead, I worked with Chris to develop a version he could use for testing. It 
looks like the copy from his web page is not available (the download URL 
redirects to itself, producing an infinite loop).

I have put a copy at http://dalkescientific.com/writings/taylor_butina.py , if 
you want to try it out.
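
To give a rough idea of why the threshold-based approach scales better: a full 
all-against-all distance list for 1 million molecules has about 5 x 10^11 
entries, which cannot fit in 64GB of RAM, while Taylor-Butina only needs to 
know which pairs are at or above the similarity threshold. Below is a minimal 
pure-Python sketch of that idea. It is not the chemfp API and not the contents 
of taylor_butina.py; the function names are hypothetical, and the fingerprints 
are plain Python sets of "on" bits, an assumption made only to keep the example 
free of dependencies. The neighbor search here is still a slow O(n^2) loop, so 
it illustrates the memory layout, not the speed.

```python
def find_neighbors(fps, threshold):
    """For each fingerprint, collect the indices of all other fingerprints
    whose Tanimoto similarity is >= threshold.

    `fps` is a list of sets of 'on' bit positions (illustrative assumption).
    Only pairs at or above the threshold are stored, not the full matrix.
    """
    n = len(fps)
    neighbors = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i):
            union = len(fps[i] | fps[j])
            sim = len(fps[i] & fps[j]) / float(union) if union else 0.0
            if sim >= threshold:
                neighbors[i].add(j)
                neighbors[j].add(i)
    return neighbors


def taylor_butina(neighbors):
    """Greedy Taylor-Butina clustering from a sparse neighbor list.

    Candidates are visited from most to fewest neighbors (ties broken by
    index). Each unassigned candidate becomes a centroid and claims all of
    its neighbors that are still unassigned.
    Returns a list of (centroid_index, sorted_member_indices) tuples.
    """
    order = sorted(range(len(neighbors)),
                   key=lambda i: (-len(neighbors[i]), i))
    assigned = set()
    clusters = []
    for centroid in order:
        if centroid in assigned:
            continue
        members = sorted(neighbors[centroid] - assigned)
        assigned.add(centroid)
        assigned.update(members)
        clusters.append((centroid, members))
    return clusters


# Tiny made-up data set: three similar fingerprints, two similar
# fingerprints, and one singleton.
example_fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 4, 5},
               {10, 11}, {10, 11, 12}, {20}]
clusters = taylor_butina(find_neighbors(example_fps, 0.5))
for centroid, members in clusters:
    print("centroid %d, members %s" % (centroid, members))
```

The clustering step itself only touches the stored neighbor sets, so its 
memory use grows with the number of similar pairs and not with the square of 
the number of molecules.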

Best regards,


> On Jan 11, 2018, at 09:27, Wandré <wandrevel...@gmail.com> wrote:
> Hi,
> (First of all, sorry for my poor English...)
> I'm trying to cluster a large dataset of molecules, but on a server with 
> 64GB of RAM and 32 cores, all of the RAM and cache are used up and, after 10 
> hours, the clustering has still not finished.
> My set of molecules has more than 1 million hits. I'm using the atompair 
> fingerprint and the clusterFPS Butina algorithm for the clustering.
> What can I do?
> I thought about calculating all the fingerprints, storing them in my relational 
> PostgreSQL database (no cartridge), and storing the result of 
> BulkTanimotoSimilarity (the all-against-all distance matrix) to use less RAM 
> and to let new clusterings run in less time (I spend 30 minutes just 
> calculating all the fingerprints).
> How can I store these values (fingerprints and BulkTanimotoSimilarity results)?
> Here is part of my code:
> from rdkit import Chem, DataStructs
> from rdkit.Chem.AtomPairs import Pairs
> for i in range(0, len(tb_hit_data)):
>     try:
>         # I want to save the result of this step to use less CPU time (run once)
>         mol = Chem.MolFromInchi(tb_hit_data[i][1])
>         fps.append(Pairs.GetAtomPairFingerprint(mol))
>         ids.append(tb_hit_data[i][0])
>     except Exception:
>         # MolFromInchi returns None on failure, which makes the next call raise
>         print("in mol %s AtomPair cannot be generated" % (tb_hit_data[i][0],))
> clusters = self.clusterfps(fps, cutoff_value)
> def clusterfps(cls, fps, cutoff=0.99):
>     """Cluster all the fingerprints passed in fps using a specific cutoff.
>     """
>     from rdkit.ML.Cluster import Butina
>     # first generate the distance matrix (lower triangle, as a flat list):
>     dists = []
>     nfps = len(fps)
>     for i in range(1, nfps):
>         # This is the other step I want to store in the database (run once)
>         sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i])
>         dists.extend([1 - x for x in sims])
>     # now cluster the data:
>     cluster_data = Butina.ClusterData(dists, nfps, cutoff, isDistData=True)
>     return cluster_data
> # End def clusterfps
> Thanks!
> --
> Wandré Nunes de Pinho Veloso
> Assistant Professor - Unifei - Campus Avançado de Itabira-MG
> PhD candidate in Bioinformatics - Universidade Federal de Minas Gerais - UFMG
> Researcher at INSILICO - Grupo Interdisciplinar em Simulação e Inteligência 
> Computacional - UNIFEI
> Member of the Assinaturas Biológicas research group at FIOCRUZ
> Member of the Bioinformática Estrutural research group at UFMG
> Laboratório de Bioinformática e Sistemas - LBS, DCC, UFMG

Rdkit-discuss mailing list
