(First of all, sorry for my poor English...)
I'm trying to cluster a large dataset of molecules, but on a server
with 64 GB of RAM and 32 cores, all of the RAM and cache are used up and,
after 10 hours, the clustering has still not finished.
My set of molecules has more than 1 million hits. I'm using the
atom pair fingerprint and the Butina algorithm (my clusterfps method) for the clustering.
What can I do?
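For scale, here is a rough back-of-the-envelope estimate of what the full lower-triangle distance list would need. It is only a sketch: it assumes exactly 1,000,000 molecules and 8 bytes per stored distance, and the helper name `condensed_pairs` is made up for illustration.

```python
def condensed_pairs(n):
    """Number of unique pairs (i, j) with j < i among n items."""
    return n * (n - 1) // 2

n_mols = 1000000       # assumption: exactly one million molecules
bytes_per_value = 8    # assumption: one 64-bit float per distance

n_pairs = condensed_pairs(n_mols)
total_bytes = n_pairs * bytes_per_value
total_tb = total_bytes / 1e12   # decimal terabytes

print(n_pairs)    # 499999500000 pairwise distances
print(total_tb)   # roughly 4 TB, far more than 64 GB of RAM
```

A plain Python list of float objects needs even more memory per element than the 8 bytes assumed here, so the real footprint of `dists` would be larger still.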
I thought about calculating all the fingerprints once and storing them in my
relational PostgreSQL database (without the cartridge), and also storing the results
of BulkTanimotoSimilarity (the all-against-all distance matrix), in order to use less
RAM and to run new clusterings in less time (it takes 30 minutes just
to calculate all the fingerprints).
How can I store these values (the fingerprints and the BulkTanimotoSimilarity results)?
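One possible shape for the fingerprint part of this idea is a table with one binary blob per hit. The sketch below is only an illustration under stated assumptions: it uses the standard-library `sqlite3` module as a stand-in for PostgreSQL (where the column type would presumably be BYTEA), the table and function names are hypothetical, and the blobs are placeholder bytes. With RDKit, the bytes would presumably come from the fingerprint's `ToBinary()` method.

```python
import sqlite3

def create_table(conn):
    # Hypothetical schema: one row per hit, fingerprint stored as a blob.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS hit_fingerprint ("
        "hit_id INTEGER PRIMARY KEY, fp BLOB NOT NULL)"
    )

def save_fingerprints(conn, rows):
    """Store (hit_id, fingerprint_bytes) pairs, replacing existing rows."""
    conn.executemany(
        "INSERT OR REPLACE INTO hit_fingerprint (hit_id, fp) VALUES (?, ?)",
        [(hit_id, sqlite3.Binary(blob)) for hit_id, blob in rows],
    )
    conn.commit()

def load_fingerprints(conn):
    """Return all stored (hit_id, fingerprint_bytes) pairs, ordered by hit_id."""
    cur = conn.execute("SELECT hit_id, fp FROM hit_fingerprint ORDER BY hit_id")
    return [(hit_id, bytes(blob)) for hit_id, blob in cur]

# In-memory database, used here only so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
create_table(conn)

# Placeholder bytes standing in for serialized fingerprints.
save_fingerprints(conn, [(1, b"\x01\x02\x03"), (2, b"\x00\xff")])
loaded = load_fingerprints(conn)
print(loaded)
```

Storing the fingerprints this way would avoid the 30-minute recalculation. It does not by itself make the all-against-all similarity results small enough to store.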
Here is a part of my code:

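from rdkit.Chem.AtomPairs import Pairs  # assumed; Chem is imported elsewhere
fps = []
for i in range(0, len(tb_hit_data)):
    # This is the step I want to save, to use less CPU time (run just once)
    try:
        mol = Chem.MolFromInchi(tb_hit_data[i][1])
        fps.append(Pairs.GetAtomPairFingerprint(mol))  # assumed: line missing from the excerpt
    except Exception:
        print("in mol %s: AtomPair cannot be generated" % tb_hit_data[i][0])
clusters = self.clusterfps(fps, cutoff_value)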

def clusterfps(cls, fps, cutoff=0.99):
    """Cluster all the data passed in fps with a specific cutoff.

    cutoff is a distance threshold (distance = 1 - Tanimoto similarity).
    """
    from rdkit import DataStructs
    from rdkit.ML.Cluster import Butina

    # first generate the distance matrix:
    dists = []
    nfps = len(fps)
    for i in range(1, nfps):
        # This is the other step whose result I want to store in the database (run just once)
        sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i])
        dists.extend([1 - x for x in sims])

    # now cluster the data:
    cluster_data = Butina.ClusterData(dists, nfps, cutoff, isDistData=True)
    return cluster_data
# End def clusterfps
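
The function above keeps every pairwise distance in `dists`. One alternative that might reduce memory is to keep only the neighbor pairs that fall within the cutoff and to cluster from those lists. The pure-Python sketch below illustrates the idea under loud assumptions: fingerprints are modeled as plain sets of on-bits (atom pair fingerprints are really count vectors), all function names are hypothetical, and the greedy step is a simplified Butina-style pass, not RDKit's implementation. It still compares every pair, so it saves memory, not time.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two sets of on-bits."""
    if not a and not b:
        return 0.0
    inter = len(a & b)
    return inter / float(len(a) + len(b) - inter)

def neighbor_lists(fps, cutoff):
    """For each item, the indices of the items within the distance cutoff."""
    nbrs = [[] for _ in fps]
    for i in range(1, len(fps)):
        for j in range(i):
            # Only pairs inside the cutoff are stored; the rest are discarded.
            if 1.0 - tanimoto(fps[i], fps[j]) <= cutoff:
                nbrs[i].append(j)
                nbrs[j].append(i)
    return nbrs

def cluster_from_neighbors(nbrs):
    """Greedy pass: items with the most neighbors become centroids first."""
    order = sorted(range(len(nbrs)), key=lambda i: (-len(nbrs[i]), i))
    assigned = [False] * len(nbrs)
    clusters = []
    for i in order:
        if assigned[i]:
            continue
        # Centroid first, followed by its still-unassigned neighbors.
        members = [i] + [j for j in nbrs[i] if not assigned[j]]
        for m in members:
            assigned[m] = True
        clusters.append(tuple(members))
    return clusters

# Toy data: six made-up "fingerprints" as sets of on-bits.
toy_fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 4, 5},
           {10, 11}, {10, 11, 12}, {20}]
toy_clusters = cluster_from_neighbors(neighbor_lists(toy_fps, cutoff=0.5))
print(toy_clusters)  # [(0, 1, 2), (3, 4), (5,)]
```

The memory saving depends on the cutoff: with a distance cutoff close to 1, almost every pair counts as a neighbor and the lists become nearly as large as the full matrix.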

Wandré Nunes de Pinho Veloso
Assistant Professor - Unifei - Campus Avançado de Itabira-MG
PhD candidate in Bioinformatics - Universidade Federal de Minas Gerais - UFMG
Researcher at INSILICO - Grupo Interdisciplinar em Simulação e
Inteligência Computacional - UNIFEI
Member of the Assinaturas Biológicas research group at FIOCRUZ
Member of the Bioinformática Estrutural research group at UFMG
Laboratório de Bioinformática e Sistemas - LBS, DCC, UFMG
Rdkit-discuss mailing list
