Hi Andrew,

Thanks for the link. It is very interesting, and I will read it carefully.
So, as input to chemfp, do I need to provide a single SDF file containing all the molecules?

--
Wandré Nunes de Pinho Veloso
Assistant Professor - Unifei - Campus Avançado de Itabira-MG
PhD candidate in Bioinformatics - Universidade Federal de Minas Gerais - UFMG
Researcher at INSILICO - Grupo Interdisciplinar em Simulação e
Inteligência Computacional - UNIFEI
Member of the Assinaturas Biológicas research group at FIOCRUZ
Member of the Bioinformática Estrutural research group at UFMG
Laboratório de Bioinformática e Sistemas - LBS, DCC, UFMG

2018-01-11 6:59 GMT-02:00 Andrew Dalke <da...@dalkescientific.com>:

> Hi Wandré,
>
>   You may want to look at chemfp for this sort of clustering.
>
> Last year Chris Swain reviewed a few different ways to do clustering, at
> https://www.macinchem.org/reviews/clustering/clustering.php . His data
> set had 4.4M fingerprints, and it took 10 hours to cluster at a 0.8
> similarity threshold.
>
> Chemfp doesn't include the Taylor-Butina algorithm as part of the
> distribution. That will likely be included in the next release.
>
> Instead, I worked with Chris to develop a version he could use for
> testing. It looks like the copy from his web page is not available (the
> download URL redirects to itself, producing an infinite loop).
>
> I have put a copy at http://dalkescientific.com/writings/taylor_butina.py
> if you want to try it out.
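
For readers who want to see the idea before downloading the script, here is a minimal pure-Python sketch of Taylor-Butina clustering that keeps only neighbor lists instead of the full distance matrix. It is not the taylor_butina.py script linked above and it does not use chemfp. The names tanimoto and sparse_butina, and the representation of a fingerprint as a set of on-bit indices, are assumptions made for illustration. Note that it takes a similarity threshold, not a distance cutoff.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 0.0
    common = len(a & b)
    return common / (len(a) + len(b) - common)


def sparse_butina(fps, threshold):
    """Cluster fingerprints, storing only pairs at or above the threshold.

    Returns a list of clusters; each is a list of indices, centroid first.
    """
    n = len(fps)
    neighbors = [[] for _ in range(n)]
    # Still O(n^2) comparisons, but memory grows only with the number of
    # pairs that pass the threshold, not with n*(n-1)/2.
    for i in range(1, n):
        for j in range(i):
            if tanimoto(fps[i], fps[j]) >= threshold:
                neighbors[i].append(j)
                neighbors[j].append(i)

    # Visit candidates from most to fewest neighbors (ties: lowest index).
    order = sorted(range(n), key=lambda i: (-len(neighbors[i]), i))
    assigned = [False] * n
    clusters = []
    for centroid in order:
        if assigned[centroid]:
            continue
        assigned[centroid] = True
        members = [centroid]
        for j in neighbors[centroid]:
            if not assigned[j]:
                assigned[j] = True
                members.append(j)
        clusters.append(members)
    return clusters


# Toy data (made up): two obvious groups and one singleton.
example_fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 4, 5},
               {10, 11}, {10, 11, 12}, {20}]
clusters = sparse_butina(example_fps, 0.5)
```

At a high similarity threshold most pairs are not neighbors, so the neighbor lists stay small compared with a stored all-against-all matrix. The pairwise loop here is only for illustration and would be far too slow in pure Python for a million fingerprints.
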
>
> Best regards,
>
>                                 Andrew
>                                 da...@dalkescientific.com
>
>
> > On Jan 11, 2018, at 09:27, Wandré <wandrevel...@gmail.com> wrote:
> >
> > Hi,
> > (First of all, sorry for my poor English...)
> > I'm trying to cluster a large dataset of molecules, but on a server with
> > 64 GB of RAM and 32 cores, all the RAM and cache are used up and, after
> > 10 hours, the clustering has still not finished.
> > My set of molecules has more than 1 million hits. I'm using the atom pair
> > fingerprint and the Butina algorithm (via clusterfps) for the clustering.
> > What can I do?
> > I thought about calculating all the fingerprints once and storing them in
> > my relational PostgreSQL database (not the cartridge), together with the
> > results of BulkTanimotoSimilarity (the all-against-all distance matrix),
> > to use less RAM and to let new clusterings run in less time (I spend
> > 30 minutes just calculating all the fingerprints).
> > How should I store these values (the fingerprints and the
> > BulkTanimotoSimilarity results)?
> > Here is part of my code:
> >
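> > from rdkit import Chem
> > from rdkit.Chem.AtomPairs import Pairs
> >
> > fps, ids = [], []  # assumed to be initialized before the loop
> > for row in tb_hit_data:
> >     hit_id, inchi = row[0], row[1]
> >     # This is the step I want to save, to use less CPU time (run only once)
> >     mol = Chem.MolFromInchi(inchi)
> >     if mol is None:
> >         # MolFromInchi returns None when the InChI cannot be parsed
> >         print("in mol", hit_id, "AtomPair cannot be generated")
> >         continue
> >     fps.append(Pairs.GetAtomPairFingerprint(mol))
> >     ids.append(hit_id)
> > clusters = self.clusterfps(fps, cutoff_value)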
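
The fingerprint loop above is the step the message wants to run only once. One possible way to do that is to serialize each fingerprint to bytes and keep it in a binary column. The sketch below uses the standard-library sqlite3 module as a stand-in for PostgreSQL (where the column type would be BYTEA). The table name, column names, and helper functions are hypothetical, and the byte strings are placeholders: with RDKit they would come from the fingerprint's own binary serialization (to my knowledge a ToBinary() method, which is an assumption to check against the RDKit documentation).

```python
import sqlite3


def open_store(path=":memory:"):
    """Open the database and create the (hypothetical) fingerprint table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fingerprint ("
        " hit_id INTEGER PRIMARY KEY,"
        " fp BLOB NOT NULL)"
    )
    return conn


def save_fingerprints(conn, rows):
    """Store (hit_id, fp_bytes) pairs in one transaction."""
    with conn:
        conn.executemany(
            "INSERT OR REPLACE INTO fingerprint (hit_id, fp) VALUES (?, ?)",
            rows,
        )


def load_fingerprints(conn):
    """Return all stored (hit_id, fp_bytes) pairs, ordered by hit_id."""
    cur = conn.execute("SELECT hit_id, fp FROM fingerprint ORDER BY hit_id")
    return [(hit_id, bytes(fp)) for hit_id, fp in cur]


# Placeholder payloads; real ones would be serialized fingerprints.
conn = open_store()
save_fingerprints(conn, [(1, b"\x01\x02"), (2, b"\x03")])
stored = load_fingerprints(conn)
```

Storing one row per molecule grows linearly with the dataset, so it is practical for the fingerprints. The same is not true of the all-against-all similarities, whose count grows quadratically.
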
> >
> >
> > def clusterfps(cls, fps, cutoff=0.99):
> >     """Cluster the fingerprints in fps with Butina, using a distance cutoff."""
> >     from rdkit import DataStructs
> >     from rdkit.ML.Cluster import Butina
> >
> >     # First generate the distance matrix (lower triangle, flattened):
> >     dists = []
> >     nfps = len(fps)
> >     for i in range(1, nfps):
> >         # This is the other step whose result I want to store in the
> >         # database (run only once)
> >         sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i])
> >         dists.extend([1 - x for x in sims])
> >
> >     # Now cluster the data; with isDistData=True the cutoff is a distance:
> >     cluster_data = Butina.ClusterData(dists, nfps, cutoff, isDistData=True)
> >     return cluster_data
> > # End def clusterfps
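
A rough back-of-the-envelope check, in plain Python, of why the dists list built in clusterfps cannot fit in 64 GB for about 1 million molecules. The figure of 8 bytes per value is an assumption (a packed array of C doubles); a Python list of float objects needs several times more.

```python
def condensed_distance_count(n):
    """Number of pairwise distances for n items (lower triangle, no diagonal)."""
    return n * (n - 1) // 2


n_molecules = 1_000_000  # "more than 1 million" in the message
n_dists = condensed_distance_count(n_molecules)

# Assumption: 8 bytes per value, as in a packed array of C doubles.
bytes_packed = n_dists * 8

print(f"{n_dists:,} distances")
print(f"{bytes_packed / 1e12:.1f} TB as packed 8-byte floats")
```

This prints 499,999,500,000 distances and 4.0 TB, which is well beyond the server's RAM before Butina.ClusterData is even called. Any approach that materializes the full all-against-all list, in memory or in a database table, runs into the same quadratic growth.
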
> >
> > Thanks!
>
>
>