Hi,

Is there anyone who has actually done this: clustered >2M compounds using any well-known clustering algorithm, and is willing to share code and some performance statistics?
It's easy to get a sparse distance matrix using chemfp. But if you take this matrix and feed it into any of the scipy.cluster routines, you won't get any results in a reasonable time.

We also tried extracting the 10 most significant features from the latent representation described in this paper: https://arxiv.org/pdf/1610.02415v1.pdf for all compounds in ChEMBL, and then using this web-based tool to generate a visualization: https://github.com/tensorflow/embedding-projector-standalone but we didn't get anything useful from this.

My last attempt was to use the sfdp tool from the graphviz package to get some sort of primitive clustering. I allocated a lot of RAM to the process, but again without any luck.

I would be interested in all kinds of hints related to clustering millions of compounds, especially using DBSCAN/OPTICS-based clustering algorithms.

Regards,
Michał Nowotka

On Mon, Jun 5, 2017 at 9:19 AM, Gonzalo Colmenarejo <colmenarejo.gonz...@gmail.com> wrote:
> Hi Chris,
>
> as far as I know, Butina's sphere exclusion algorithm is the fastest for
> very large datasets. But if you have 4 million compounds, using RDKit
> directly can result in very long runs, even after parallelization. For that
> number of molecules I think there are faster things, like chemfp (see for
> instance
> https://chemfp.readthedocs.io/en/latest/using-api.html#taylor-butina-clustering).
>
> Cheers
>
> Gonzalo
>
> On Sun, Jun 4, 2017 at 3:12 PM, Maciek Wójcikowski <mac...@wojcikowski.pl>
> wrote:
>>
>> Is there a big difference in the quality of the final dataset between
>> K-means and random under-sampling of a big database (~20M)?
>>
>> ----
>> Best regards,
>> Maciek Wójcikowski
>> mac...@wojcikowski.pl
>>
>> 2017-06-04 12:24 GMT+02:00 Samo Turk <samo.t...@gmail.com>:
>>>
>>> Hi Chris,
>>>
>>> There are other options for clustering. According to this:
>>> http://hdbscan.readthedocs.io/en/latest/performance_and_scalability.html
>>> HDBSCAN and K-means scale well.
>>> HDBSCAN will find clusters based on density and it also allows for
>>> outliers, but it can be fiddly to find the right parameters. You cannot
>>> specify the number of clusters (as in the Butina case).
>>> If you want to specify the number of clusters, you can simply use K-means.
>>> High dimensionality of fingerprints might be a problem for memory
>>> consumption. In this case you can use PCA to reduce dimensions to something
>>> manageable. To avoid memory issues with PCA and speed things up I would fit
>>> the model on a random 100k compounds and then just use the transform method
>>> on the rest.
>>> http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
>>>
>>> Cheers,
>>> Samo
>>>
>>> On Sun, Jun 4, 2017 at 9:08 AM, Chris Swain <sw...@mac.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I want to do clustering on around 4 million structures
>>>>
>>>> The Rdkit cookbook (http://www.rdkit.org/docs/Cookbook.html) suggests
>>>>
>>>> "For large sets of molecules (more than 1000-2000), it's most efficient
>>>> to use the Butina clustering algorithm"
>>>>
>>>> However it is quite a step up from a few thousand to several million
>>>> and I wondered if anyone had used this algorithm on larger data sets?
>>>>
>>>> As far as I can tell it is not possible to define the number of
>>>> clusters, is this correct?
>>>>
>>>> Cheers,
>>>>
>>>> Chris
>>>>
>>>> ------------------------------------------------------------------------------
>>>> Check out the vibrant tech community on one of the world's most
>>>> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
>>>> _______________________________________________
>>>> Rdkit-discuss mailing list
>>>> Rdkit-discuss@lists.sourceforge.net
>>>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
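
For readers following the thread, here is a minimal sketch of the sphere exclusion (Taylor-Butina) idea discussed above. It is a toy, pure-Python illustration and not the RDKit or chemfp implementation: the function names, the use of Python ints as bit fingerprints, and the 8-bit example fingerprints are all assumptions made for the example. It builds the full neighbor lists in O(n^2) time, so it is only meant to show the logic, not to be run on millions of compounds.

```python
from typing import List, Sequence


def tanimoto(a: int, b: int) -> float:
    """Tanimoto similarity of two fingerprints stored as Python ints (bitsets)."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0
    return bin(a & b).count("1") / union


def butina_clusters(fps: Sequence[int], threshold: float) -> List[List[int]]:
    """Toy Taylor-Butina (sphere exclusion) clustering.

    Returns clusters as lists of indices into fps; the first index in
    each cluster is its centroid. Quadratic in the number of fingerprints.
    """
    n = len(fps)
    # Neighbor lists: every pair with similarity >= threshold.
    neighbors: List[List[int]] = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if tanimoto(fps[i], fps[j]) >= threshold:
                neighbors[i].append(j)
                neighbors[j].append(i)

    # Visit candidates by decreasing neighbor count (ties: lower index first).
    order = sorted(range(n), key=lambda i: (-len(neighbors[i]), i))
    assigned = [False] * n
    clusters: List[List[int]] = []
    for i in order:
        if assigned[i]:
            continue
        # The centroid claims all of its neighbors that are still free.
        members = [i] + [j for j in neighbors[i] if not assigned[j]]
        for j in members:
            assigned[j] = True
        clusters.append(members)
    return clusters


# Hypothetical 8-bit fingerprints, for illustration only.
fps = [0b11110000, 0b11100000, 0b11110001, 0b00001111, 0b00000111, 0b10101010]
clusters = butina_clusters(fps, threshold=0.7)
print(clusters)
```

Note that the number of clusters is never passed in: it falls out of the similarity threshold. With the toy data above, a threshold of 0.7 gives three clusters (one of them a singleton), and lowering the threshold to 0.3 merges them into two.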