Similarity search on a database of 4 million molecules is pretty quick with chemfp or FPSim2. Do you need to do the clustering at all?
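To make that concrete, here is a minimal sketch of what a similarity-based screen against a set of actives computes, in plain Python. The fingerprints are made-up 8-bit integers, and the `screen` helper and the 0.7 threshold are assumptions for illustration only, not part of the chemfp or FPSim2 API. Those tools do the same kind of calculation on real fingerprints far faster than this loop.

```python
# Toy illustration only: fingerprints are Python ints used as bit vectors.
# Real fingerprints would come from a cheminformatics toolkit, and a tool
# such as chemfp or FPSim2 would replace the loop in screen().

def popcount(x):
    """Number of set bits in a non-negative int."""
    return bin(x).count("1")


def tanimoto(a, b):
    """Tanimoto similarity of two bit-vector fingerprints."""
    union = popcount(a | b)
    return popcount(a & b) / union if union else 0.0


def screen(actives, library, threshold=0.7):
    """Score each library entry by its best similarity to any active.

    Returns (name, score) pairs with score >= threshold, best first.
    """
    hits = []
    for name, fp in library.items():
        score = max(tanimoto(fp, active) for active in actives)
        if score >= threshold:
            hits.append((name, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical 8-bit fingerprints, made up for this example.
actives = [0b11110000, 0b00111100]
library = {
    "mol_a": 0b11110000,  # identical to the first active
    "mol_b": 0b11100000,  # shares 3 of 4 bits with the first active
    "mol_c": 0b00000011,  # shares no bits with either active
}

hits = screen(actives, library, threshold=0.7)
for name, score in hits:
    print(f"{name}: {score:.2f}")
```

Scoring the whole library directly against the actives like this needs no distance matrix and no clustering step.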
Here are a couple of relevant blog posts.

http://practicalcheminformatics.blogspot.com/2020/10/what-do-molecules-that-look-like-this.html
http://practicalcheminformatics.blogspot.com/2021/09/similarity-search-and-some-cool-pandas.html

Pat

On Sun, May 1, 2022 at 12:12 PM Tristan Camilleri <tristan.camilleri...@um.edu.mt> wrote:

> Thank you both for the feedback.
>
> My primary aim is to run an LBVS experiment (similarity search) using a
> set of actives and the dataset of cluster representatives.
>
> On Sun, 1 May 2022, 17:09 Patrick Walters, <wpwalt...@gmail.com> wrote:
>
>> For me, a lot of this depends on what you intend to do with the
>> clustering. If you want to pick a "representative" subset from a larger
>> dataset, k-means may do the trick. As Rajarshi mentioned, Practical
>> Cheminformatics has a k-means implementation that runs with FAISS.
>> Depending on your goal, choosing a subset with a diversity picker may fit
>> the bill. One annoying aspect of diversity pickers is that the initial
>> selections tend to consist of strange molecules.
>>
>> @Tristan can you provide more information on what you want to do with the
>> clustering results?
>>
>> Pat
>>
>> On Sun, May 1, 2022 at 10:46 AM Rajarshi Guha <rajarshi.g...@gmail.com>
>> wrote:
>>
>>> You could consider using FAISS. An example of clustering 2.1M cmpds is
>>> described at
>>> http://practicalcheminformatics.blogspot.com/2019/04/clustering-21-million-compounds-for-5.html
>>>
>>> On Sun, May 1, 2022 at 9:23 AM Tristan Camilleri <
>>> tristan.camilleri...@um.edu.mt> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am attempting to cluster a database of circa 4M small molecules and I
>>>> have hit several snags.
>>>> Using BulkTanimoto is not possible due to the resources required.
>>>> I am now working with FPSim2 and chemfp to get a distance matrix (sparse
>>>> matrix). However, I am finding it very challenging to identify an
>>>> appropriate clustering algorithm. I have considered both k-medoids and
>>>> DBSCAN. Each has its own limitation: k-medoids requires stating the
>>>> number of clusters up front, and DBSCAN does not yield centroids.
>>>>
>>>> I was wondering whether there is an implementation of the stochastic
>>>> clustering analysis for clustering purposes, described in
>>>> https://doi.org/10.1021/ci970056l .
>>>>
>>>> Any suggestions on the best method for clustering large datasets, with
>>>> code suggestions, would be greatly appreciated. I am new to the subject and
>>>> would appreciate any help.
>>>>
>>>> Regards,
>>>> Tristan
>>>
>>> --
>>> Rajarshi Guha | http://blog.rguha.net | @rguha
>>> <https://twitter.com/rguha>
_______________________________________________ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss