For me, a lot of this depends on what you intend to do with the clustering. If you want to pick a "representative" subset from a larger dataset, k-means may do the trick. As Rajarshi mentioned, Practical Cheminformatics has a k-means implementation that runs with FAISS. Depending on your goal, choosing a subset with a diversity picker may fit the bill. One annoying aspect of diversity pickers is that the initial selections tend to consist of strange molecules.
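To make the point about diversity pickers concrete, here is a minimal plain-Python sketch of greedy MaxMin picking on a toy dataset. This is not the RDKit picker or anything from the blog post. The fingerprints are hypothetical sets of on-bit indices, and `tanimoto_distance` and `maxmin_pick` are names made up for illustration. For 4M molecules you would want a real implementation.

```python
def tanimoto_distance(a, b):
    """1 - Tanimoto similarity for fingerprints stored as sets of on-bit indices."""
    union = len(a | b)
    if union == 0:
        return 0.0
    return 1.0 - len(a & b) / union


def maxmin_pick(fps, n_pick, first=0):
    """Greedy MaxMin: repeatedly pick the item farthest from everything picked so far."""
    if not 0 < n_pick <= len(fps):
        raise ValueError("n_pick must be between 1 and len(fps)")
    picked = [first]
    # min_dist[i] is the distance from item i to its nearest picked item.
    min_dist = [tanimoto_distance(fp, fps[first]) for fp in fps]
    min_dist[first] = -1.0  # mark as picked so it is never chosen again
    while len(picked) < n_pick:
        nxt = max(range(len(fps)), key=min_dist.__getitem__)
        picked.append(nxt)
        min_dist[nxt] = -1.0
        for i, fp in enumerate(fps):
            d = tanimoto_distance(fp, fps[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return picked


# Toy data (hypothetical): two tight clusters plus one oddball sharing no bits.
fps = [
    {1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 4, 5},  # cluster A
    {10, 11, 12, 13}, {10, 11, 12, 14},        # cluster B
    {90, 91},                                  # oddball
]
picks = maxmin_pick(fps, 3, first=0)
print(picks)  # [0, 3, 5]
```

With three picks out of six, the oddball (index 5) is chosen before a second member of either cluster. That is the behaviour I was describing: the early selections chase whatever is farthest from everything else.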
@Tristan, can you provide more information on what you want to do with the clustering results?

Pat

On Sun, May 1, 2022 at 10:46 AM Rajarshi Guha <rajarshi.g...@gmail.com> wrote:

> You could consider using FAISS. An example of clustering 2.1M cmpds is
> described at
> http://practicalcheminformatics.blogspot.com/2019/04/clustering-21-million-compounds-for-5.html
>
> On Sun, May 1, 2022 at 9:23 AM Tristan Camilleri <
> tristan.camilleri...@um.edu.mt> wrote:
>
>> Hi,
>>
>> I am attempting to cluster a database of circa 4M small molecules and I
>> have hit several snags.
>> Using BulkTanimoto is not possible due to the resources that are required. I
>> am now working with fpsim2 and chemfp to get a distance matrix (sparse
>> matrix). However, I am finding it very challenging to identify an
>> appropriate clustering algorithm. I have considered both k-medoids and
>> DBSCAN. Each of these has its own limitations: having to state the number of
>> clusters for k-medoids, and not obtaining centroids for DBSCAN.
>>
>> I was wondering whether there is an implementation of the stochastic
>> clustering analysis for clustering purposes, described in
>> https://doi.org/10.1021/ci970056l .
>>
>> Any suggestions on the best method for clustering large datasets, with
>> code suggestions, would be greatly appreciated. I am new to the subject and
>> would appreciate any help.
>>
>> Regards,
>> Tristan
>
> --
> Rajarshi Guha | http://blog.rguha.net | @rguha <https://twitter.com/rguha>
_______________________________________________ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss