You could consider using FAISS. An example of clustering 2.1M compounds is
described at
http://practicalcheminformatics.blogspot.com/2019/04/clustering-21-million-compounds-for-5.html
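
To make the suggestion concrete, here is a minimal sketch of k-means over
fingerprint bit vectors, which is the kind of clustering FAISS provides at
scale. It is plain NumPy on toy data, not FAISS itself, and the function names
(assign, kmeans_fingerprints, representatives) and the init_idx parameter are
made up for illustration. Note that k-means works with Euclidean distance
rather than Tanimoto, and that its centroids are averaged bit vectors, not
real molecules, so the sketch also picks the molecule nearest each centroid
as a cluster representative.

```python
import numpy as np


def assign(x, centroids):
    """Return the index of the nearest centroid (squared Euclidean) per row."""
    d2 = (
        (x ** 2).sum(axis=1)[:, None]
        - 2.0 * x @ centroids.T
        + (centroids ** 2).sum(axis=1)[None, :]
    )
    return d2.argmin(axis=1)


def kmeans_fingerprints(fps, k, n_iter=20, seed=0, init_idx=None):
    """Plain Lloyd's k-means on an (n_mols, n_bits) 0/1 fingerprint matrix."""
    x = np.asarray(fps, dtype=np.float32)
    if init_idx is None:
        # Seed the centroids with k distinct molecules picked at random.
        init_idx = np.random.default_rng(seed).choice(len(x), size=k, replace=False)
    centroids = x[np.asarray(init_idx)].copy()
    for _ in range(n_iter):
        labels = assign(x, centroids)
        for j in range(k):
            members = x[labels == j]
            if len(members):  # an empty cluster keeps its previous centroid
                centroids[j] = members.mean(axis=0)
    return assign(x, centroids), centroids


def representatives(fps, labels, centroids):
    """For each cluster, return the row index of the molecule nearest its centroid.

    Empty clusters get -1.
    """
    x = np.asarray(fps, dtype=np.float32)
    reps = np.full(len(centroids), -1, dtype=np.int64)
    for j in range(len(centroids)):
        idx = np.flatnonzero(labels == j)
        if idx.size:
            d2 = ((x[idx] - centroids[j]) ** 2).sum(axis=1)
            reps[j] = idx[d2.argmin()]
    return reps


# Toy data (hypothetical): two "chemotypes" of 10 molecules each over 64 bits.
n_bits = 64
fps = np.zeros((20, n_bits), dtype=np.uint8)
fps[:10, :32] = 1   # chemotype A sets the first half of the bits
fps[10:, 32:] = 1   # chemotype B sets the second half
for i in range(20):
    fps[i, i] ^= 1  # flip one bit per molecule so that no two rows are identical

# Start from one molecule of each chemotype to keep the demo deterministic.
labels, centroids = kmeans_fingerprints(fps, k=2, init_idx=[0, 19])
reps = representatives(fps, labels, centroids)
print(labels, reps)
```

For millions of compounds the all-pairs distance step in assign is what a
library such as FAISS replaces with an optimized (optionally GPU) version.
The number of clusters still has to be chosen up front, as with k-medoids.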


On Sun, May 1, 2022 at 9:23 AM Tristan Camilleri <
tristan.camilleri...@um.edu.mt> wrote:

> Hi,
>
> I am attempting to cluster a database of circa 4M small molecules and I
> have hit several snags.
> Using BulkTanimoto is not possible due to the resources that are required. I
> am now working with fpsim2 and chemfp to get a distance matrix (sparse
> matrix). However, I am finding it very challenging to identify an
> appropriate clustering algorithm. I have considered both k-medoids and
> DBSCAN. Each has its own limitation: k-medoids requires the number of
> clusters to be stated up front, and DBSCAN does not yield centroids.
>
> I was wondering whether there is an implementation of the stochastic
> clustering analysis for clustering purposes, described in
> https://doi.org/10.1021/ci970056l .
>
> Any suggestions on the best method for clustering large datasets, with
> code suggestions, would be greatly appreciated. I am new to the subject and
> would appreciate any help.
>
> Regards,
> Tristan
>
>
> _______________________________________________
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>


-- 
Rajarshi Guha | http://blog.rguha.net | @rguha <https://twitter.com/rguha>