For me, a lot of this depends on what you intend to do with the
clustering.  If you want to pick a "representative" subset from a larger
dataset, k-means may do the trick.  As Rajarshi mentioned, Practical
Cheminformatics has a k-means implementation that runs with FAISS.
Depending on your goal, choosing a subset with a diversity picker may also
fit the bill.  One annoying aspect of diversity pickers is that the initial
selections tend to consist of strange molecules, since pickers such as
MaxMin favor whatever is least similar to the compounds already chosen.
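
To make the point about strange early picks concrete, here is a minimal
pure-Python sketch of a greedy MaxMin-style picker.  It is only an
illustration: the function names (tanimoto, maxmin_pick) and the toy
fingerprints (sets of on-bit indices) are made up for this example, and it
is far too slow for 4M compounds.  For real work, something like RDKit's
MaxMinPicker would be the more natural choice.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints stored as sets of on-bits."""
    if not a and not b:
        return 1.0
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)


def maxmin_pick(fps, n_pick, first=0):
    """Greedy MaxMin: repeatedly pick the item farthest from everything picked so far.

    Hypothetical helper for illustration; `first` is the index of the seed item.
    """
    if not 0 < n_pick <= len(fps):
        raise ValueError("n_pick must be between 1 and len(fps)")
    picked = [first]
    # min_dist[i] is the distance from item i to its nearest picked item.
    min_dist = [1.0 - tanimoto(fps[first], fp) for fp in fps]
    min_dist[first] = -1.0  # mark as picked so it is never chosen again
    while len(picked) < n_pick:
        nxt = max(range(len(fps)), key=min_dist.__getitem__)
        picked.append(nxt)
        min_dist[nxt] = -1.0
        for i, fp in enumerate(fps):
            d = 1.0 - tanimoto(fps[nxt], fp)
            if d < min_dist[i]:  # never true for picked items (d >= 0 > -1)
                min_dist[i] = d
    return picked


# Toy data (invented): two related families sharing bits {1, 2}, plus one
# outlier that shares nothing with anything else.
fps = [
    {1, 2, 3, 4, 5, 6},      # family A
    {1, 2, 3, 4, 5, 7},      # family A
    {1, 2, 3, 4, 5, 8},      # family A
    {1, 2, 20, 21, 22, 23},  # family B
    {1, 2, 20, 21, 22, 24},  # family B
    {1, 2, 20, 21, 22, 25},  # family B
    {90, 91, 92},            # outlier
]

picks = maxmin_pick(fps, 3)
print(picks)  # -> [0, 6, 3]
```

Starting from a family A member, the outlier (index 6) is selected before
any representative of family B, which is the behavior described above.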

@Tristan, can you provide more information on what you want to do with the
clustering results?


Pat

On Sun, May 1, 2022 at 10:46 AM Rajarshi Guha <rajarshi.g...@gmail.com>
wrote:

> You could consider using FAISS. An example of clustering 2.1M cmpds is
> described at
> http://practicalcheminformatics.blogspot.com/2019/04/clustering-21-million-compounds-for-5.html
>
>
> On Sun, May 1, 2022 at 9:23 AM Tristan Camilleri <
> tristan.camilleri...@um.edu.mt> wrote:
>
>> Hi,
>>
>> I am attempting to cluster a database of circa 4M small molecules and I
>> have hit several snags.
>> Using BulkTanimoto is not possible due to the resources that are required. I
>> am now working with fpsim2 and chemfp to get a distance matrix (sparse
>> matrix). However, I am finding it very challenging to identify an
>> appropriate clustering algorithm. I have considered both k-medoids and
>> DBSCAN. Each of these has its own limitation: k-medoids requires stating
>> the number of clusters up front, and DBSCAN does not yield centroids.
>>
>> I was wondering whether there is an implementation of the stochastic
>> clustering analysis for clustering purposes, described in
>> https://doi.org/10.1021/ci970056l .
>>
>> Any suggestions on the best method for clustering large datasets, with
>> code suggestions, would be greatly appreciated. I am new to the subject and
>> would appreciate any help.
>>
>> Regards,
>> Tristan
>>
>>
>> _______________________________________________
>> Rdkit-discuss mailing list
>> Rdkit-discuss@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>
>
>
> --
> Rajarshi Guha | http://blog.rguha.net | @rguha <https://twitter.com/rguha>
>