Similarity search on a database of 4 million molecules is pretty quick with
chemfp or FPSim2.  Do you need to do the clustering at all?
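
To make concrete what searching the full database directly with a set of
actives could look like, here is a minimal sketch in plain Python. It is not
the chemfp or FPSim2 API: the fingerprints are made-up integers used as
bitsets, and the names (tanimoto, search, mol_a, ...) and the 0.7 threshold
are assumptions for illustration only. Each database molecule is scored by
its best Tanimoto similarity to any active.

```python
# Toy multi-query similarity search. Fingerprints are modeled as Python ints
# used as bitsets (an assumption; real tools use packed binary formats).

def popcount(x: int) -> int:
    """Count the number of set bits in a non-negative integer."""
    return bin(x).count("1")


def tanimoto(a: int, b: int) -> float:
    """Tanimoto coefficient: |A and B| / |A or B|, defined as 0.0 if both are empty."""
    union = popcount(a | b)
    if union == 0:
        return 0.0
    return popcount(a & b) / union


def search(actives, database, threshold=0.7):
    """Return (db_id, best_score) pairs, best first.

    best_score is the highest Tanimoto similarity between the database
    fingerprint and any of the active fingerprints.
    """
    hits = []
    for db_id, fp in database.items():
        best = max(tanimoto(fp, query) for query in actives)
        if best >= threshold:
            hits.append((db_id, best))
    hits.sort(key=lambda hit: hit[1], reverse=True)
    return hits


# Hypothetical 8-bit fingerprints, purely for demonstration.
actives = [0b11110000, 0b00111100]
database = {
    "mol_a": 0b11110000,  # identical to the first active
    "mol_b": 0b00111000,  # shares 3 of 4 bits with the second active
    "mol_c": 0b00000011,  # no bits in common with either active
}

hits = search(actives, database, threshold=0.7)
for db_id, score in hits:
    print(db_id, round(score, 2))
```

This loop is linear in the database size for each active. The dedicated
tools do the same comparison on packed fingerprints with popcount-based
pruning, which is why 4 million molecules is manageable without reducing the
set to cluster representatives first.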

Here are a couple of relevant blog posts.

http://practicalcheminformatics.blogspot.com/2020/10/what-do-molecules-that-look-like-this.html

http://practicalcheminformatics.blogspot.com/2021/09/similarity-search-and-some-cool-pandas.html

Pat



On Sun, May 1, 2022 at 12:12 PM Tristan Camilleri <
tristan.camilleri...@um.edu.mt> wrote:

> Thank you both for the feedback.
>
> My primary aim is to run a ligand-based virtual screening (LBVS)
> experiment (a similarity search) using a set of actives and the dataset
> of cluster representatives.
>
>
>
> On Sun, 1 May 2022, 17:09 Patrick Walters, <wpwalt...@gmail.com> wrote:
>
>> For me, a lot of this depends on what you intend to do with the
>> clustering.  If you want to pick a "representative" subset from a larger
>> dataset, k-means may do the trick.  As Rajarshi mentioned, Practical
>> Cheminformatics has a k-means implementation that runs with FAISS.
>> Depending on your goal, choosing a subset with a diversity picker may fit
>> the bill.  One annoying aspect of diversity pickers is that the initial
>> selections tend to consist of strange molecules.
>>
>> @Tristan, can you provide more information on what you want to do with
>> the clustering results?
>>
>>
>> Pat
>>
>> On Sun, May 1, 2022 at 10:46 AM Rajarshi Guha <rajarshi.g...@gmail.com>
>> wrote:
>>
>>> You could consider using FAISS. An example of clustering 2.1M cmpds is
>>> described at
>>> http://practicalcheminformatics.blogspot.com/2019/04/clustering-21-million-compounds-for-5.html
>>>
>>>
>>> On Sun, May 1, 2022 at 9:23 AM Tristan Camilleri <
>>> tristan.camilleri...@um.edu.mt> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am attempting to cluster a database of circa 4M small molecules and I
>>>> have hit several snags.
>>>> Using BulkTanimoto is not possible because of the resources it requires.
>>>> I am now working with fpsim2 and chemfp to get a sparse distance matrix.
>>>> However, I am finding it very challenging to identify an appropriate
>>>> clustering algorithm. I have considered both k-medoids and DBSCAN. Each
>>>> has its own limitation: k-medoids requires the number of clusters to be
>>>> stated up front, and DBSCAN does not yield centroids.
>>>>
>>>> I was wondering whether there is an implementation of the stochastic
>>>> clustering analysis described in
>>>> https://doi.org/10.1021/ci970056l .
>>>>
>>>> Any suggestions on the best method for clustering large datasets, with
>>>> code suggestions, would be greatly appreciated. I am new to the subject and
>>>> would appreciate any help.
>>>>
>>>> Regards,
>>>> Tristan
>>>>
>>>>
>>>> _______________________________________________
>>>> Rdkit-discuss mailing list
>>>> Rdkit-discuss@lists.sourceforge.net
>>>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>>>
>>>
>>>
>>> --
>>> Rajarshi Guha | http://blog.rguha.net | @rguha
>>> <https://twitter.com/rguha>
>>>
>>>
>>
