Hi,

Has anyone actually done this: clustered >2M compounds using
any well-known clustering algorithm, and would be willing to share code
and some performance statistics?

It's easy to get a sparse distance matrix using chemfp. But if you
take this matrix and feed it into anything in scipy.cluster, you won't
get any results in a reasonable time.
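
For a sense of scale, here is a back-of-the-envelope estimate of what a
dense condensed distance matrix (the one-dimensional upper-triangle form
that scipy.cluster.hierarchy.linkage takes) would need for 2M compounds.
It assumes 8-byte float64 entries; the helper name is made up for this
illustration.

```python
def condensed_matrix_bytes(n_compounds, bytes_per_entry=8):
    """Bytes needed for a dense condensed (upper-triangle) distance matrix."""
    n_pairs = n_compounds * (n_compounds - 1) // 2
    return n_pairs * bytes_per_entry


n = 2_000_000
size_tb = condensed_matrix_bytes(n) / 1e12
print(f"{n:,} compounds -> {size_tb:.1f} TB")  # 2,000,000 compounds -> 16.0 TB
```

A thresholded sparse matrix is far smaller than this, which is why the
sparse form is easy to build but the dense hierarchical routines are out
of reach at this size.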

We also tried to extract the 10 most significant features from the latent
representation described in this paper:
https://arxiv.org/pdf/1610.02415v1.pdf for all compounds in ChEMBL and
then used this web-based tool to generate a visualization:
https://github.com/tensorflow/embedding-projector-standalone but
we didn't get anything useful from it.

My last attempt was to use the sfdp tool from the graphviz package to get
some sort of primitive clustering. I allocated a lot of RAM to the
process, but without any luck.

I would be interested in all kinds of hints related to clustering
millions of compounds, especially using DBSCAN/OPTICS-based clustering
algorithms.
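
To make the DBSCAN part of the question concrete, here is a minimal
sketch of running scikit-learn's DBSCAN directly on a sparse precomputed
distance matrix. The pairs below are a toy stand-in for a thresholded
similarity search (distance taken as 1 - Tanimoto); the values of eps and
min_samples are arbitrary assumptions, and I have not measured how this
behaves on millions of compounds.

```python
from scipy.sparse import csr_matrix
from sklearn.cluster import DBSCAN

# Toy stand-in for a thresholded search result: (i, j, distance) is stored
# only for pairs closer than the search cutoff.
pairs = [
    (0, 1, 0.10), (0, 2, 0.15), (1, 2, 0.12),  # tight group A
    (3, 4, 0.08), (3, 5, 0.11), (4, 5, 0.09),  # tight group B
]
n = 7  # compound 6 has no neighbours within the cutoff

# Build a symmetric sparse matrix holding only the stored distances.
rows, cols, vals = [], [], []
for i, j, d in pairs:
    rows += [i, j]
    cols += [j, i]
    vals += [d, d]
dist = csr_matrix((vals, (rows, cols)), shape=(n, n))

# With metric="precomputed" and sparse input, pairs that are not stored
# are never treated as neighbours.
labels = DBSCAN(eps=0.2, min_samples=3, metric="precomputed").fit_predict(dist)
print(labels)  # -1 marks noise
```

The cutoff used to build the sparse matrix has to be at least eps,
otherwise neighbours inside the eps radius would be missing. Memory then
grows with the number of stored pairs rather than with n squared.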

Regards,

Michał Nowotka

On Mon, Jun 5, 2017 at 9:19 AM, Gonzalo Colmenarejo
<colmenarejo.gonz...@gmail.com> wrote:
> Hi Chris,
>
> as far as I know, Butina's sphere exclusion algorithm is the fastest for
> very large datasets. But if you have 4 million compounds, using RDKit
> directly can result in very long runs, even after parallelization. For that
> number of molecules I think there are faster things, like chemfp (see for
> instance
> https://chemfp.readthedocs.io/en/latest/using-api.html#taylor-butina-clustering).
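
For reference, the core of the Taylor-Butina procedure is small once the
neighbour lists exist. The following is a minimal pure-Python sketch, not
the chemfp or RDKit implementation; the function name and the toy
neighbour lists are made up for illustration. It also shows why the
number of clusters is not an input: it falls out of the similarity
threshold used to build the neighbour lists.

```python
def taylor_butina(neighbors):
    """Cluster from precomputed neighbour lists.

    neighbors: dict mapping compound index -> set of indices within the
    similarity threshold (not including the compound itself).
    Returns a list of (centroid, members) tuples in the order picked.
    """
    # Visit candidates by decreasing neighbour count; ties broken by index.
    order = sorted(neighbors, key=lambda i: (-len(neighbors[i]), i))
    assigned = set()
    clusters = []
    for centroid in order:
        if centroid in assigned:
            continue
        # Only neighbours not already claimed by an earlier centroid join.
        members = {m for m in neighbors[centroid] if m not in assigned}
        assigned.add(centroid)
        assigned.update(members)
        clusters.append((centroid, sorted(members)))
    return clusters


toy_neighbors = {
    0: {1, 2, 3},
    1: {0, 2},
    2: {0, 1},
    3: {0, 4},
    4: {3},
    5: set(),
}
clusters = taylor_butina(toy_neighbors)
for centroid, members in clusters:
    print(centroid, members)
```

Here compounds 4 and 5 end up as singletons: 4's only neighbour was
already claimed by centroid 0, and 5 has no neighbours at all.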
>
> Cheers
>
> Gonzalo
>
> On Sun, Jun 4, 2017 at 3:12 PM, Maciek Wójcikowski <mac...@wojcikowski.pl>
> wrote:
>>
>> Is there a big difference in the quality of the final dataset between
>> K-means and random under-sampling of a big database (~20M)?
>>
>> ----
>> Pozdrawiam,  |  Best regards,
>> Maciek Wójcikowski
>> mac...@wojcikowski.pl
>>
>> 2017-06-04 12:24 GMT+02:00 Samo Turk <samo.t...@gmail.com>:
>>>
>>> Hi Chris,
>>>
>>> There are other options for clustering. According to this:
>>> http://hdbscan.readthedocs.io/en/latest/performance_and_scalability.html
>>> HDBSCAN and K-means scale well. HDBSCAN will find clusters based on
>>> density and it also allows for outliers, but it can be fiddly to find the
>>> right parameters. You cannot specify the number of clusters (as in the
>>> Butina case). If you want to specify the number of clusters, you can
>>> simply use K-means. The high dimensionality of fingerprints might be a
>>> problem for memory consumption. In this case you can use PCA to reduce
>>> the dimensions to something manageable. To avoid memory issues with PCA
>>> and speed things up, I would fit the model on a random 100k compounds
>>> and then just use the transform method (not fit_transform, which would
>>> refit the model) on the rest.
>>> http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
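
The fit-on-a-subset idea can be sketched as follows. This is only an
illustration with random bits standing in for fingerprints; the array
sizes, the number of components, and the chunk size are arbitrary
assumptions. The model is fitted once on the random subset, and the rest
is projected with transform() rather than fit_transform(), which would
refit the model.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Stand-in for a fingerprint matrix: rows are compounds, columns are bits.
# Sizes are deliberately tiny; the real case would have millions of rows.
fingerprints = rng.integers(0, 2, size=(5000, 256)).astype(np.float32)

# Fit the PCA model on a random subset only.
subset_idx = rng.choice(len(fingerprints), size=1000, replace=False)
pca = PCA(n_components=20, random_state=0)
pca.fit(fingerprints[subset_idx])

# Project everything in chunks with transform(), so that only one chunk
# needs to be processed at a time.
chunk = 1000
reduced = np.vstack([
    pca.transform(fingerprints[start:start + chunk])
    for start in range(0, len(fingerprints), chunk)
])
print(reduced.shape)  # (5000, 20)
```

The reduced array can then be passed to K-means or HDBSCAN in place of
the full fingerprints.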
>>>
>>> Cheers,
>>> Samo
>>>
>>> On Sun, Jun 4, 2017 at 9:08 AM, Chris Swain <sw...@mac.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I want to do clustering on around 4 million structures
>>>>
>>>> The Rdkit cookbook (http://www.rdkit.org/docs/Cookbook.html) suggests
>>>>
>>>> "For large sets of molecules (more than 1000-2000), it’s most efficient
>>>> to use the Butina clustering algorithm"
>>>>
>>>>  However it is quite a step up from a few thousand to several million
>>>> and I wondered if anyone had used this algorithm on larger data sets?
>>>>
>>>> As far as I can tell it is not possible to define the number of
>>>> clusters, is this correct?
>>>>
>>>> Cheers,
>>>>
>>>> Chris
>>>>
>>>>
>>>> ------------------------------------------------------------------------------
>>>> Check out the vibrant tech community on one of the world's most
>>>> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
>>>> _______________________________________________
>>>> Rdkit-discuss mailing list
>>>> Rdkit-discuss@lists.sourceforge.net
>>>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>>>

