Is there a big difference in the quality of the final dataset between
K-means and random under-sampling of a big database (~20M)?

----
Best regards,
Maciek Wójcikowski
mac...@wojcikowski.pl

2017-06-04 12:24 GMT+02:00 Samo Turk <samo.t...@gmail.com>:

> Hi Chris,
>
> There are other options for clustering. According to this:
> http://hdbscan.readthedocs.io/en/latest/performance_and_scalability.html
> HDBSCAN and K-means scale well. HDBSCAN finds clusters based on density
> and allows for outliers, but finding the right parameters can be fiddly.
> As with Butina, you cannot specify the number of clusters. If you want to
> specify the number of clusters, you can simply use K-means. The high
> dimensionality of fingerprints might be a problem for memory consumption.
> In that case you can use PCA to reduce the dimensions to something
> manageable. To avoid memory issues with PCA and to speed things up, I would
> fit the model on a random 100k compounds and then just use the transform
> method (not fit_transform, which would refit the model) on the rest.
> http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
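
The workflow described above (fit PCA on a random subset, project everything
else, then run K-means with a chosen number of clusters) could be sketched
roughly as follows. This is only an illustration under stated assumptions:
it uses NumPy and scikit-learn, the random bit vectors stand in for real
fingerprints, and the sizes (5000 rows, a 1000-row subset, 20 components,
50 clusters) are made-up values chosen to keep the example small.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical stand-in for real fingerprints: 5000 random 256-bit vectors.
fps = rng.integers(0, 2, size=(5000, 256)).astype(np.float32)

# Fit PCA on a random subset only (100k compounds in the advice above,
# 1000 rows here to keep the sketch fast).
subset_idx = rng.choice(len(fps), size=1000, replace=False)
pca = PCA(n_components=20, random_state=0)
pca.fit(fps[subset_idx])

# Project the full set chunk by chunk with transform(), which reuses the
# fitted components; fit_transform() would refit the model on every chunk.
chunk = 1000
reduced = np.vstack(
    [pca.transform(fps[i:i + chunk]) for i in range(0, len(fps), chunk)]
)

# K-means with an explicitly chosen number of clusters.
n_clusters = 50
km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
labels = km.fit_predict(reduced)
```

For millions of compounds the chunked projection keeps only one chunk of
full-length fingerprints in memory at a time if the chunks are read from
disk. scikit-learn also provides MiniBatchKMeans, which may be worth trying
at that scale, though that is a substitution and not what was suggested above.
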
>
> Cheers,
> Samo
>
> On Sun, Jun 4, 2017 at 9:08 AM, Chris Swain <sw...@mac.com> wrote:
>
>> Hi,
>>
>> I want to do clustering on around 4 million structures.
>>
>> The RDKit Cookbook (http://www.rdkit.org/docs/Cookbook.html) suggests:
>>
>> "For large sets of molecules (more than 1000-2000), it’s most efficient
>> to use the Butina clustering algorithm"
>>
>> However, it is quite a step up from a few thousand to several million, and
>> I wondered if anyone had used this algorithm on larger data sets?
>>
>> As far as I can tell it is not possible to define the number of clusters.
>> Is this correct?
>>
>> Cheers,
>>
>> Chris
>>
>> ------------------------------------------------------------
>> ------------------
>> Check out the vibrant tech community on one of the world's most
>> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
>> _______________________________________________
>> Rdkit-discuss mailing list
>> Rdkit-discuss@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>
>>
>
