Hi Michal,

I have done this a couple of times for compound sets up to 10M+ using a
simplified variant of the Taylor-Butina algorithm. The overall run time
was in the range of hours to a few days (which could probably be
optimized, but was fast enough for me).

As you correctly mentioned, getting the (sparse) similarity matrix is
fairly simple (and can be done in parallel on a cluster). Unfortunately,
this matrix gets very large (even the sparse version). Most clustering
algorithms require random access to the matrix, so you have to keep it
in main memory (which then has to be huge) or calculate it on-the-fly
(takes forever).

My implementation (in C++, not sure if I can share it) assumes that the
similarity matrix has been pre-calculated and is stored in one (or
multiple) files. It reads these files sequentially and whenever a
compound pair with a similarity beyond the threshold is found, it checks
whether one of the cpds. is already a centroid (in which case the other
is assigned to it). Otherwise, one of the compounds is randomly chosen
as centroid and the other is assigned to it.

This procedure is highly order-dependent and thus not optimal, but has
to read the whole similarity matrix only once and has limited memory
consumption (you only need to keep a list of centroids). If you still
run into memory issues, you can start by clustering with a high
similarity threshold and then re-cluster centroids and singletons on a
lower threshold level.

I also played around with DBSCAN for large compound databases, but (as
previously mentioned by Samo) found it difficult to find the right
parameters and ended up with a single huge cluster covering 90 percent
of the database in many cases.

Hope this helps,
Nils

Am 05.06.2017 um 11:02 schrieb Michał Nowotka:
> Is there anyone who actually done this: clustered >2M compounds using
> any well-known clustering algorithm and is willing to share a code and
> some performance statistics?


------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

Reply via email to