Re: [Rdkit-discuss] Clustering

2022-05-02 Thread Tristan Camilleri
package > for that https://igraph.org/r/, https://kateto.net/netscix2016.html). > > > > Hope this helps. > > Giovanni > > > > *From:* Tristan Camilleri > *Sent:* 02 May 2022 07:03 > *To:* Patrick Walters > *Cc:* RDKit Discuss > *Subject:* Re: [Rdkit

Re: [Rdkit-discuss] Clustering

2022-05-02 Thread Giovanni Tricarico
ot 'linked') you can do graph representations and even clustering, e.g. using igraph (again, I can only suggest an R package for that https://igraph.org/r/, https://kateto.net/netscix2016.html). Hope this helps. Giovanni From: Tristan Camilleri Sent: 02 May 2022 07:03 To: Pa

Re: [Rdkit-discuss] Clustering

2022-05-01 Thread Tristan Camilleri
Thanks for the feedback. Rather than an explicit need to perform clustering, it is more for me to learn how to do it. Any pointers to this effect would be greatly appreciated. Tristan On Sun, 1 May 2022 at 18:18, Patrick Walters wrote: > Similarity search on a database of 4 million is pretty q

Re: [Rdkit-discuss] Clustering

2022-05-01 Thread Tristan Camilleri
Thank you both for the feedback. My primary aim is to run an LBVS experiment (similarity search) using a set of actives and the dataset of cluster representatives. On Sun, 1 May 2022, 17:09 Patrick Walters, wrote: > For me, a lot of this depends on what you intend to do with the > clustering.
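
A minimal sketch of the similarity-search step mentioned here, assuming fingerprints are held as plain Python sets of on-bit indices and that each cluster representative is scored by its best Tanimoto similarity to any active. The names `rank_by_max_similarity`, `actives` and `library`, and the toy data, are hypothetical; in practice the fingerprints would come from a toolkit and the search from something like chemfp or FPSim2.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints stored as sets of on-bit indices."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def rank_by_max_similarity(actives, library):
    """Score each library entry by its best similarity to any active, best first."""
    scored = [
        (max(tanimoto(fp, active) for active in actives), name)
        for name, fp in library.items()
    ]
    return sorted(scored, reverse=True)


# Toy fingerprints (hypothetical on-bit indices, not real molecules).
actives = [{1, 2, 3, 4}, {10, 11, 12}]
library = {
    "rep_A": {1, 2, 3, 9},
    "rep_B": {10, 11, 12},
    "rep_C": {20, 21},
}

ranking = rank_by_max_similarity(actives, library)
for score, name in ranking:
    print(f"{name}: {score:.2f}")
```

Taking the maximum over the actives is only one possible fusion rule; averaging the similarities is an equally simple alternative.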

Re: [Rdkit-discuss] Clustering

2022-05-01 Thread Patrick Walters
Similarity search on a database of 4 million is pretty quick with chemfp or FPSim2. Do you need to do the clustering? Here are a couple of relevant blog posts. http://practicalcheminformatics.blogspot.com/2020/10/what-do-molecules-that-look-like-this.html http://practicalcheminformatics.blogspo

Re: [Rdkit-discuss] Clustering

2022-05-01 Thread Patrick Walters
For me, a lot of this depends on what you intend to do with the clustering. If you want to pick a "representative" subset from a larger dataset, k-means may do the trick. As Rajarshi mentioned, Practical Cheminformatics has a k-means implementation that runs with FAISS. Depending on your goal, ch

Re: [Rdkit-discuss] Clustering

2022-05-01 Thread Rajarshi Guha
You could consider using FAISS. An example of clustering 2.1M cmpds is described at http://practicalcheminformatics.blogspot.com/2019/04/clustering-21-million-compounds-for-5.html On Sun, May 1, 2022 at 9:23 AM Tristan Camilleri < tristan.camilleri...@um.edu.mt> wrote: > Hi, > > I am attempting

Re: [Rdkit-discuss] Clustering changes conformer ID?

2019-12-04 Thread topgunhaides .
Hi Greg, Thanks for the help! Sorry for the confusion. I was trying to get a symmetric RMS matrix using GetBestRMS, because GetConformerRMSMatrix uses the standard RMS method without considering symmetry. A further question: is it possible to include the "GetBestRMS" option for "EmbedMultipleConfs

Re: [Rdkit-discuss] Clustering changes conformer ID?

2019-12-02 Thread Greg Landrum
Hi Leon, I'm not sure I understand the question. The clustering code returns a tuple of indices of the clusters. Those indices are relative to the indexing of the distance matrix. The `ClusterData` function doesn't know what you're clustering, so there's no way it could know anything about cluster
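
A small sketch of the point about indexing, assuming the distance matrix is passed as a flattened lower triangle and that the conformer IDs no longer coincide with positions 0..n-1. The helper names `lower_triangle` and `to_conformer_ids` and the example values are hypothetical; the distance function stands in for whatever pairwise measure is used (for example a symmetry-aware RMS).

```python
def lower_triangle(items, distance):
    """Flatten pairwise distances in the order (1,0), (2,0), (2,1), (3,0), ..."""
    return [
        distance(items[i], items[j])
        for i in range(1, len(items))
        for j in range(i)
    ]


def to_conformer_ids(clusters, conf_ids):
    """Translate matrix-relative indices into the IDs of the clustered objects."""
    return tuple(tuple(conf_ids[i] for i in cluster) for cluster in clusters)


# Hypothetical conformer IDs left after an earlier filtering step.
conf_ids = [0, 3, 4, 7]

# Hypothetical clustering result, expressed as positions in the distance matrix.
clusters = ((2, 0), (1,), (3,))

print(to_conformer_ids(clusters, conf_ids))
```

Keeping the list of IDs in the same order used to build the matrix is what makes the translation safe.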

Re: [Rdkit-discuss] Clustering

2017-06-14 Thread Andrew Dalke
Following up on myself, On Jun 6, 2017, at 04:00, Andrew Dalke wrote: > I've fleshed out that algorithm so it's a command-line program that can be > used for benchmarking purposes. It's available from > http://dalkescientific.com/writings/taylor_butina.py . > > If anyone uses it for benchmarki

Re: [Rdkit-discuss] Clustering

2017-06-12 Thread Peter S. Shenkin
" A clustering algorithm, that does not require specifying the number of classes upfront (so not K-means)." A general approach to O(N) hierarchical clustering is: 1. Pick a random sqrt(N) structures. 2. Do full hierarchical O(N^2) clustering on these. 3. Select your favored clustering level to de

Re: [Rdkit-discuss] Clustering

2017-06-12 Thread Michał Nowotka
Hi, Thanks for all the answers, especially those pointing to code examples, very useful. I should be more specific when asking about clustering >2M compounds. An example I would like to see would use: 1. A clustering algorithm, that does not require specifying the number of classes upfront (so n

Re: [Rdkit-discuss] Clustering

2017-06-11 Thread Samo Turk
Hi All, I have to admit I was commenting about PCA->k-means without actually trying it. Out of curiosity I implemented it here: https://github.com/samoturk/cheminf-notebooks/tree/master/Python#pca-k-meanspy It can process 4M compounds in ~60 minutes on a desktop i5 and it should work with 16 GB of RAM.

Re: [Rdkit-discuss] Clustering

2017-06-06 Thread Abhik Seal
Hello all, How about doing some dimensionality reduction using PCA or t-SNE and then running the clustering on a selection of the top components, say the top 20? I think the clustering would then be fast. Thanks Abhik On Mon, Jun 5, 2017 at 6:11 AM David Cosgrove wrote: > Hi, > I have used this algorithm fo

Re: [Rdkit-discuss] Clustering

2017-06-05 Thread Andrew Dalke
On Jun 5, 2017, at 11:02, Michał Nowotka wrote: > Is there anyone who has actually done this: clustered >2M compounds using > any well-known clustering algorithm and is willing to share code and > some performance statistics? Yes. People regularly use chemfp (http://chemfp.com/) to cluster >2M comp

Re: [Rdkit-discuss] Clustering

2017-06-05 Thread David Cosgrove
Hi, I have used this algorithm for many years clustering sets of several millions of compounds. Indeed, I am old enough to know it as the Taylor algorithm. It is slow but reliable. A crucial setting is the similarity threshold for the clusters, which dictates the size of the neighbour lists and
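
A minimal, unoptimised sketch of the Taylor-Butina idea discussed here, to show how the similarity threshold drives the neighbour lists. It assumes fingerprints are plain sets of on-bit indices and uses the simple variant that ranks candidates by neighbour count once; the function names and toy data are hypothetical, and it is not the code of any particular package.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints stored as sets of on-bit indices."""
    shared = len(a & b)
    union = len(a) + len(b) - shared
    return shared / union if union else 0.0


def taylor_butina(fps, threshold):
    """Return clusters as lists of indices, each led by its centroid."""
    n = len(fps)
    # Neighbour lists: every pair at or above the similarity threshold.
    neighbours = [set() for _ in range(n)]
    for i in range(1, n):
        for j in range(i):
            if tanimoto(fps[i], fps[j]) >= threshold:
                neighbours[i].add(j)
                neighbours[j].add(i)

    # Candidates with the most neighbours become centroids first.
    order = sorted(range(n), key=lambda i: (-len(neighbours[i]), i))
    assigned = set()
    clusters = []
    for centroid in order:
        if centroid in assigned:
            continue
        members = [centroid] + sorted(neighbours[centroid] - assigned)
        assigned.update(members)
        clusters.append(members)
    return clusters


# Toy fingerprints (hypothetical on-bit indices).
fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 6}, {7, 8, 9}, {7, 8, 10}, {20}]
print(taylor_butina(fps, 0.5))
```

Raising the threshold shortens every neighbour list, which reduces memory use but produces more singletons; with these toy fingerprints a threshold of 0.7 leaves every item in its own cluster.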

Re: [Rdkit-discuss] Clustering

2017-06-05 Thread Nils Weskamp
Hi Michal, I have done this a couple of times for compound sets up to 10M+ using a simplified variant of the Taylor-Butina algorithm. The overall run time was in the range of hours to a few days (which could probably be optimized, but was fast enough for me). As you correctly mentioned, getting t

Re: [Rdkit-discuss] Clustering

2017-06-05 Thread Chris Swain
Hi, I’m just starting, but I can add another example. I tried the Butina clustering as described in the Cookbook (http://www.rdkit.org/docs/Cookbook.html) using a Jupyter Notebook. It worked fine on data sets of < 10,000 molecules, but the kernel crashed when I tr

Re: [Rdkit-discuss] Clustering

2017-06-05 Thread Michał Nowotka
Hi, Is there anyone who has actually done this: clustered >2M compounds using any well-known clustering algorithm and is willing to share code and some performance statistics? It's easy to get a sparse distance matrix using chemfp. But if you take this matrix and feed it into any scipy.cluster you

Re: [Rdkit-discuss] Clustering

2017-06-05 Thread Gonzalo Colmenarejo
Hi Chris, as far as I know, Butina's sphere exclusion algorithm is the fastest for very large datasets. But if you have 4 million compounds, using RDKit directly can result in very long runs, even after parallelization. For that number of molecules I think there are faster things, like chemfp (se

Re: [Rdkit-discuss] Clustering

2017-06-04 Thread Maciek Wójcikowski
Is there a big difference in the quality of the final dataset between K-means and random under-sampling of big database (~20M)? Pozdrawiam, | Best regards, Maciek Wójcikowski mac...@wojcikowski.pl 2017-06-04 12:24 GMT+02:00 Samo Turk : > Hi Chris, > > There are other options for clusterin

Re: [Rdkit-discuss] Clustering

2017-06-04 Thread Samo Turk
Hi Chris, There are other options for clustering. According to this: http://hdbscan.readthedocs.io/en/latest/performance_and_scalability.html HDBSCAN and K-means scale well. HDBSCAN will find clusters based on density and it also allows for outliers, but finding the right parameters can be fiddly.

Re: [Rdkit-discuss] Clustering - visualization?

2016-05-14 Thread Robert DeLisle
Thanks, Curt! I'll give those a look. It'll give me a very good reason to start digging into SciPy a bit more and exploit the added functionality that will bring. Regarding my original question and for anyone else that might be interested... I did indeed find an answer through a lot of code dre

Re: [Rdkit-discuss] Clustering - visualization?

2016-05-14 Thread Curt Fischer
Hi Robert, For the number of molecules you are interested in, it's viable to use SciPy / NumPy clustering functions instead of RDKit's built-in C-linked functions. This approach will probably not be as fast as RDKit's built-in clustering functionalities, and will probably not scale to tens of thousa

Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-28 Thread Maciek Wójcikowski
One small note from me: I would still use an aggregate function other than sum to get a binary FP: np.reshape(fpa, (4, -1)).any(axis=0) I guess it doesn't change a thing with Tanimoto, but if you try other distances you can get unexpected results (assuming there are bit clashes). Pozd
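
A dependency-free sketch of the difference between the two folding expressions in this thread, assuming a toy 16-bit fingerprint folded into 4 bits. The helper `fold` is hypothetical; it mimics reshaping into `parts` rows and combining each column, once with `sum` and once with `any`.

```python
def fold(bits, parts, combine):
    """Split `bits` into `parts` equal chunks and combine them position by position."""
    width = len(bits) // parts
    chunks = [bits[k * width:(k + 1) * width] for k in range(parts)]
    return [combine(column) for column in zip(*chunks)]


# Toy 16-bit fingerprint; bits 1 and 5 land on the same position after folding.
fp = [0, 1, 0, 0,
      0, 1, 0, 0,
      0, 0, 1, 0,
      0, 0, 0, 0]

summed = fold(fp, 4, sum)                        # counts, like summing over axis 0
binary = fold(fp, 4, lambda col: int(any(col)))  # 0/1 values, like any over axis 0

print(summed)
print(binary)
```

Where two set bits fall on the same folded position, the summed version holds a count greater than one, so it is no longer a binary fingerprint; metrics that assume 0/1 values may then behave differently.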

Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-28 Thread Jing Lu
Thanks, Greg, Yes, scikit-learn will automatically promote to arrays of float with the check_array() function. What I am currently doing is:
fpa = numpy.zeros((len(fp),), numpy.double)
DataStructs.ConvertToNumpyArray(fp, fpa)
np.sum(np.reshape(fpa, (4, -1)), axis=0)
Is this the same as FoldFingerpr

Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-28 Thread Greg Landrum
If that doesn't help (and it may not since some Scikit-Learn functions automatically promote their arguments to arrays of doubles), you can always just generate a shorter fingerprint from the beginning (all the fingerprinting functions take an optional argument for this) or fold the existing finger

Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-27 Thread Maciek Wójcikowski
Hi Jing, Most fingerprints are binary and can thus be stored as np.bool_, which takes one byte per element and so should be 8 times more memory efficient than double (64 times if the bits are packed, e.g. with np.packbits). Best, Maciej Pozdrawiam, | Best regards, Maciek Wójcikowski mac...@wojcikowski.pl 2015-08-27 16:15 GMT+02:00 Jing Lu : > Hi Greg, > > Thanks! It work
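
A worked example of the storage arithmetic, assuming 1,000,000 fingerprints of 2048 bits each (figures chosen for illustration). Note that NumPy's `bool_` takes one byte per element, so a 64-fold saving over doubles needs bit-packed storage.

```python
n_mols, n_bits = 1_000_000, 2048  # illustrative sizes

bytes_float64 = n_mols * n_bits * 8   # one 8-byte double per bit
bytes_bool = n_mols * n_bits * 1      # one byte per element for bool_
bytes_packed = n_mols * n_bits // 8   # eight bits per byte when bit-packed

for label, size in [
    ("float64", bytes_float64),
    ("bool_", bytes_bool),
    ("packed", bytes_packed),
]:
    print(f"{label:8s} {size / 1e9:6.2f} GB")
```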

Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-27 Thread Jing Lu
Hi Greg, Thanks! It works! But is it possible to fold the fingerprint to a smaller size? np.zeros((100,2048)) still takes a lot of memory... Best, Jing On Wed, Aug 26, 2015 at 11:02 PM, Greg Landrum wrote: > > On Thu, Aug 27, 2015 at 3:00 AM, Jing Lu wrote: > >> >> So, I wonder is there

Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-26 Thread Greg Landrum
On Thu, Aug 27, 2015 at 3:00 AM, Jing Lu wrote:
> So, I wonder is there any way to convert fingerprint to a numpy vector?
Indeed there is:
In [11]: from rdkit import Chem
In [12]: from rdkit import DataStructs
In [13]: import numpy
In [14]: m = Chem.MolFromSmiles('C1CCC1')
In [15]: fp =

Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-26 Thread Jing Lu
Sorry to bother again... Now, the most time-consuming part is the clustering. Generating the fingerprints takes less than 1 h, but the clustering has already taken more than 30 h, and I am not sure when it will finish. Currently, I use scikit-learn's DBSCAN, which has time comp

Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-23 Thread Andrew Dalke
On Aug 23, 2015, at 6:38 PM, Jing Lu wrote: > I hope the memory issue won't be a problem. That's up to you and your choice of threshold. > Most AgglomerativeClustering algorithms have time complexity with N^2. Will > that be a problem? You have to decide for yourself what counts as a problem.

Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-23 Thread Dimitri Maziuk
On 08/23/2015 11:38 AM, Jing Lu wrote: > Thanks, Andrew! > > Yes, I was thinking about using scikit-learn also. But I guess I need to > use a data structure for sparse matrix and define a function for > connectivity. I hope the memory issue won't be a problem. > Most AgglomerativeClustering algori

Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-23 Thread Jing Lu
Thanks, Takayuki, Both Repeated Bisection clustering and K-means clustering need the number of clusters as input, right? Best, Jing On Sun, Aug 23, 2015 at 1:17 AM, Taka Seri wrote: > Dear Jing, > > How about trying bayon? > https://code.google.com/p/bayon/ > It's no

Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-23 Thread Jing Lu
Thanks, Andrew! Yes, I was thinking about using scikit-learn also. But I guess I need to use a data structure for sparse matrix and define a function for connectivity. I hope the memory issue won't be a problem. Most AgglomerativeClustering algorithms have time complexity with N^2. Will that be a

Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-23 Thread Andrew Dalke
On Aug 23, 2015, at 3:43 AM, Jing Lu wrote: > If I want to cluster more than 1M molecules by ECFP4. How could I do it? If I > calculate the distance between every pair of molecules, the size of distance > matrix will be too big. Does RDKit support any heuristic clustering algorithm > without cal
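
A quick calculation of why the full distance matrix becomes unmanageable, assuming one 8-byte value per unique pair (the byte size is an assumption; 4-byte floats would halve the figures). The helper name `condensed_matrix_bytes` is hypothetical.

```python
def condensed_matrix_bytes(n, bytes_per_value=8):
    """Storage for the n*(n-1)/2 unique pairwise distances among n items."""
    return n * (n - 1) // 2 * bytes_per_value


for n in (10_000, 1_000_000):
    print(f"{n:>9,d} molecules: {condensed_matrix_bytes(n) / 1e9:,.1f} GB")
```

The quadratic growth is the reason the replies in this thread point towards thresholded (sparse) neighbour lists or methods that never build the full matrix.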

Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-22 Thread Taka Seri
Dear Jing, How about trying bayon? https://code.google.com/p/bayon/ It's not part of RDKit, but I think the library can cluster molecules using ECFP4. Unfortunately, bayon's input file format is not a distance matrix, but the format is easy to prepare. Best regards. Takayuki 2015年8

Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-22 Thread Jing Lu
Currently, I prefer fingerprint based clustering, because it's hard to set the cutoff for scaffold based clustering. Does RDKit have scaffold based clustering? On Sat, Aug 22, 2015 at 10:56 PM, wrote: > Hi, how about scaffold based clustering . You extract the scaffolds and > then cluster it and

Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-22 Thread abhik1368
Hi, how about scaffold-based clustering? You extract the scaffolds, cluster them, and then put the compounds belonging to each scaffold inside its cluster. Sent from my iPhone > On Aug 22, 2015, at 8:43 PM, Jing Lu wrote: > > Dear RDKit users, > > If I want to cluster more than 1M molecules
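
A sketch of the simplest form of this idea, where each distinct scaffold defines one group. It assumes the scaffold of every compound has already been computed as a string (for instance with RDKit's MurckoScaffold module); the compound names and scaffold SMILES below are hypothetical, and the further step of clustering the scaffolds themselves is left out.

```python
from collections import defaultdict


def group_by_scaffold(records):
    """Group compound names by their precomputed scaffold string."""
    groups = defaultdict(list)
    for name, scaffold in records:
        groups[scaffold].append(name)
    return dict(groups)


# Hypothetical (compound name, scaffold SMILES) pairs.
records = [
    ("cmpd_1", "c1ccccc1"),
    ("cmpd_2", "c1ccncc1"),
    ("cmpd_3", "c1ccccc1"),
]

clusters = group_by_scaffold(records)
for scaffold, names in clusters.items():
    print(scaffold, names)
```

No similarity cutoff is needed for this grouping, at the price of treating closely related scaffolds as unrelated.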

Re: [Rdkit-discuss] Clustering in RDKit (take 2 - missing wiki link)

2015-04-14 Thread JP
This is now at: https://github.com/rdkit/rdkit/blob/master/Docs/Book/Cookbook.rst - Jean-Paul Ebejer Early Stage Researcher On 11 April 2015 at 10:46, JP wrote: > Hi RDKitters! > > I have a bit of python RDKit clustering code using Butina which is > commented with: > # Ripped off from https://c

Re: [Rdkit-discuss] Clustering functions in Java API

2015-02-23 Thread Greg Landrum
On Mon, Feb 23, 2015 at 4:24 PM, Patrick Walters wrote: > I agree that there are plenty of implementations of clustering, machine > learning, etc. It would be better for the RDKit developers to focus on > cheminformatics. This being said, there are some opportunities for domain > specific perf

Re: [Rdkit-discuss] Clustering functions in Java API

2015-02-23 Thread Patrick Walters
I agree that there are plenty of implementations of clustering, machine learning, etc. It would be better for the RDKit developers to focus on cheminformatics. This being said, there are some opportunities for domain specific performance enhancement. One of the slow steps in many clustering alg

Re: [Rdkit-discuss] Clustering functions in Java API

2015-02-23 Thread Maciek Wójcikowski
Hello, If interested in clustering in python I can recommend, as usual, sklearn: http://scikit-learn.org/stable/modules/clustering.html It's pretty much all you should need. Have fun! Pozdrawiam, | Best regards, Maciek Wójcikowski mac...@wojcikowski.pl 2015-02-23 11:43 GMT+01:00 Anthony B

Re: [Rdkit-discuss] Clustering functions in Java API

2015-02-23 Thread Anthony Bradley
Hi Anthony, On Sun, Feb 22, 2015 at 11:03 AM, Anthony Bradley <anthony.brad...@worc.ox.ac.uk> wrote: Hi all, I am currently working with RDKit from the Java API (well jython actually). As has been discussed most of the documentation for this is found by trawling: Code/JavaWrappers/gmwra

Re: [Rdkit-discuss] Clustering functions in Java API

2015-02-23 Thread Greg Landrum
Hi Anthony, On Sun, Feb 22, 2015 at 11:03 AM, Anthony Bradley < anthony.brad...@worc.ox.ac.uk> wrote: > Hi all, > > I am currently working with RDKit from the Java API (well jython actually). > > As has been discussed most of the documentation for this is found by > trawling: > > Code/JavaWrapper