Hi Robert,
For the number of molecules you are interested in, it's viable to use SciPy
/ NumPy clustering functions instead of rdkit's built-in C-linked
functions. This approach will probably not be as fast as rdkit's built-in
clustering functionality, and will probably not scale to tens of thousands
of molecules as well as rdkit's functions do, but if you already use SciPy
or NumPy for other technical computing, it may be more transparent,
generalizable, and easier to use.
I have an example Jupyter notebook in GitHub that describes what I mean;
here are the GitHub and nbviewer links:
https://github.com/tentrillion/ipython_notebooks/blob/master/chemical_similarity_in_python.ipynb
https://nbviewer.jupyter.org/github/tentrillion/ipython_notebooks/blob/master/chemical_similarity_in_python.ipynb
Here are some of the most important parts of the code for generating a
dendrogram.
1. Generate a numpy fingerprint matrix from a list of rdkit Molecules.

    import numpy as np
    from rdkit import Chem
    from rdkit.Chem import rdmolops

    mols = [Chem.MolFromSmiles(smiles) for smiles in smiles_list]
    # One boolean row per molecule; vstack needs a list, not a generator.
    fingerprint_mat = np.vstack([
        np.asarray(rdmolops.RDKFingerprint(mol, fpSize=2048), dtype=bool)
        for mol in mols])

2. Generate the distance matrix. *pdist* and *squareform* are from
*scipy.spatial.distance*; *pd* is *pandas*.

    dist_mat = pdist(fingerprint_mat, 'jaccard')
    dist_df = pd.DataFrame(squareform(dist_mat), index=smiles_list,
                           columns=smiles_list)

As far as I can tell, the Jaccard distance is equivalent to one minus the
Tanimoto similarity.
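
Here is a quick way to check that equivalence on binary vectors. This is
only a minimal sketch: the two 8-bit "fingerprints" are made up for
illustration, and tanimoto_similarity is a hypothetical helper written
here, not an rdkit or SciPy function.

```python
import numpy as np
from scipy.spatial.distance import jaccard, pdist


def tanimoto_similarity(a, b):
    """Tanimoto similarity of two boolean vectors: |a AND b| / |a OR b|."""
    both = np.logical_and(a, b).sum()
    either = np.logical_or(a, b).sum()
    return both / either


# Two made-up 8-bit "fingerprints" (illustrative only, not real molecules).
fp_a = np.array([1, 1, 0, 1, 0, 0, 1, 0], dtype=bool)
fp_b = np.array([1, 0, 0, 1, 1, 0, 1, 0], dtype=bool)

tanimoto = tanimoto_similarity(fp_a, fp_b)

# SciPy's Jaccard distance for a single pair of boolean vectors.
jaccard_dist = jaccard(fp_a, fp_b)

# The same distance as computed by pdist on a two-row matrix.
pdist_value = pdist(np.vstack([fp_a, fp_b]), 'jaccard')[0]

print(tanimoto, jaccard_dist, pdist_value)
```

For these two vectors, three bits are set in both and five bits are set in
at least one, so the Tanimoto similarity is 3/5 and the Jaccard distance
is 2/5.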
3. Perform hierarchical clustering on the distance matrix and show the
dendrogram (see the GitHub notebook for the plot). *hc* is
*scipy.cluster.hierarchy* and *plt* is *matplotlib.pyplot*.

    z = hc.linkage(dist_mat)
    dendrogram = hc.dendrogram(z, labels=dist_df.columns, leaf_rotation=90)
    plt.show()

A helpful page for dendrograms using SciPy is this one:
https://joernhees.de/blog/2015/08/26/scipy-hierarchical-clustering-and-dendrogram-tutorial/
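
If you also want flat cluster assignments from the same linkage, one
option is to cut the tree with fcluster. This is a sketch under some
assumptions: the fingerprint matrix is a made-up toy example, and the
'average' linkage method and the 0.5 distance threshold are arbitrary
choices for illustration (the linkage call above uses SciPy's default
method).

```python
import numpy as np
import scipy.cluster.hierarchy as hc
from scipy.spatial.distance import pdist

# Made-up boolean "fingerprints": rows 0-1 share bits, rows 2-3 share bits,
# and the two pairs have no bits in common.
fingerprint_mat = np.array([
    [1, 1, 1, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 1, 1, 1, 1],
], dtype=bool)

# Condensed Jaccard distance matrix, as in step 2.
dist_mat = pdist(fingerprint_mat, 'jaccard')

# Average linkage is an assumption here; pick the method that suits your data.
z = hc.linkage(dist_mat, method='average')

# Cut the tree so that merges above a distance of 0.5 are not applied.
cluster_ids = hc.fcluster(z, t=0.5, criterion='distance')

print(cluster_ids)
```

Each entry of cluster_ids is the cluster label for the corresponding row,
so it can be paired with smiles_list to see which molecules ended up
together.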
Good luck!
Curt
On Sat, May 14, 2016 at 9:11 AM, Robert DeLisle <rkdeli...@gmail.com> wrote:
> Next up is clustering...
>
> I've got about 350 structures to cluster and I've worked through the
> example code from the RDKit Cookbook (
> http://www.rdkit.org/docs/Cookbook.html#clustering-molecules). All seems
> well and good there, but I would like to see the dendrogram. I see that
> there is a ClusterVis module to generate images, PDF, and SVG, but all
> require a Cluster object as input. I don't find anywhere a description of
> acquiring or building that object based upon the results of clustering.
>
> Any tips?
>
> -Kirk
>
> _______________________________________________
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
>