Hello, I've been experimenting with the above module, and I am stuck with the cluster object: I cannot find how to cut the tree at a specified distance and get the list of cluster indices for the original molecules.
See the code below. The 'Print()' function does show a tree with a 'Metric' value for each node, but how would I process the object further to get a specific flat clustering rather than a tree? [E.g. in R one would use hclust and then cutree with h set to the desired distance, yielding a list of integer cluster indices (obviously starting at 1, for the joy of python aficionados :D)].

Thanks
Giovanni

import pandas as pd
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors
from rdkit import DataStructs
from rdkit.ML.Cluster import Murtagh

# 20 SMILES from Enamine REAL
SMILES_list = [
    'O=C(c1nc(C2CC2)oc1)NC[C@@H]3CN(C(C4OCCO4)=O)CCC3',
    'CC(OCc1onc(C(NC2(CC3(NC(C4[C@H]([C@@H]5C4)CC5)=O)C2)C3)=O)c1)(C)C |&1:17,16|',
    'CC1(C(C(NCC2(CC2)NC(CN3C(=O)OCC3)=O)=O)CCCC1)C',
    'CC(NC([C@H]1[C@@H](F)C1)=O)CC2CN(C(c3c(Br)cn(C)n3)=O)C2',
    'Cc1nc(c2cc1)ccc(C(N(C(CNC(C3(C(C)(C)C3)O)=O)C)C)=O)c2',
    'COc1c(S(N2Cc(n3CC2)cnc3)(=O)=O)cccc1[N+]([O-])=O',
    'Cc1c(C(N2CCN(CC3OCCOC3)CC2)=O)cccc1O',
    'Cc1cc(C)n(C(C(N2CCN(C(C(=O)N)=O)CCC2)=O)C)n1',
    'CCOCC(C(N[C@H]1[C@H]2C[C@H](CN2C(C(C=NNC3=O)=C3)=O)C1)=O)C |&1:7,10,8|',
    'Cc1ncsc1CC(NCC2CCN(C(CC3C(C)C3)=O)CC2)=O',
    'Cc1nc(C=CC(NCC(N2CC(C)OC(C)C2)(C)C)=O)[nH]c1',
    'Cc1nnsc1C(NC[C@H](NC(C2CC=CC2)=O)CO)=O',
    'CC(C(N[C@H]1[C@H](CC2CC2)CN(C(c3[nH]ccc3)=O)C1)=O)(CC#N)C |&1:4,5|',
    'CCOCCON=C1CCN(C(CCSC)=O)CC1',
    'C[C@H](N(C(C1CCC=CC1)=O)C)CNC(c2cccc(OC(C)(C)C)c2)=O',
    'Cc1nc(C(C)C)c(C(N2C[C@H](NC([C@@H]3C[C@H](C(=O)N)CC3)=O)CC2)=O)cc1 |&1:14,16|',
    'CCc1occc1C(N2CC(OCc3nnn(C4CC4)c3)CCC2)=O',
    'CN(C(CNC(C1CC1)=O)=O)CC2CCN(C(C3(N(C)CCC3)C)=O)CC2',
    'CC(CCCC1)=C1C(N2CC3(CCC(NC(c4c(C)n[nH]n4)=O)CC3)CC2)=O',
    'CCC(N1CC(C(O)(C)C)(CNC(Cn2c(c3cc2)cc(Cl)cc3)=O)C1)=O',
]

# Generate the fingerprints (Morgan radius 3, folded to 2048 bits)
fps = [rdMolDescriptors.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(sm),
                                                      radius=3, nBits=2048,
                                                      useChirality=False)
       for sm in SMILES_list]

# Generate the distance matrix in a standard list format
# see: https://www.rdkit.org/docs/source/rdkit.ML.Cluster.Murtagh.html
# for i<j: d_ij = dists[j*(j-1)//2 + i]
distmat_list = []
for i in range(1, len(fps)):
    sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i])
    distmat_list.extend([1 - s for s in sims])

# Create the hierarchical clustering object, using the SLINK method from Murtagh
hcl_slink = Murtagh.ClusterData(distmat_list, nPts=len(fps),
                                method=Murtagh.SLINK, isDistData=True)

hcl_slink[0].Print()

Cluster(39) Metric: 0.876033
  Cluster(6) Metric: 0.000000
  Cluster(38) Metric: 0.872549
    Cluster(14) Metric: 0.000000
    Cluster(37) Metric: 0.852174
      Cluster(29) Metric: 0.842593
        Cluster(3) Metric: 0.000000
        Cluster(18) Metric: 0.000000
      Cluster(36) Metric: 0.850877
        Cluster(7) Metric: 0.000000
        Cluster(35) Metric: 0.850394
          Cluster(17) Metric: 0.000000
          Cluster(34) Metric: 0.850000
            Cluster(19) Metric: 0.000000
            Cluster(33) Metric: 0.849558
              Cluster(11) Metric: 0.000000
              Cluster(32) Metric: 0.848739
                Cluster(13) Metric: 0.000000
                Cluster(31) Metric: 0.848739
                  Cluster(28) Metric: 0.842105
                    Cluster(26) Metric: 0.833333
                      Cluster(1) Metric: 0.000000
                      Cluster(25) Metric: 0.831858
                        Cluster(10) Metric: 0.000000
                        Cluster(20) Metric: 0.000000
                    Cluster(27) Metric: 0.834783
                      Cluster(9) Metric: 0.000000
                      Cluster(24) Metric: 0.830357
                        Cluster(4) Metric: 0.000000
                        Cluster(23) Metric: 0.825243
                          Cluster(8) Metric: 0.000000
                          Cluster(16) Metric: 0.000000
                  Cluster(30) Metric: 0.842975
                    Cluster(2) Metric: 0.000000
                    Cluster(22) Metric: 0.813084
                      Cluster(12) Metric: 0.000000
                      Cluster(21) Metric: 0.791304
                        Cluster(5) Metric: 0.000000
                        Cluster(15) Metric: 0.000000
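For reference, the kind of cut I am after could be sketched as a plain tree walk. This is only a sketch under assumptions, not a documented RDKit recipe: it assumes each node of the tree exposes GetChildren(), GetMetric(), GetIndex() and IsTerminal(), and it uses a hypothetical stand-in class (Node) with those accessor names so that the example runs without RDKit. The function names cut_tree and leaf_indices are likewise made up for illustration.

```python
class Node:
    """Hypothetical stand-in exposing the accessors assumed for cluster nodes."""

    def __init__(self, index, metric=0.0, children=()):
        self._index = index
        self._metric = metric
        self._children = list(children)

    def GetIndex(self):
        return self._index

    def GetMetric(self):
        return self._metric

    def GetChildren(self):
        return self._children

    def IsTerminal(self):
        return not self._children


def leaf_indices(node):
    """Collect the indices of all leaves (original points) below a node."""
    if node.IsTerminal():
        return [node.GetIndex()]
    leaves = []
    for child in node.GetChildren():
        leaves.extend(leaf_indices(child))
    return leaves


def cut_tree(root, height):
    """Return {leaf index: cluster label}, with labels starting at 1.

    A subtree becomes one flat cluster as soon as its merge distance is
    <= height. This assumes merge distances never increase when going
    from a parent to its children (true for single linkage).
    """
    labels = {}
    next_label = 1
    stack = [root]
    while stack:
        node = stack.pop()
        if node.IsTerminal() or node.GetMetric() <= height:
            for idx in leaf_indices(node):
                labels[idx] = next_label
            next_label += 1
        else:
            # Reversed so that children are visited in their stored order
            stack.extend(reversed(node.GetChildren()))
    return labels


# Toy tree: points 1 and 2 merge at 0.2, point 3 joins them at 0.8
toy = Node(5, 0.8, [Node(3), Node(4, 0.2, [Node(1), Node(2)])])
flat = cut_tree(toy, 0.5)
memberships = [flat[i] for i in sorted(flat)]  # one label per point, in index order
print(memberships)
```

If the real cluster nodes do expose these accessors, something like cut_tree(hcl_slink[0], 0.85) would presumably give one label per molecule, keyed by the 1-based leaf index shown in the printout. The label numbers are arbitrary (assigned in tree traversal order), so they would not necessarily match what cutree returns in R.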