Hello,
I've been experimenting with the rdkit.ML.Cluster.Murtagh module, and I am stuck
with the cluster object: I cannot find how to cut the tree at a specified
distance and get the list of cluster indices for the original molecules.

See the code below.
The Print() method shows a tree with a 'Metric' value for each node, but how
would I process the object further to get a specific clustering rather than
a tree?
[E.g. in R one would use hclust and then cutree with h set to the desired
distance, yielding a vector of integer cluster indices (obviously starting at 1,
for the joy of Python aficionados :D).]

Thanks
Giovanni

from rdkit import Chem
from rdkit.Chem import rdMolDescriptors
from rdkit import DataStructs
from rdkit.ML.Cluster import Murtagh

# 20 SMILES from Enamine REAL
SMILES_list = ['O=C(c1nc(C2CC2)oc1)NC[C@@H]3CN(C(C4OCCO4)=O)CCC3',
'CC(OCc1onc(C(NC2(CC3(NC(C4[C@H]([C@@H]5C4)CC5)=O)C2)C3)=O)c1)(C)C |&1:17,16|',
'CC1(C(C(NCC2(CC2)NC(CN3C(=O)OCC3)=O)=O)CCCC1)C',
'CC(NC([C@H]1[C@@H](F)C1)=O)CC2CN(C(c3c(Br)cn(C)n3)=O)C2',
'Cc1nc(c2cc1)ccc(C(N(C(CNC(C3(C(C)(C)C3)O)=O)C)C)=O)c2',
'COc1c(S(N2Cc(n3CC2)cnc3)(=O)=O)cccc1[N+]([O-])=O',
'Cc1c(C(N2CCN(CC3OCCOC3)CC2)=O)cccc1O',
'Cc1cc(C)n(C(C(N2CCN(C(C(=O)N)=O)CCC2)=O)C)n1',
'CCOCC(C(N[C@H]1[C@H]2C[C@H](CN2C(C(C=NNC3=O)=C3)=O)C1)=O)C |&1:7,10,8|',
'Cc1ncsc1CC(NCC2CCN(C(CC3C(C)C3)=O)CC2)=O',
'Cc1nc(C=CC(NCC(N2CC(C)OC(C)C2)(C)C)=O)[nH]c1',
'Cc1nnsc1C(NC[C@H](NC(C2CC=CC2)=O)CO)=O',
'CC(C(N[C@H]1[C@H](CC2CC2)CN(C(c3[nH]ccc3)=O)C1)=O)(CC#N)C |&1:4,5|',
'CCOCCON=C1CCN(C(CCSC)=O)CC1',
'C[C@H](N(C(C1CCC=CC1)=O)C)CNC(c2cccc(OC(C)(C)C)c2)=O',
'Cc1nc(C(C)C)c(C(N2C[C@H](NC([C@@H]3C[C@H](C(=O)N)CC3)=O)CC2)=O)cc1 |&1:14,16|',
'CCc1occc1C(N2CC(OCc3nnn(C4CC4)c3)CCC2)=O',
'CN(C(CNC(C1CC1)=O)=O)CC2CCN(C(C3(N(C)CCC3)C)=O)CC2',
'CC(CCCC1)=C1C(N2CC3(CCC(NC(c4c(C)n[nH]n4)=O)CC3)CC2)=O',
'CCC(N1CC(C(O)(C)C)(CNC(Cn2c(c3cc2)cc(Cl)cc3)=O)C1)=O']

# Generate the fingerprints (Morgan radius 3, folded to 2048 bits)
fps = [
    rdMolDescriptors.GetMorganFingerprintAsBitVect(
        Chem.MolFromSmiles(sm), radius=3, nBits=2048, useChirality=False)
    for sm in SMILES_list
]

# Generate the distance matrix in a standard list format
# see: https://www.rdkit.org/docs/source/rdkit.ML.Cluster.Murtagh.html
# for i<j: d_ij = dists[j*(j-1)//2 + i]
distmat_list = []
for i in range(1, len(fps)):
    sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i])
    distmat_list.extend([1 - s for s in sims])

# Create the hierarchical clustering object, using the SLINK method from Murtagh
hcl_slink = Murtagh.ClusterData(distmat_list, nPts=len(fps),
                                method=Murtagh.SLINK, isDistData=True)

hcl_slink[0].Print()

Cluster(39) Metric: 0.876033
  Cluster(6)      Metric: 0.000000
  Cluster(38)     Metric: 0.872549
    Cluster(14)   Metric: 0.000000
    Cluster(37)   Metric: 0.852174
      Cluster(29) Metric: 0.842593
        Cluster(3)      Metric: 0.000000
        Cluster(18)     Metric: 0.000000
      Cluster(36) Metric: 0.850877
        Cluster(7)      Metric: 0.000000
        Cluster(35)     Metric: 0.850394
          Cluster(17)   Metric: 0.000000
          Cluster(34)   Metric: 0.850000
            Cluster(19) Metric: 0.000000
            Cluster(33) Metric: 0.849558
              Cluster(11)     Metric: 0.000000
              Cluster(32)     Metric: 0.848739
                Cluster(13)   Metric: 0.000000
                Cluster(31)   Metric: 0.848739
                  Cluster(28) Metric: 0.842105
                    Cluster(26)     Metric: 0.833333
                      Cluster(1)    Metric: 0.000000
                      Cluster(25)   Metric: 0.831858
                        Cluster(10) Metric: 0.000000
                        Cluster(20) Metric: 0.000000
                    Cluster(27)     Metric: 0.834783
                      Cluster(9)    Metric: 0.000000
                      Cluster(24)   Metric: 0.830357
                        Cluster(4)  Metric: 0.000000
                        Cluster(23) Metric: 0.825243
                          Cluster(8)      Metric: 0.000000
                          Cluster(16)     Metric: 0.000000
                  Cluster(30) Metric: 0.842975
                    Cluster(2)      Metric: 0.000000
                    Cluster(22)     Metric: 0.813084
                      Cluster(12)   Metric: 0.000000
                      Cluster(21)   Metric: 0.791304
                        Cluster(5)  Metric: 0.000000
                        Cluster(15) Metric: 0.000000
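
For reference, a cutree-style cut can be sketched by walking down from the root and stopping at every node whose Metric is at or below the chosen height; all leaves under such a node then share one cluster label. The snippet below is only a minimal sketch under stated assumptions: it assumes the tree nodes expose GetMetric(), GetChildren() and GetIndex() (the Node class here is a hypothetical stand-in so the example runs without RDKit), cut_tree and leaf_indices are hypothetical helper names, and the Metric is assumed to never increase from parent to child, as in the single-linkage tree printed above. The small demo tree is copied from the Cluster(30) branch of that output.

```python
class Node:
    """Hypothetical stand-in for a cluster-tree node (illustration only)."""

    def __init__(self, index, metric=0.0, children=()):
        self._index = index
        self._metric = metric
        self._children = list(children)

    def GetIndex(self):
        return self._index

    def GetMetric(self):
        return self._metric

    def GetChildren(self):
        return self._children


def leaf_indices(node):
    """Collect the indices of all leaves below (or equal to) `node`."""
    leaves, stack = [], [node]
    while stack:
        current = stack.pop()
        children = current.GetChildren()
        if children:
            stack.extend(children)
        else:
            leaves.append(current.GetIndex())
    return sorted(leaves)


def cut_tree(root, height):
    """Return {leaf index: cluster label}, with labels starting at 1.

    A node is split only while its metric is above `height`; this assumes
    the metric does not increase from a parent to its children.
    """
    groups, stack = [], [root]
    while stack:
        node = stack.pop()
        children = node.GetChildren()
        if children and node.GetMetric() > height:
            # Push in reverse so children are visited in their listed order.
            stack.extend(reversed(children))
        else:
            groups.append(leaf_indices(node))
    labels = {}
    for label, members in enumerate(groups, start=1):
        for idx in members:
            labels[idx] = label
    return labels


# Demo tree: the Cluster(30) branch from the printed output above.
n21 = Node(21, 0.791304, [Node(5), Node(15)])
n22 = Node(22, 0.813084, [Node(12), n21])
n30 = Node(30, 0.842975, [Node(2), n22])

print(cut_tree(n30, 0.80))
print(cut_tree(n30, 0.82))
```

If the assumption about the node methods holds, the same function could be called as cut_tree(hcl_slink[0], h) on the real tree. Judging from the printed output, the leaves appear to be numbered from 1, so a leaf index would presumably map to SMILES_list[index - 1]; that mapping is worth checking on a few molecules before relying on it.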