On 06/11/2010 16:12, Andrew Dalke wrote:
Hi all,

  I've found some rather serious problems in the OB MACCS definitions in the 
2.3 release.

I'm working more on my fingerprint generation codes, in preparation for my 
poster at the GCC conference in Goslar. In order to test the bit order, I 
constructed SMILES strings designed to hit specific bits in the MACCS keys.

  The test case "C1CCC1" should produce a "1" at position 11 (bit 10 if you 
count from 0). RDKit and OEChem both do this, but OpenBabel does not. I looked 
up OB's pattern definition and found:

11:('*1~*~*~*~*1',0), # 4M Ring *NOTE* Was '*1~*~*~*~1' This and 9 others changed by CM because OB didn't like it

This pattern is incorrect. It contains five atoms, so it matches a 5-membered 
ring. I tested the MACCS fingerprint with the 5-membered ring "C1CCCC1" and 
sure enough, OB gives a 1 for position 11.

With the 4-membered ring C1CCC1 I see that OB sets position 22 (bit 21) to 1, 
when it should be 0. Here's the OB SMARTS definition for that position:

22:('*1~*~*~*1',0), # 3M Ring

Here's the corresponding line from rdkit/Chem/MACCSkeys.py:

22:('*1~*~*~1',0), # 3M Ring

Here too, the modified OB definition looks for one atom too many.
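
The off-by-one can be checked without any toolkit by counting the atom
primitives in each pattern. The following is a minimal sketch, not a SMARTS
parser: it assumes a simple pattern describing one unbranched ring, with no
recursive SMARTS, so that every atom lies on the ring and the atom count
equals the ring size the pattern requires. The function name is hypothetical.

```python
import re

# One atom primitive: a bracket atom such as [!#6;!#1], or the wildcard *.
ATOM = re.compile(r"\[[^\]]*\]|\*")

def ring_size_required(smarts):
    """Count the atoms in a simple single-ring SMARTS pattern.

    Assumption: the pattern is one unbranched ring (as in the MACCS
    ring-size keys), so the number of atoms is the ring size.
    """
    return len(ATOM.findall(smarts))

for label, pattern in [
    ("key 11, RDKit ", "*1~*~*~*~1"),
    ("key 11, OB 2.3", "*1~*~*~*~*1"),
    ("key 22, RDKit ", "*1~*~*~1"),
    ("key 22, OB 2.3", "*1~*~*~*1"),
]:
    print(label, pattern, ring_size_required(pattern))
```

For each key, the OB 2.3 pattern has one more atom than the RDKit pattern,
which is consistent with the results for C1CCC1 and C1CCCC1 described above.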

I haven't done a full analysis of which other bits are incorrect in this way, 
but what I've found is enough to say that people shouldn't use OB's MACCS 
definitions until they've been reviewed and fixed.

Andrew

Thanks for raising this.

I have updated the data file MACCS.txt from the RDKit trunk, so the SMARTS patterns now match RDKit's exactly. The corrected data file is attached and can simply replace the original in OpenBabel v2.3.0 and earlier.

The modifications were rather unintelligently carried out, but they were originally necessary because OB's SMARTS parser required a ring-closure digit to follow an atom. It seems that the digit can now follow a bond.
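
To find the patterns that depend on this parser behaviour, one can look for a ring-closure digit written directly after a bond symbol. The sketch below is only a rough textual check under stated assumptions: it strips bracket atoms first (so that the "#6" in "[#6]" is not mistaken for a triple bond followed by a digit) and does not handle nested brackets in recursive SMARTS properly. The names are hypothetical.

```python
import re

# Bracket atoms are replaced by a placeholder so that their contents
# (atomic numbers, charges) cannot look like "bond followed by digit".
BRACKET_ATOM = re.compile(r"\[[^\]]*\]")
# A SMARTS bond symbol immediately followed by a ring-closure digit.
BOND_THEN_DIGIT = re.compile(r"[-=#:~@]\d")

def closes_ring_after_bond(smarts):
    """Return True if a ring-closure digit directly follows a bond symbol."""
    stripped = BRACKET_ATOM.sub("A", smarts)
    return BOND_THEN_DIGIT.search(stripped) is not None

for pattern in ["*1~*~*~*~1", "*1~*~*~*~*1", "[!#6;!#1]1~*~*~*~1", "[#6]#[#6]"]:
    print(pattern, closes_ring_after_bond(pattern))
```

Patterns flagged by this check are the ones worth comparing against the RDKit originals when reviewing an older MACCS.txt.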

Chris
#Comments after SMARTS
# Extracted from RDKit r1553 Nov 2010 rdkit/Chem/MACCSKeys.py
#
# Copyright (C) 2001-2008 greg Landrum and Rational Discovery LLC
#
#   @@ All Rights Reserved @@
#  This file is part of the RDKit.
#  The contents are covered by the terms of the BSD license
#  which is included in the file license.txt, found at the root
#  of the RDKit source tree.
#
# SMARTS definitions for the publically available MACCS keys

# I compared the MACCS fingerprints generated here with those from two
# other packages (not MDL, unfortunately). Of course there are
# disagreements between the various fingerprints still, but I think
# these definitions work pretty well. Some notes:

# 1) most of the differences have to do with aromaticity
# 2) there's a discrepancy sometimes because the current RDKit
# definitions do not require multiple matches to be distinct. e.g. the
# SMILES C(=O)CC(=O) can match the (hypothetical) key O=CC twice in my
# definition. It's not clear to me what the correct behavior is.
# 3) Some keys are not fully defined in the MDL documentation
# 4) Two keys, 125 and 166, have to be done outside of SMARTS.
# 5) Key 1 (ISOTOPE) isn't defined

# these are SMARTS patterns corresponding to the MDL MACCS keys
  1:('?',0), # ISOTOPE
  #2:('[#103,#104,#105,#106,#107,#106,#109,#110,#111,#112]',0),  # ISOTOPE Not complete
  2:('[#103,#104]',0),  # ISOTOPE Not complete
  3:('[Ge,As,Se,Sn,Sb,Te,Tl,Pb,Bi]',0), # Group IVa,Va,VIa Periods 4-6 (Ge...)  *NOTE* spec wrong
  4:('[Ac,Th,Pa,U,Np,Pu,Am,Cm,Bk,Cf,Es,Fm,Md,No,Lr]',0), # actinide
  5:('[Sc,Ti,Y,Zr,Hf]',0), # Group IIIB,IVB (Sc...)  *NOTE* spec wrong
  6:('[La,Ce,Pr,Nd,Pm,Sm,Eu,Gd,Tb,Dy,Ho,Er,Tm,Yb,Lu]',0), # Lanthanide
  7:('[V,Cr,Mn,Nb,Mo,Tc,Ta,W,Re]',0), # Group VB,VIB,VIIB (V...) *NOTE* spec wrong
  8:('[!#6;!#1]1~*~*~*~1',0), # q...@1
  9:('[Fe,Co,Ni,Ru,Rh,Pd,Os,Ir,Pt]',0), # Group VIII (Fe...)
  10:('[Be,Mg,Ca,Sr,Ba,Ra]',0), # Group IIa (Alkaline earth)
  11:('*1~*~*~*~1',0), # 4M Ring
  12:('[Cu,Zn,Ag,Cd,Au,Hg]',0), # Group IB,IIB (Cu..)
  13:('[#8]~[#7](~[#6])~[#6]',0), # ON(C)C
  14:('[#16]-[#16]',0), # S-S
  15:('[#8]~[#6](~[#8])~[#8]',0), # OC(O)O
  16:('[!#6;!#1]1~*~*~1',0), # q...@1
  17:('[#6]#[#6]',0), #CTC
  18:('[B,Al,Ga,In,Tl]',0), # Group IIIA (B...) *NOTE* spec wrong
  19:('*1~*~*~*~*~*~*~1',0), # 7M Ring
  20:('[Si]',0), #Si
  21:('[#6]=[#6](~[!#6;!#1])~[!#6;!#1]',0), # C=C(Q)Q
  22:('*1~*~*~1',0), # 3M Ring
  23:('[#7]~[#6](~[#8])~[#8]',0), # NC(O)O
  24:('[#7]-[#8]',0), # N-O
  25:('[#7]~[#6](~[#7])~[#7]',0), # NC(N)N
  26:('[#6]=;@[#6](@*)@*',0), # C$=C($A)$A
  27:('[I]',0), # I
  28:('[!#6;!#1]~[CH2]~[!#6;!#1]',0), # QCH2Q
  29:('[#15]',0),# P
  30:('[#6]~[!#6;!#1](~[#6])(~[#6])~*',0), # CQ(C)(C)A
  31:('[!#6;!#1]~[F,Cl,Br,I]',0), # QX
  32:('[#6]~[#16]~[#7]',0), # CSN
  33:('[#7]~[#16]',0), # NS
  34:('[CH2]=*',0), # CH2=A
  35:('[Li,Na,K,Rb,Cs,Fr]',0), # Group IA (Alkali Metal)
  36:('[#16R]',0), # S Heterocycle
  37:('[#7]~[#6](~[#8])~[#7]',0), # NC(O)N
  38:('[#7]~[#6](~[#6])~[#7]',0), # NC(C)N
  39:('[#8]~[#16](~[#8])~[#8]',0), # OS(O)O
  40:('[#16]-[#8]',0), # S-O
  41:('[#6]#[#7]',0), # CTN
  42:('F',0), # F
  43:('[!C;!c;!#1;!H0]~*~[!C;!c;!#1;!H0]',0), # QHAQH
  44:('?',0), # OTHER
  45:('[#6]=[#6]~[#7]',0), # C=CN
  46:('Br',0), # BR
  47:('[#16]~*~[#7]',0), # SAN
  48:('[#8]~[!#6;!#1](~[#8])(~[#8])',0), # OQ(O)O
  49:('[!+0]',0), # CHARGE  
  50:('[#6]=[#6](~[#6])~[#6]',0), # C=C(C)C
  51:('[#6]~[#16]~[#8]',0), # CSO
  52:('[#7]~[#7]',0), # NN
  53:('[!#6;!#1;!H0]~*~*~*~[!#6;!#1;!H0]',0), # QHAAAQH
  54:('[!#6;!#1;!H0]~*~*~[!#6;!#1;!H0]',0), # QHAAQH
  55:('[#8]~[#16]~[#8]',0), #OSO
  56:('[#8]~[#7](~[#8])~[#6]',0), # ON(O)C
  57:('[#8R]',0), # O Heterocycle
  58:('[!#6;!#1]~[#16]~[!#6;!#1]',0), # QSQ
  59:('[#16]!:*:*',0), # Snot%A%A
  60:('[#16]=[#8]',0), # S=O
  61:('*~[#16](~*)~*',0), # AS(A)A
  62:('*...@*!@*...@*',0), # A$!A$A
  63:('[#7]=[#8]',0), # N=O
  64:('*...@*!@[#16]',0), # A$A!S
  65:('c:n',0), # C%N
  66:('[#6]~[#6](~[#6])(~[#6])~*',0), # CC(C)(C)A
  67:('[!#6;!#1]~[#16]',0), # QS
  68:('[!#6;!#1;!H0]~[!#6;!#1;!H0]',0), # QHQH (&...) FIX: incomplete definition
  69:('[!#6;!#1]~[!#6;!#1;!H0]',0), # QQH
  70:('[!#6;!#1]~[#7]~[!#6;!#1]',0), # QNQ
  71:('[#7]~[#8]',0), # NO
  72:('[#8]~*~*~[#8]',0), # OAAO
  73:('[#16]=*',0), # S=A
  74:('[CH3]~*~[CH3]',0), # CH3ACH3
  75:('*...@[#7]@*',0), # A!N$A
  76:('[#6]=[#6](~*)~*',0), # C=C(A)A
  77:('[#7]~*~[#7]',0), # NAN
  78:('[#6]=[#7]',0), # C=N
  79:('[#7]~*~*~[#7]',0), # NAAN
  80:('[#7]~*~*~*~[#7]',0), # NAAAN
  81:('[#16]~*(~*)~*',0), # SA(A)A
  82:('*~[CH2]~[!#6;!#1;!H0]',0), # ACH2QH
  83:('[!#6;!#1]1~*~*~*~*~1',0), # qa...@1
  84:('[NH2]',0), #NH2
  85:('[#6]~[#7](~[#6])~[#6]',0), # CN(C)C
  86:('[C;H2,H3][!#6;!#1][C;H2,H3]',0), # CH2QCH2
  87:('[F,Cl,Br,i...@*@*',0), # X!A$A
  88:('[#16]',0), # S
  89:('[#8]~*~*~*~[#8]',0), # OAAAO
  
90:('[$([!#6;!#1;!H0]~*~*~[CH2]~*),$([!#6;!#1;!H0;r...@[r]@[...@[ch2;R]1),$([!#6;!#1;!h0]~[...@[r]@[CH2;R]1)]',0),
 # QHAACH2A
  
91:('[$([!#6;!#1;!H0]~*~*~*~[CH2]~*),$([!#6;!#1;!H0;r...@[r]@[...@[r]@[CH2;R]1),$([!#6;!#1;!h0]~[...@[r]@[...@[ch2;R]1),$([!#6;!#1;!h0]~*~[...@[r]@[CH2;R]1)]',0),
 # QHAAACH2A
  92:('[#8]~[#6](~[#7])~[#6]',0), # OC(N)C
  93:('[!#6;!#1]~[CH3]',0), # QCH3
  94:('[!#6;!#1]~[#7]',0), # QN
  95:('[#7]~*~*~[#8]',0), # NAAO
  96:('*1~*~*~*~*~1',0), # 5 M ring
  97:('[#7]~*~*~*~[#8]',0), # NAAAO
  98:('[!#6;!#1]1~*~*~*~*~*~1',0), # qaa...@1
  99:('[#6]=[#6]',0), # C=C
  100:('*~[CH2]~[#7]',0), # ACH2N
  
101:('[$([...@1@[...@[r]@[...@[r]@[...@[r]@[R]1),$([...@1@[...@[r]@[...@[r]@[...@[r]@[...@[r]1),$([...@1@[...@[r]@[...@[r]@[...@[r]@[...@[r]@[R]1),$([...@1@[...@[r]@[...@[r]@[...@[r]@[...@[r]@[...@[r]1),$([...@1@[...@[r]@[...@[r]@[...@[r]@[...@[r]@[...@[r]@[R]1),$([...@1@[...@[r]@[...@[r]@[...@[r]@[...@[r]@[...@[r]@[...@[r]1),$([...@1@[...@[r]@[...@[r]@[...@[r]@[...@[r]@[...@[r]@[...@[r]@[R]1)]',0),
 # 8M Ring or larger. This only handles up to ring sizes of 14
  102:('[!#6;!#1]~[#8]',0), # QO
  103:('Cl',0), # CL
  104:('[!#6;!#1;!H0]~*~[CH2]~*',0), # QHACH2A
  105:('*...@*(@*)@*',0), # A$A($A)$A
  106:('[!#6;!#1]~*(~[!#6;!#1])~[!#6;!#1]',0), # QA(Q)Q
  107:('[F,Cl,Br,I]~*(~*)~*',0), # XA(A)A
  108:('[CH3]~*~*~*~[CH2]~*',0), # CH3AAACH2A
  109:('*~[CH2]~[#8]',0), # ACH2O
  110:('[#7]~[#6]~[#8]',0), # NCO
  111:('[#7]~*~[CH2]~*',0), # NACH2A
  112:('*~*(~*)(~*)~*',0), # AA(A)(A)A
  113:('[#8]!:*:*',0), # Onot%A%A
  114:('[CH3]~[CH2]~*',0), # CH3CH2A
  115:('[CH3]~*~[CH2]~*',0), # CH3ACH2A
  116:('[$([CH3]~*~*~[CH2]~*),$([CH3]~*1~*~[CH2]1)]',0), # CH3AACH2A
  117:('[#7]~*~[#8]',0), # NAO
  118:('[$(*~[CH2]~[CH2]~*),$(*1~[CH2]~[CH2]1)]',1), # ACH2CH2A > 1
  119:('[#7]=*',0), # N=A
  120:('[!#6;R]',1), # Heterocyclic atom > 1 (&...) FIX: incomplete definition
  121:('[#7;R]',0), # N Heterocycle
  122:('*~[#7](~*)~*',0), # AN(A)A
  123:('[#8]~[#6]~[#8]',0), # OCO
  124:('[!#6;!#1]~[!#6;!#1]',0), # QQ
  125:('?',0), # Aromatic Ring > 1
  126:('*...@[#8]!@*',0), # A!O!A
  127:('*...@*!@[#8]',1), # A$A!O > 1 (&...) FIX: incomplete definition
  
128:('[$(*~[CH2]~*~*~*~[CH2]~*),$([...@[ch2;r...@[r]@[...@[r]@[CH2;R]1),$(*~[ch2]~[...@[r]@[...@[ch2;R]1),$(*~[ch2]~*~[...@[r]@[CH2;R]1)]',0),
 # ACH2AAACH2A
  
129:('[$(*~[CH2]~*~*~[CH2]~*),$([...@[ch2]@[...@[r]@[CH2;R]1),$(*~[ch2]~[...@[r]@[CH2;R]1)]',0),
 # ACH2AACH2A
  130:('[!#6;!#1]~[!#6;!#1]',1), # QQ > 1 (&...)  FIX: incomplete definition
  131:('[!#6;!#1;!H0]',1), # QH > 1
  132:('[#8]~*~[CH2]~*',0), # OACH2A
  133:('*...@*!@[#7]',0), # A$A!N
  134:('[F,Cl,Br,I]',0), # X (HALOGEN)
  135:('[#7]!:*:*',0), # Nnot%A%A
  136:('[#8]=*',1), # O=A>1 
  137:('[!C;!c;R]',0), # Heterocycle
  138:('[!#6;!#1]~[CH2]~*',1), # QCH2A>1 (&...) FIX: incomplete definition
  139:('[O;!H0]',0), # OH
  140:('[#8]',3), # O > 3 (&...) FIX: incomplete definition
  141:('[CH3]',2), # CH3 > 2  (&...) FIX: incomplete definition
  142:('[#7]',1), # N > 1
  143:('*...@*!@[#8]',0), # A$A!O
  144:('*!:*:*!:*',0), # Anot%A%Anot%A
  145:('*1~*~*~*~*~*~1',1), # 6M ring > 1
  146:('[#8]',2), # O > 2
  147:('[$(*~[CH2]~[CH2]~*),$([...@[ch2;r...@[ch2;R]1)]',0), # ACH2CH2A
  148:('*~[!#6;!#1](~*)~*',0), # AQ(A)A
  149:('[C;H3,H4]',1), # CH3 > 1
  150:('*...@*@*...@*',0), # A!A$A!A
  151:('[#7;!H0]',0), # NH
  152:('[#8]~[#6](~[#6])~[#6]',0), # OC(C)C
  153:('[!#6;!#1]~[CH2]~*',0), # QCH2A
  154:('[#6]=[#8]',0), # C=O
  155:('*...@[ch2]!@*',0), # A!CH2!A
  156:('[#7]~*(~*)~*',0), # NA(A)A
  157:('[#6]-[#8]',0), # C-O
  158:('[#6]-[#7]',0), # C-N
  159:('[#8]',1), # O>1
  160:('[C;H3,H4]',0), #CH3
  161:('[#7]',0), # N
  162:('a',0), # Aromatic
  163:('*1~*~*~*~*~*~1',0), # 6M Ring
  164:('[#8]',0), # O
  165:('[R]',0), # Ring
  166:('?',0), # Fragments  FIX: this can't be done in SMARTS

# obabel -:"CNO" -oftp -xs
# 24: N-O       68: QHQH (&...)         69: QQH 71: NO  93: QCH3        94: QN  102: QO
# 124: QQ       131: QH > 1     *2  139: OH     151: NH 158: C-N        160: CH3        161: N  164: O
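
Each entry in the table above has the form key:('SMARTS',count), # description. A minimal sketch of reading such lines follows. It makes two assumptions that the file itself does not state: that every entry fits on one line, and that the count is a threshold, so a key is set when the number of matches is greater than the count (this is inferred from comments such as "O > 3" next to a count of 3). The function names are hypothetical.

```python
import re

# key:('SMARTS',count), # description
ENTRY = re.compile(r"^\s*(\d+):\('(.*)',(\d+)\),\s*#\s*(.*)$")

def parse_entry(line):
    """Return (key, smarts, count, description), or None for other lines."""
    m = ENTRY.match(line)
    if m is None:
        return None  # comment lines, blank lines, commented-out entries
    key, smarts, count, description = m.groups()
    return int(key), smarts, int(count), description.strip()

def key_is_set(n_matches, count):
    """Assumed rule: the key is on when there are more than `count` matches."""
    return n_matches > count

print(parse_entry("  11:('*1~*~*~*~1',0), # 4M Ring"))
print(parse_entry("  140:('[#8]',3), # O > 3 (&...) FIX: incomplete definition"))
print(parse_entry("  #2:('[#103,#104]',0),  # commented-out entry"))
```

Entries whose pattern is '?' (keys 1, 44, 125 and 166) are not SMARTS and would need separate handling.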