RDKit implements the MACCS keys as a set of SMARTS patterns, plus a few bits coded by hand.
I don't know how widely people appreciate the impact of this on the other free software projects. OpenBabel and CDK both use copies of the RDKit definitions for their own MACCS keys. While I've seen earlier internal definitions, they were held rather closely, so it's very nice to have a public definition.

I'm reviewing the definitions as part of my chemfp project, which is one of the advantages of having an open definition. I've got some questions and suggestions about them. For reference, see

  http://rdkit.org/Python_Docs/rdkit.Chem.MACCSkeys-pysrc.html

* Bit 1 is

    1:('?',0), # ISOTOPE

  This explicitly isn't defined, but shouldn't it be [!0*] ? I tried out that SMARTS and I see that a SMILES of "C[14CH3]" has two matches in RDKit to [!0*] but in OEChem there's only one. I think the OEChem version is correct. I verified it at http://www.daylight.com/daycgi_tutorials/depictmatch.cgi with the SMILES of C[14CH2][13CH3] and SMARTS of [!0*]. Daylight matches 2 of the 3 atoms. I think this is a bug in RDKit, and once fixed it would mean this bit could be supported.

* Bit 2 is

    #2:('[#103,#104,#105,#106,#107,#106,#109,#110,#111,#112]',0), # ISOTOPE Not complete
    2:('[#103,#104]',0), # ISOTOPE Not complete

  I assume the comment is wrong, since this has nothing to do with isotopes. What's not complete about this definition, and/or why is the first one commented out?

* "*NOTE* spec wrong" occurs on many lines

  What does it mean?

* Bit 3 is

    3:('[Ge,As,Se,Sn,Sb,Te,Tl,Pb,Bi]',0), # Group IVa,Va,VIa Periods 4-6 (Ge...) *NOTE* spec wrong

  The "Tl" doesn't look right. Shouldn't the last three be Pb,Bi,Po ?

* Bit 18 is

    18:('[B,Al,Ga,In,Tl]',0), # Group IIIA (B...) *NOTE* spec wrong

  Boron may be aromatic according to the SMILES spec, so this should be [B,b, ...] or [#5, ... ].
  Also, here are the aromatic elements in OpenBabel:

    [se] [as] [si] [ge] [sb] [bi] [te] [sn]

  Not all of these are valid SMARTS according to Daylight, and RDKit doesn't support the same set of aromatics, so for a portable version (which I'm working on) they can be written as [#34], [#33], [#14], ...

  Oh, and aromatic lead has been synthesized: http://www.rsc.org/chemistryworld/News/2010/April/15041002.asp

* Bit 44 is

    44:('?',0), # OTHER

  Is this one of the undocumented bits or does "OTHER" mean something else?

* Bit 68 says

    FIX: incomplete definition

  Are there plans to complete this? My thought is that it isn't important one way or the other. Without a good validation set it would be hard to really pin this down.

  There are a number of other bits which are also marked "FIX: incomplete definition". Are they going to be fixed? Again, I don't think there's a pressing need without validation data.

* Bit 101 says:

    8M Ring or larger. This only handles up to ring sizes of 14

  Is it worthwhile to support larger rings? I don't think so. If yes, then it could be dealt with outside of the SMARTS, just like 125 and 166.

BTW, I also verified that all of the CH2 atoms were written as either [CH2] (if there are two bonds to other atoms) or [C;H2,H3] if there is only one bond (and similarly for [NH2]). While strange chemistries can cause this to fail as a substructure filter, I recognize that that is outside the scope of those definitions.

Cheers,

  Andrew
  da...@dalkescientific.com
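
P.S. Here is a minimal sketch of the portable rewrite mentioned under bit 18: it turns a list of element symbols into an atomic-number SMARTS term, so that aromatic and aliphatic forms are both covered. The function name portable_smarts and the small lookup table are only illustrative assumptions, not part of RDKit, OpenBabel, or chemfp.

```python
# Atomic numbers for the elements mentioned in this message.
# This is an illustrative subset, not a full periodic table.
ATOMIC_NUMBERS = {
    "B": 5, "Al": 13, "Si": 14, "Ga": 31, "Ge": 32, "As": 33, "Se": 34,
    "In": 49, "Sn": 50, "Sb": 51, "Te": 52, "Tl": 81, "Pb": 82, "Bi": 83,
    "Po": 84,
}


def portable_smarts(symbols):
    """Return a SMARTS atom expression built from atomic numbers.

    Hypothetical helper: accepts symbols in either case ("Se" or "se")
    and raises KeyError for elements missing from the table above.
    """
    terms = []
    for symbol in symbols:
        # Normalize "se" -> "Se" so aromatic-style input is accepted too.
        terms.append("#%d" % ATOMIC_NUMBERS[symbol.capitalize()])
    return "[" + ",".join(terms) + "]"


if __name__ == "__main__":
    # The element list from bit 18.
    print(portable_smarts(["B", "Al", "Ga", "In", "Tl"]))
```

For the bit 18 list this gives [#5,#13,#31,#49,#81]. Since a #n term matches on atomic number regardless of aromaticity, there is no need to list B and b separately.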