Hi Andrew,

I'm going to divide this into pieces in order to be able to answer in a reasonable amount of time. I'll do clarifying questions and quick answers in this one.

On Thu, May 26, 2011 at 4:02 PM, Andrew Dalke <da...@dalkescientific.com> wrote:
> RDKit implements the MACCS keys as a set of SMARTS patterns,
> plus a few bits coded by hand.
>
> I don't know how much people know the impact of this on the other
> free software projects. OpenBabel and CDK both use copies of the
> RDKit definitions for their own MACCS keys. While I've seen
> earlier internal definitions, they were held rather closely, so
> it's very nice to have a public definition.
>
> I'm reviewing the definitions as part of my chemfp project,
> which is one of the advantages to having an open definition.
>
> I've got some question or suggestions about them. For reference,
> see http://rdkit.org/Python_Docs/rdkit.Chem.MACCSkeys-pysrc.html
>
>
> * Bit 1 is
>
>   1:('?',0), # ISOTOPE
>
> This explicitly isn't defined, but shouldn't it be [!*0] ?
>
> I tried out that SMARTS and I see that a SMILES of "C[14CH3]"
> has two matches in RDKit to [!0*] but in OEChem there's only
> one. I think the OEChem version is correct. I verified it at
>
> http://www.daylight.com/daycgi_tutorials/depictmatch.cgi
> with the SMILES of C[14CH2][13CH3] and SMARTS of [!0*] .
> Daylight matches 2 of the 3 atoms.
>
> I think this is a bug in RDKit, and once fixed it would
> mean this bit could be supported.
>

My reading of the SMARTS theory manual (http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html) says that [0*] means "any atom with a mass of 0", so [!0*] would be "any atom that doesn't have a mass of 0". What am I missing?

>
> * Bit 2 is
>
>   #2:('[#103,#104,#105,#106,#107,#106,#109,#110,#111,#112]',0), # ISOTOPE Not complete
>   2:('[#103,#104]',0), # ISOTOPE Not complete
>
> I assume the comment is wrong, since this has nothing to do with isotopes.
>
> What's not complete about this definition, and/or why is the first one
> commented out?

I've got to see if I can find a description of the bits and I'll come back to these definition questions.
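
A small aside on the commented-out bit 2 pattern while I look for that description: it lists #106 twice and has no #108, which looks like a typo. Below is a minimal sketch of building such an atomic-number list programmatically so that entries can't be mistyped. The helper name `atomic_number_smarts` is made up for illustration, and the 103-112 range is simply read off the commented-out line, not a confirmed definition of the key.

```python
def atomic_number_smarts(first, last):
    """Build a SMARTS atom expression matching every atomic number
    from `first` to `last` inclusive, e.g. [#103,#104,...]."""
    return "[" + ",".join(f"#{z}" for z in range(first, last + 1)) + "]"


# Range taken from the commented-out pattern; an assumption, not the
# verified MACCS definition of bit 2.
pattern = atomic_number_smarts(103, 112)
print(pattern)  # [#103,#104,#105,#106,#107,#108,#109,#110,#111,#112]
```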
>
>   18:('[B,Al,Ga,In,Tl]',0), # Group IIIA (B...) *NOTE* spec wrong
>
> Boron may be aromatic according to the SMILES spec, so this
> should be [B,b, ...] or [#5, ... ].
>
> Also, here's the aromatic elements in OpenBabel:
>
>   [se]
>   [as]
>   [si]
>   [ge]
>   [sb]
>   [bi]
>   [te]
>   [sn]
>
> Not all of these are valid SMARTS according to Daylight, and
> RDKit doesn't support the same set of aromatics, so for a
> portable version (which I'm working on) they can be written as
>
>   [#34], [#33], [#14], ...
>
> Oh, and aromatic lead has been synthesized
>
> http://www.rsc.org/chemistryworld/News/2010/April/15041002.asp
>

Agreed that using the generic atomic-number form makes a lot more sense.

> * Bit 101 says:
>
>   8M Ring or larger. This only handles up to ring sizes of 14
>
> Is it worthwhile to support larger rings? I don't think so.
> If yes, then it could be dealt with outside of the SMARTS,
> just like 125 and 166.

Agreed that it's not really necessary to support larger rings. Systems with rings larger than 14 would end up missing a single bit.

> BTW, I also verified that all of the CH2 atoms were written
> as either [CH2] (if there are two bonds other atoms) or
> [C;H2,H3] if there is only one bond (and similar with [NH2]).
> While strange chemistries can cause this to fail as a
> substructure filter, I recognize that that is outside
> the scope of those definitions.

It certainly is for me. :-)

-greg
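
P.S. To make the atomic-number form for bit 18 concrete, here is a minimal sketch of converting an element list into the portable `[#n,...]` style. The table `ATOMIC_NUMBER` and the helper `portable_atom_smarts` are hypothetical names used only for this illustration; they are not part of RDKit or of the MACCS key definitions. The table covers just the elements mentioned in this thread.

```python
# Atomic numbers for the elements discussed above (Group IIIA plus the
# elements OpenBabel treats as aromatic, and lead).
ATOMIC_NUMBER = {
    "B": 5, "Al": 13, "Si": 14, "Ga": 31, "Ge": 32, "As": 33, "Se": 34,
    "In": 49, "Sn": 50, "Sb": 51, "Te": 52, "Tl": 81, "Pb": 82, "Bi": 83,
}


def portable_atom_smarts(symbols):
    """Return a SMARTS atom expression using only [#n] primitives, so the
    match does not depend on which aromatic symbols a toolkit accepts."""
    return "[" + ",".join(f"#{ATOMIC_NUMBER[s]}" for s in symbols) + "]"


# Group IIIA, written without relying on 'B' vs 'b'.
group_iiia = portable_atom_smarts(["B", "Al", "Ga", "In", "Tl"])
print(group_iiia)  # [#5,#13,#31,#49,#81]
```

Because `[#5]` matches boron regardless of aromaticity, the same pattern should behave consistently in toolkits that disagree about which lowercase aromatic symbols are legal.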