RDKit implements the MACCS keys as a set of SMARTS patterns,
plus a few bits coded by hand.

I don't know how widely people appreciate the impact of this on other
free software projects. OpenBabel and CDK both use copies of the
RDKit definitions for their own MACCS keys. While I've seen
earlier internal definitions, they were held rather closely, so
it's very nice to have a public definition.

I'm reviewing the definitions as part of my chemfp project,
which is one of the advantages to having an open definition.

I've got some questions and suggestions about them. For reference,
see http://rdkit.org/Python_Docs/rdkit.Chem.MACCSkeys-pysrc.html


* Bit 1 is

   1:('?',0), # ISOTOPE 

This explicitly isn't defined, but shouldn't it be [!0*] ?

I tried out that SMARTS and I see that a SMILES of "C[14CH3]"
has two matches in RDKit to [!0*] but in OEChem there's only
one. I think the OEChem version is correct. I verified it at

http://www.daylight.com/daycgi_tutorials/depictmatch.cgi
with the SMILES of C[14CH2][13CH3] and SMARTS of [!0*] .
Daylight matches 2 of the 3 atoms.

I think this is a bug in RDKit, and once fixed it would
mean this bit could be supported.
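
As a toolkit-independent cross-check of the expected counts, here is a
minimal sketch which only scans the SMILES text for bracket atoms with a
nonzero isotope. It is not a SMARTS matcher and it assumes ordinary
bracket-atom syntax; the name count_isotope_atoms is mine, not from any
toolkit.

```python
import re

# A bracket atom that starts with an isotope number, e.g. "[14CH3]".
_ISOTOPE_ATOM = re.compile(r"\[(\d+)[A-Za-z*]")

def count_isotope_atoms(smiles):
    """Count bracket atoms whose isotope is given and is not zero."""
    return sum(1 for m in _ISOTOPE_ATOM.finditer(smiles)
               if int(m.group(1)) != 0)

for smi in ("C[14CH3]", "C[14CH2][13CH3]"):
    print(smi, count_isotope_atoms(smi))
```

This gives 1 and 2 for the two SMILES above, which agrees with what
OEChem and Daylight report for [!0*].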


* Bit 2 is

  #2:('[#103,#104,#105,#106,#107,#106,#109,#110,#111,#112]',0),  # ISOTOPE Not complete
   2:('[#103,#104]',0),  # ISOTOPE Not complete 

I assume the comment is wrong, since this has nothing to do with isotopes.

What's not complete about this definition, and/or why is the first one 
commented out?

* "*NOTE* spec wrong" occurs on many lines

What does it mean?

* Bit 3 is

 3:('[Ge,As,Se,Sn,Sb,Te,Tl,Pb,Bi]',0), # Group IVa,Va,VIa Periods 4-6 (Ge...)  *NOTE* spec wrong

The "Tl" doesn't look right. Shouldn't the last three be Pb,Bi,Po ?

*  Bit 18 is

   18:('[B,Al,Ga,In,Tl]',0), # Group IIIA (B...) *NOTE* spec wrong 

Boron may be aromatic according to the SMILES spec, so this
should be [B,b, ...] or [#5, ... ].

Also, here are the aromatic elements in OpenBabel:

[se]
[as]
[si]
[ge]
[sb]
[bi]
[te]
[sn]

Not all of these are valid SMARTS according to Daylight, and
RDKit doesn't support the same set of aromatics, so for a
portable version (which I'm working on) they can be written as

[#34], [#33], [#14], ...
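
A minimal sketch of that rewrite, assuming a hand-written table of
atomic numbers. The names ATOMIC_NUMBER and portable_atom_smarts are
mine and purely illustrative. An atomic-number term like [#5] matches
the element whether it is aliphatic or aromatic, so no toolkit-specific
lowercase symbols are needed.

```python
# Atomic numbers for the elements mentioned in this message.
ATOMIC_NUMBER = {
    "B": 5, "Al": 13, "Si": 14, "Ga": 31, "Ge": 32, "As": 33,
    "Se": 34, "In": 49, "Sn": 50, "Sb": 51, "Te": 52, "Tl": 81,
    "Pb": 82, "Bi": 83,
}

def portable_atom_smarts(symbols):
    """Build a SMARTS atom expression using only atomic numbers."""
    terms = ["#%d" % ATOMIC_NUMBER[symbol] for symbol in symbols]
    return "[" + ",".join(terms) + "]"

# Bit 18, Group IIIA
print(portable_atom_smarts(["B", "Al", "Ga", "In", "Tl"]))
```

For bit 18 this produces [#5,#13,#31,#49,#81].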

Oh, and aromatic lead has been synthesized

 http://www.rsc.org/chemistryworld/News/2010/April/15041002.asp


*  Bit 44 is

  44:('?',0), # OTHER 

Is this one of the undocumented bits or does "OTHER" mean
something else?

*  Bit 68 says

    FIX: incomplete definition

Are there any plans to complete this? My thought is that it isn't
important one way or the other. Without a good validation set
it would be hard to really pin this down.

There are a number of other bits which are also marked "FIX:
incomplete definition". Are they going to be fixed? Again, I
don't think there's a pressing need without validation data.

* Bit 101 says:

  8M Ring or larger. This only handles up to ring sizes of 14 

Is it worthwhile to support larger rings? I don't think so.
If yes, then it could be dealt with outside of the SMARTS,
just like 125 and 166.
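
If it were handled outside of the SMARTS, a minimal sketch might look
like the following. It assumes the ring sizes come from the toolkit's
own ring perception (in RDKit, I would expect something like
[len(ring) for ring in mol.GetRingInfo().AtomRings()]); the function
name has_large_ring is mine.

```python
def has_large_ring(ring_sizes, minimum=8):
    """Return True if any ring has at least `minimum` members."""
    return any(size >= minimum for size in ring_sizes)

# Hypothetical ring sizes for three molecules.
print(has_large_ring([6, 6]))    # only six-membered rings
print(has_large_ring([5, 16]))   # has a 16-membered ring
print(has_large_ring([]))        # acyclic
```

This has no upper limit on ring size, so the cutoff at 14 goes away.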


BTW, I also verified that all of the CH2 atoms were written
as either [CH2] (if there are two bonds to other atoms) or
[C;H2,H3] if there is only one bond (and similarly for [NH2]).
While strange chemistries can cause this to fail as a
substructure filter, I recognize that that is outside
the scope of those definitions. 

Cheers,


                                Andrew
                                da...@dalkescientific.com



_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
