Re: [Rdkit-discuss] MACCS SMARTS pattern definitions
I'm traveling for the next week, without a laptop, so I won't really be able to look at this until the 7th of June.

-greg

On Sunday, May 29, 2011, Andrew Dalke da...@dalkescientific.com wrote:
> Like I said, working on the validation code is very hard. Or at least tedious. There are only 25 more bits to write check cases for.
>
> One of them is bit 141, defined as "CH3 > 2". That is, at least three matches to the SMARTS [CH3].
>
> Then down in bit 160 it's CH3 with at least 1 match to the SMARTS [C;H3,H4].
>
> I think bit 141 should have the same SMARTS, to include CH4. It's hard to construct a real case where this will make a difference, so I'm not sure this is even appropriate.
>
> Greg? Any thoughts?
>
> Andrew
> da...@dalkescientific.com

___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] MACCS SMARTS pattern definitions
Like I said, working on the validation code is very hard. Or at least tedious. There are only 25 more bits to write check cases for.

One of them is bit 141, defined as "CH3 > 2". That is, at least three matches to the SMARTS [CH3].

Then down in bit 160 it's CH3 with at least 1 match to the SMARTS [C;H3,H4].

I think bit 141 should have the same SMARTS, to include CH4. It's hard to construct a real case where this will make a difference, so I'm not sure this is even appropriate.

Greg? Any thoughts?

Andrew
da...@dalkescientific.com
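To make the difference between the two SMARTS concrete, here is a small toy model in Python. It does not use a cheminformatics toolkit: each molecule is reduced to a list of hydrogen counts on its aliphatic carbons, which is a simplification introduced only for this sketch, and the helper names are hypothetical. It assumes the (pattern, count) convention discussed in this thread, where a bit is set when the number of matches exceeds the count.

```python
# Toy model: a "molecule" is just a list of hydrogen counts, one per
# aliphatic carbon atom. This is an illustrative simplification, not how
# RDKit or any other toolkit represents molecules.

def count_matches(carbon_h_counts, allowed_h):
    """Count carbons whose hydrogen count is in allowed_h.

    allowed_h={3} mimics the SMARTS [CH3];
    allowed_h={3, 4} mimics the SMARTS [C;H3,H4].
    """
    return sum(1 for h in carbon_h_counts if h in allowed_h)


def bit_is_set(n_matches, count):
    """A key defined as (pattern, count) is set when matches > count."""
    return n_matches > count


isobutane = [3, 3, 3, 1]      # CC(C)C: three CH3 groups and one CH
three_methanes = [4, 4, 4]    # C.C.C: three disconnected CH4 molecules

for name, mol in [("isobutane", isobutane), ("three methanes", three_methanes)]:
    strict = bit_is_set(count_matches(mol, {3}), 2)       # [CH3] > 2
    relaxed = bit_is_set(count_matches(mol, {3, 4}), 2)   # [C;H3,H4] > 2
    print(name, strict, relaxed)
```

In this model the two definitions can only disagree when the record contains methane, which fits the remark that a real case is hard to construct.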
Re: [Rdkit-discuss] MACCS SMARTS pattern definitions
Hi Greg,

> My reading of the SMARTS theory manual (http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html) says that [0*] means any atom with a mass of 0, so [!0*] would be any atom that doesn't have a mass of 0. What am I missing?

In the Daylight, OpenEye, and OpenBabel data models, an incoming atom which doesn't have an assigned isotope number is given the isotope number of 0. That is, they treat [0S] the same as [S].

I just posted an email to the BlueObelisk-SMILES list on this topic. The OpenSMILES spec says that these two atoms should be different, but I don't think that's right.

A problem with the Daylight docs is that they don't distinguish between atomic weight/atomic mass and isotope number. For example, at the API level, to get the isotope number you call dt_weight:

  http://www.daylight.com/dayhtml/doc/man/man3/dt_weight.html

  dt_weight(dt_Handle) = dt_Integer

meaning that mass == weight == isotope is always treated as an int.

I see that RDKit doesn't store the isotope, but tracks the atomic mass instead. I don't believe that's the right solution.

> Agreed that using the generic atomic-number form makes a lot more sense.

When my updated definitions, with atomic numbers, are available, I'll let you know.

Grrr (in a chuckling sort of way)! Now I have to resynchronize my definitions to the changes you just made!

Andrew
da...@dalkescientific.com
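The integer-isotope data model described in this message can be sketched in a few lines of Python. This is only a model of the convention as stated here (an unspecified isotope is stored as 0); the Atom class and the helper name are hypothetical and are not taken from Daylight, OpenEye, OpenBabel, or RDKit.

```python
from dataclasses import dataclass


@dataclass
class Atom:
    symbol: str
    isotope: int = 0   # 0 means "no isotope specified" in this model


def matches_not_zero_isotope(atom):
    """Mimic the SMARTS [!0*]: any atom whose isotope number is not 0."""
    return atom.isotope != 0


# Under this model [0S] and [S] are the same atom.
print(Atom("S") == Atom("S", 0))   # True

# C[14CH2][13CH3], written out by hand as (symbol, isotope) pairs.
atoms = [Atom("C"), Atom("C", 14), Atom("C", 13)]
n_matches = sum(matches_not_zero_isotope(a) for a in atoms)
print(n_matches)   # 2
```

Two of the three atoms match, which agrees with the Daylight depictmatch result reported in this thread for the same SMILES and SMARTS.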
Re: [Rdkit-discuss] MACCS SMARTS pattern definitions
On May 27, 2011, at 6:01 AM, Greg Landrum wrote:
> And now a more philosophical point about this.
> ...
> The idea of the MACCS keys is simple: a limited set of structural keys that can be used to speed up substructure searches and which have since been (ab)used for chemical similarity. It seems like it would be a lot more helpful to the community if we had a set of keys like this that is based on a truly open definition.
> ...
> What do you think Andrew? Want to work together on this?

Sure! I've been working on my PubChem-like substructure keys all this week. The pattern definitions are available at

  http://code.google.com/p/chem-fingerprints/source/browse/chemfp/substruct.patterns

Validation is the hardest part, since I mostly only have the PubChem substructure bits as an oracle of what I'm supposed to get. I think I'm down to differences in how CACTVS does aromaticity (lots of mismatches because of that!) and lack of support for PubChem's PUBCHEM_NONSTANDARDBOND bond definitions:

  ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem_sdtags.txt

I've got implementations for OpenBabel, RDKit, and OEChem. If you want to try it out, after you've installed the package, run

  rdkit2fps --substruct $STRUCTURE_FILENAME

I've also converted RDKit's MACCS patterns into my format definition at

  http://code.google.com/p/chem-fingerprints/source/browse/chemfp/rdmaccs.patterns

which I've used in part as a cross-test to make sure my implementation using RDKit matches RDKit's own implementation. I've been calling it "rdmaccs". Any problems with that? Want another name?

My hope is to get this out in a 0.95 (or perhaps 1.0 alpha?) build today and announce it. What's mostly lacking are:
  - full validation (very hard, given aromaticity differences)
  - a solid test suite (that's amazingly hard to do)
  - documentation

Oh yeah, and write up some sort of paper on what I did.

Andrew
da...@dalkescientific.com
Re: [Rdkit-discuss] MACCS SMARTS pattern definitions
On Fri, May 27, 2011 at 12:23 PM, Andrew Dalke da...@dalkescientific.com wrote:
> Hi Greg,
>
>> My reading of the SMARTS theory manual (http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html) says that [0*] means any atom with a mass of 0, so [!0*] would be any atom that doesn't have a mass of 0. What am I missing?
>
> In the Daylight, OpenEye, and OpenBabel data models, an incoming atom which doesn't have an assigned isotope number is given the isotope number of 0. That is, they treat [0S] the same as [S].

That is definitely wrong according to the Daylight theory manual:

  "Isotopic specifications are indicated by preceding the atomic symbol with a number equal to the desired integral atomic mass. An atomic mass can only be specified inside brackets."

So [0S] would be S with an atomic mass of 0.

> I just posted an email to the BlueObelisk-SMILES list on this topic. The OpenSMILES spec says that these two atoms should be different, but I don't think that's right.

We can agree to change it, but it's certainly consistent with what Daylight says in the theory manual.

-greg
Re: [Rdkit-discuss] MACCS SMARTS pattern definitions
On Fri, May 27, 2011 at 3:47 PM, Andrew Dalke da...@dalkescientific.com wrote:
> On May 27, 2011, at 1:25 PM, Greg Landrum wrote:
>> That is definitely wrong according to the Daylight theory manual: Isotopic specifications are indicated by preceding the atomic symbol with a number equal to the desired integral atomic mass.
>
> Yes, and I think they are being imprecise, but since SMILES is meant for normal chemistry, it's in an area where imprecision doesn't make much difference.
>
> Where does it make a difference? High resolution mass spec, for one. The mass of 28Si is not 28.0 but 27.9769265325.

No arguments here. But that doesn't address the [0Si] question.

> I've been looking at how RDKit handles isotopes/mass, and I think there are some good examples of how its current approach can cause confusion.

There is a lot of room for improvement in the way the RDKit handles isotopes. (I'm being polite to myself). When I have a free day for RDKit backend work, I need to go back and re-examine the way this is done.

> For those who haven't reviewed the code, RDKit turns [Si] into an Atom instance with a mass of 28.086, that being the abundance-weighted average mass of silicon.

correct.

> To generate the isomeric SMILES, RDKit looks at the mass. If it differs by more than 0.1 amu from the integral atomic mass (28 in this case) then it puts in the atomic mass. Otherwise it omits the isotope. Thus, since
>
>   |28.086 - 28| < 0.1
>
>   Input: [Si] gives Output: [Si]
>
> Suppose I have isotopically pure silicon [28Si]. RDKit turns this into an Atom with mass 28.0. If I generate the isomeric SMILES I get that
>
>   |28.0 - 28| < 0.1
>
> which means no isotope number will be displayed in the output, so
>
>   Input: [28Si] gives Output: [Si]
>
> I tested this with PubChem compound CID 21732668. It has an isomeric SMILES of
>
>   F[28Si](F)(F)P([28Si](F)(F)F)[28Si](F)(F)F
>
> RDKit converts that into an isomeric SMILES of
>
>   F[Si](F)(F)P([Si](F)(F)F)[Si](F)(F)F
>
> In other words, the generated SMILES is no longer isotopically pure. I believe this is wrong.

You will get no argument from me. It's wrong.

> As it stands, the only way to tell if a given atom is supposed to be isotopically pure is to see if
>
>   atom.GetMass() == int(atom.GetMass())
>
> This will only fail for Tc, Pm, Po, At, and the other elements which have only very unstable isotopes, and hence where the idea of average abundance makes no sense.
>
> So for purposes of the first bit in the MACCS definition, I propose using something like:
>
>   def has_specified_isotope(mol):
>       for atom in mol.GetAtoms():
>           mass = atom.GetMass()
>           if mass == int(mass):
>               return True
>       return False
>
> BTW, checking out of curiosity, I see that elements 106 (Sg) and higher have an isotopic mass defect which is greater than 0.1 amu. If RDKit supported Sg then it would always turn
>
>   Input: [Sg] into Output: [106Sg]
>
> when making the isomeric SMILES.
>
>   http://en.wikipedia.org/wiki/Isotopes_of_seaborgium
>   http://en.wikipedia.org/wiki/Seaborgium
>
> PubChem does not have any of the reported Sg-containing molecules. In fact:
>
>   Failed to decode the following as a Molecular Formula or a CID: SgO3
>
> It seems that no molecule containing Sg is in PubChem.
>
>> We can agree to change it, but it's certainly consistent with what Daylight says in the theory manual.
>
> The problem above arises because RDKit uses an average mass when no mass is specified. The object model in the manual only allows integer masses, and the Daylight API agrees with that. I therefore don't see how RDKit's behavior is consistent.

It's consistent to within roundoff error if you specify an isotope. The theory manual says if you don't specify anything, it's unspecified mass. I interpreted that to mean average atomic mass.

-greg
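The 0.1 amu rule discussed in this message can be written down as a short Python sketch. This is a model of the behavior as described in the thread, not RDKit's actual code; the function name and the tolerance parameter are hypothetical, and the stored mass of 29.0 for [29Si] is an assumption extrapolated from the [28Si] case.

```python
def needs_isotope_label(mass, nominal_mass, tolerance=0.1):
    """Return True if the isotope would be written under the rule above.

    mass: the mass stored on the atom.
    nominal_mass: the integral atomic mass of the element (28 for Si).
    """
    return abs(mass - nominal_mass) > tolerance


# Per the thread, [Si] is stored with the average mass and [28Si] with 28.0.
# The 29.0 entry for [29Si] is an assumed value for illustration.
for label, mass in [("[Si]", 28.086), ("[28Si]", 28.0), ("[29Si]", 29.0)]:
    print(label, needs_isotope_label(mass, 28))
```

Both [Si] and [28Si] fall inside the tolerance, so under this rule the two inputs cannot be told apart on output, which is the loss of isotopic purity described above.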
[Rdkit-discuss] MACCS SMARTS pattern definitions
RDKit implements the MACCS keys as a set of SMARTS patterns, plus a few bits coded by hand. I don't know how much people know about the impact of this on the other free software projects: OpenBabel and CDK both use copies of the RDKit definitions for their own MACCS keys. While I've seen earlier internal definitions, they were held rather closely, so it's very nice to have a public definition.

I'm reviewing the definitions as part of my chemfp project, which is one of the advantages of having an open definition. I've got some questions or suggestions about them. For reference, see

  http://rdkit.org/Python_Docs/rdkit.Chem.MACCSkeys-pysrc.html

* Bit 1 is

    1:('?',0), # ISOTOPE

  This explicitly isn't defined, but shouldn't it be [!0*] ? I tried out that SMARTS and I see that a SMILES of C[14CH3] has two matches in RDKit to [!0*] but in OEChem there's only one. I think the OEChem version is correct. I verified it at http://www.daylight.com/daycgi_tutorials/depictmatch.cgi with the SMILES of C[14CH2][13CH3] and SMARTS of [!0*]. Daylight matches 2 of the 3 atoms. I think this is a bug in RDKit, and once fixed it would mean this bit could be supported.

* Bit 2 is

    #2:('[#103,#104,#105,#106,#107,#106,#109,#110,#111,#112]',0), # ISOTOPE Not complete
    2:('[#103,#104]',0), # ISOTOPE Not complete

  I assume the comment is wrong, since this has nothing to do with isotopes. What's not complete about this definition, and/or why is the first one commented out?

* "*NOTE* spec wrong" occurs on many lines. What does it mean?

* Bit 3 is

    3:('[Ge,As,Se,Sn,Sb,Te,Tl,Pb,Bi]',0), # Group IVa,Va,VIa Periods 4-6 (Ge...) *NOTE* spec wrong

  The Tl doesn't look right. Shouldn't the last three be Pb,Bi,Po ?

* Bit 18 is

    18:('[B,Al,Ga,In,Tl]',0), # Group IIIA (B...) *NOTE* spec wrong

  Boron may be aromatic according to the SMILES spec, so this should be [B,b, ...] or [#5, ... ].

  Also, here are the aromatic elements in OpenBabel:

    [se] [as] [si] [ge] [sb] [bi] [te] [sn]

  Not all of these are valid SMARTS according to Daylight, and RDKit doesn't support the same set of aromatics, so for a portable version (which I'm working on) they can be written as [#34], [#33], [#14], ...

  Oh, and aromatic lead has been synthesized: http://www.rsc.org/chemistryworld/News/2010/April/15041002.asp

* Bit 44 is

    44:('?',0), # OTHER

  Is this one of the undocumented bits or does OTHER mean something else?

* Bit 68 says "FIX: incomplete definition". Are there thoughts to complete this? My thought is that it isn't important one way or the other. Without a good validation set it would be hard to really pin this down.

  There are a number of other bits which are also marked "FIX: incomplete definition". Are they going to be fixed? Again, I don't think there's a pressing need without validation data.

* Bit 101 says: "8M Ring or larger. This only handles up to ring sizes of 14". Is it worthwhile to support larger rings? I don't think so. If yes, then it could be dealt with outside of the SMARTS, just like bits 125 and 166.

BTW, I also verified that all of the CH2 atoms were written as either [CH2] (if there are two bonds to other atoms) or [C;H2,H3] if there is only one bond (and similarly with [NH2]). While strange chemistries can cause this to fail as a substructure filter, I recognize that that is outside the scope of those definitions.

Cheers,

Andrew
da...@dalkescientific.com
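The portable atomic-number form suggested above for bits 3 and 18 can be generated mechanically. The following Python sketch assumes a small hard-coded table of atomic numbers; the helper name portable_smarts is hypothetical and is not part of chemfp or RDKit.

```python
# Atomic numbers for the elements named in this thread. The table is
# hard-coded for the sketch; a real implementation would more likely use
# a toolkit's periodic table.
ATOMIC_NUMBERS = {
    "B": 5, "Al": 13, "Si": 14, "Ga": 31, "Ge": 32, "As": 33, "Se": 34,
    "In": 49, "Sn": 50, "Sb": 51, "Te": 52, "Tl": 81, "Pb": 82, "Bi": 83,
    "Po": 84,
}


def portable_smarts(symbols):
    """Build an atom SMARTS such as [#5,#13] from element symbols.

    The [#n] form matches an element whether it is aliphatic or aromatic,
    so it avoids toolkit differences over which aromatic symbols are valid.
    """
    return "[" + ",".join(f"#{ATOMIC_NUMBERS[s]}" for s in symbols) + "]"


# Bit 18 (Group IIIA) in atomic-number form.
print(portable_smarts(["B", "Al", "Ga", "In", "Tl"]))  # [#5,#13,#31,#49,#81]
```

Written this way, bit 18 covers aromatic boron without needing a separate lowercase symbol.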
Re: [Rdkit-discuss] MACCS SMARTS pattern definitions
Hi Andrew,

I'm going to divide this into pieces in order to be able to answer in a reasonable amount of time. I'll do clarifying questions and quick answers in this one.

On Thu, May 26, 2011 at 4:02 PM, Andrew Dalke da...@dalkescientific.com wrote:
> RDKit implements the MACCS keys as a set of SMARTS patterns, plus a few bits coded by hand. I don't know how much people know about the impact of this on the other free software projects: OpenBabel and CDK both use copies of the RDKit definitions for their own MACCS keys. While I've seen earlier internal definitions, they were held rather closely, so it's very nice to have a public definition.
>
> I'm reviewing the definitions as part of my chemfp project, which is one of the advantages of having an open definition. I've got some questions or suggestions about them. For reference, see
>
>   http://rdkit.org/Python_Docs/rdkit.Chem.MACCSkeys-pysrc.html
>
> * Bit 1 is
>
>     1:('?',0), # ISOTOPE
>
>   This explicitly isn't defined, but shouldn't it be [!0*] ? I tried out that SMARTS and I see that a SMILES of C[14CH3] has two matches in RDKit to [!0*] but in OEChem there's only one. I think the OEChem version is correct. I verified it at http://www.daylight.com/daycgi_tutorials/depictmatch.cgi with the SMILES of C[14CH2][13CH3] and SMARTS of [!0*]. Daylight matches 2 of the 3 atoms. I think this is a bug in RDKit, and once fixed it would mean this bit could be supported.

My reading of the SMARTS theory manual (http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html) says that [0*] means any atom with a mass of 0, so [!0*] would be any atom that doesn't have a mass of 0. What am I missing?

> * Bit 2 is
>
>     #2:('[#103,#104,#105,#106,#107,#106,#109,#110,#111,#112]',0), # ISOTOPE Not complete
>     2:('[#103,#104]',0), # ISOTOPE Not complete
>
>   I assume the comment is wrong, since this has nothing to do with isotopes. What's not complete about this definition, and/or why is the first one commented out?

I've got to see if I can find a description of the bits and I'll come back to these definition questions.

>     18:('[B,Al,Ga,In,Tl]',0), # Group IIIA (B...) *NOTE* spec wrong
>
>   Boron may be aromatic according to the SMILES spec, so this should be [B,b, ...] or [#5, ... ].
>
>   Also, here are the aromatic elements in OpenBabel:
>
>     [se] [as] [si] [ge] [sb] [bi] [te] [sn]
>
>   Not all of these are valid SMARTS according to Daylight, and RDKit doesn't support the same set of aromatics, so for a portable version (which I'm working on) they can be written as [#34], [#33], [#14], ...
>
>   Oh, and aromatic lead has been synthesized: http://www.rsc.org/chemistryworld/News/2010/April/15041002.asp

Agreed that using the generic atomic-number form makes a lot more sense.

> * Bit 101 says: "8M Ring or larger. This only handles up to ring sizes of 14". Is it worthwhile to support larger rings? I don't think so. If yes, then it could be dealt with outside of the SMARTS, just like bits 125 and 166.

Agreed that it's not really necessary to support larger rings. Systems with rings larger than 14 would end up missing a single bit.

> BTW, I also verified that all of the CH2 atoms were written as either [CH2] (if there are two bonds to other atoms) or [C;H2,H3] if there is only one bond (and similarly with [NH2]). While strange chemistries can cause this to fail as a substructure filter, I recognize that that is outside the scope of those definitions.

it certainly is for me. :-)

-greg
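Handling bit 101 outside of the SMARTS, as suggested in the thread, could look something like the Python sketch below. It works on a plain list of ring sizes so that it stays toolkit-independent; the function name is hypothetical, and how the ring sizes are obtained is left as an assumption (with RDKit they might be derived from the lengths of the tuples returned by mol.GetRingInfo().AtomRings()).

```python
def has_large_ring(ring_sizes, min_size=8):
    """Return True if any ring has at least min_size members.

    ring_sizes: an iterable of integers, one per ring in the molecule.
    Unlike a SMARTS that enumerates ring sizes 8 through 14, this check
    has no upper limit on the ring size.
    """
    return any(size >= min_size for size in ring_sizes)


print(has_large_ring([6, 6]))    # False: two six-membered rings
print(has_large_ring([5, 8]))    # True: one eight-membered ring
print(has_large_ring([16]))      # True: larger than the SMARTS limit of 14
```

The last case is the kind of system that, per the discussion above, would be missing this single bit under the SMARTS-only definition.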
Re: [Rdkit-discuss] MACCS SMARTS pattern definitions
Hi Andrew,

Second part of my response.

On Thu, May 26, 2011 at 4:02 PM, Andrew Dalke da...@dalkescientific.com wrote:
> * Bit 2 is
>
>     #2:('[#103,#104,#105,#106,#107,#106,#109,#110,#111,#112]',0), # ISOTOPE Not complete
>     2:('[#103,#104]',0), # ISOTOPE Not complete
>
>   I assume the comment is wrong, since this has nothing to do with isotopes. What's not complete about this definition, and/or why is the first one commented out?

You're right, the comment is wrong. The definition is also not correct: the key should be atomic num > 103. The reason the more complete defn is commented out is that the RDKit periodic table data only go up to #104. I added a comment to that effect.

> * "*NOTE* spec wrong" occurs on many lines. What does it mean?

I'm afraid that's lost in the sands of time. I will remove them.

> * Bit 3 is
>
>     3:('[Ge,As,Se,Sn,Sb,Te,Tl,Pb,Bi]',0), # Group IVa,Va,VIa Periods 4-6 (Ge...) *NOTE* spec wrong
>
>   The Tl doesn't look right. Shouldn't the last three be Pb,Bi,Po ?

Yep.

> * Bit 18 is
>
>     18:('[B,Al,Ga,In,Tl]',0), # Group IIIA (B...) *NOTE* spec wrong
>
>   Boron may be aromatic according to the SMILES spec, so this should be [B,b, ...] or [#5, ... ].

Fixed this.

> * Bit 44 is
>
>     44:('?',0), # OTHER
>
>   Is this one of the undocumented bits or does OTHER mean something else?

It's undocumented.

> * Bit 68 says "FIX: incomplete definition". Are there thoughts to complete this?

This is one where the spec is incomplete: it includes the amazingly helpful "(...)" at the end.

> My thought is that it isn't important one way or the other. Without a good validation set it would be hard to really pin this down.

Agreed.

> There are a number of other bits which are also marked "FIX: incomplete definition". Are they going to be fixed? Again, I don't think there's a pressing need without validation data.

Those also have "(...)". I've updated the comment to make clear that it's due to an incomplete spec.

I just checked in a set of changes reflecting the above.

-greg
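The trade-off described for bit 2 (a key meant to cover heavy elements, limited by how far the toolkit's periodic table goes) can be sketched in Python as follows. The sketch reads the key as "atomic number greater than 103", which is an interpretation of the reply above; the function name and the max_supported_z parameter are hypothetical.

```python
def heavy_element_smarts(max_supported_z, min_z=104):
    """Build a SMARTS for atomic numbers from min_z up to max_supported_z.

    max_supported_z: the highest atomic number the toolkit's periodic
    table knows about (104 for RDKit at the time, per the thread).
    """
    if max_supported_z < min_z:
        raise ValueError("toolkit does not support any element in range")
    numbers = range(min_z, max_supported_z + 1)
    return "[" + ",".join(f"#{z}" for z in numbers) + "]"


print(heavy_element_smarts(104))   # [#104]
print(heavy_element_smarts(112))   # [#104,#105,#106,#107,#108,#109,#110,#111,#112]
```

Generating the list from a range also avoids hand-typed slips such as the repeated #106 in the commented-out definition.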
Re: [Rdkit-discuss] MACCS SMARTS pattern definitions
And now a more philosophical point about this.

On Thu, May 26, 2011 at 4:02 PM, Andrew Dalke da...@dalkescientific.com wrote:
> RDKit implements the MACCS keys as a set of SMARTS patterns, plus a few bits coded by hand. I don't know how much people know about the impact of this on the other free software projects: OpenBabel and CDK both use copies of the RDKit definitions for their own MACCS keys. While I've seen earlier internal definitions, they were held rather closely, so it's very nice to have a public definition.
>
> I'm reviewing the definitions as part of my chemfp project, which is one of the advantages of having an open definition.

It seems like it would make a lot more sense for all of us if we had a truly open definition. We'll never get that with MACCS keys because there's no true public definition of the so-called public keys (at least not that I know of).

The idea of the MACCS keys is simple: a limited set of structural keys that can be used to speed up substructure searches and which have since been (ab)used for chemical similarity. It seems like it would be a lot more helpful to the community if we had a set of keys like this that is based on a truly open definition.

Given that we have the MACCS and PubChem (ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem_fingerprints.txt) keys as templates, that there are ample publications in this space (including from MDL: http://pubs.acs.org/doi/abs/10.1021/ci010132r), and that Andrew has kind of already started working on this (info in this thread: http://www.mail-archive.com/rdkit-discuss@lists.sourceforge.net/msg00402.html), it seems like it shouldn't be all that much work.

What do you think Andrew? Want to work together on this?

-greg