This is just a quick summary of the technical problems I've discovered. I'm putting them here so that if I've overlooked something, or anyone has insights that I've missed, we can discuss them.
1. The "D" or "degree" in SMARTS

[#6D2] is supposed to mean a carbon atom with two bonds. However, it treats explicit hydrogens as a bond, so in the SMILES c1[nH]ccc1, the carbons have degree 2, and the nitrogen degree 3, even though in reality they all have three bonds. Worse, in [cH]1[nH][cH][cH][cH]1, none of the atoms are considered degree 2, so [#6D2] doesn't match, even though it's the exact same molecule as c1[nH]ccc1.

This makes it very hard to write sensible SMARTS that define aromaticity. I'm planning to abandon the "D" notation, and instead use pairs of rules like this:

    [#6r](~...@*)(~...@*)~*    1 1
    [#6rH1](~...@*)~...@*      1 1

For OpenBabel 3, we should clarify the definition of "D" in SMARTS so that it either does or does not include bonds to H atoms, regardless of whether the H are represented implicitly or explicitly in the internal C++ structures.

2. aromaticity.txt

This is a mess. Because of other flaws in the system, Geoff (I think it was Geoff) was forced to add rules like these:

    [#7rD2]                1 2
    [#7rD3]                1 2
    [#7r](-...@*)-...@*    1 2

It fixed a bug (an aromatic ring that was missed), but it is absurd. It's saying nitrogen with two bonds (single and/or double), OR with three bonds (single or double), OR with two single bonds (regardless of overall valence), can ALL deliver either one or two electrons to the aromatic system!

Unfortunately, without these rules, OpenBabel misses some important aromaticity cases, but with these rules, it thinks some things are aromatic that plainly are not, like this:

    echo "c1[nH2]ccc1" | babel -i smi -o smi
    c1[nH2]ccc1
    1 molecule converted

A lot of this is workarounds for the other problems I'm outlining here; you just can't write sensible SMARTS that actually work.

3. Inexact electron contribution

In aromatic.txt, each SMARTS pattern has a *range* of electrons it can contribute, as in the nitrogen examples in #2 above. This just seems wrong to me. For example, a nitrogen with three bonds ALWAYS contributes two electrons.
If there is a tautomeric situation, like c[nH]c(=O)c, then you simply need to write a pair of SMARTS accordingly, so that your rule applies to that exact situation, like this:

    [#7rH](~...@*)[#6r](=O)           2
    [#6r](=O)(~...@[#7rh])~...@*      0

I've asked around here, and none of the chemists can think of an example where the number of electrons contributed by an atom is ambiguous. If you think, "it's 1 or 2", then you haven't defined the SMARTS well enough.

So I propose to discard the range rules from aromatic.txt, and have each SMARTS specify a single number, the specific number of electrons contributed by that atom.

4. typer.cpp

To fix the problems in aromaticity.txt, there are a number of functions in typer.cpp that try to identify the silly cases and remove them. This means that, when looking at the rules in aromaticity.txt, you can't tell for sure how or when they'll apply. You can carefully craft a SMARTS that you think will fix some problem, and it has no effect, because some code in typer.cpp rejects that case.

I believe (but am not positive) that SMARTS should be sufficient to define all potentially aromatic atoms. We shouldn't need any special-case code. The SMARTS will define the electron count of each atom, then we apply Hueckel's 4n+2 rule, and that's it.

5. Implicit/explicit valence

This is something we've discussed before. The original design of Babel didn't define these terms carefully, and the result is that they're used inconsistently.

6. Implicit/explicit hydrogens

Like valence, the policy for implicit/explicit H wasn't defined clearly. It should be the case that there is no difference; it's irrelevant in ALL cases whether an H atom is an explicit C++ object rather than an H count on the heavy atom.

I can't fix #5 and #6, but luckily, the SDF and SMILES parsers are pretty consistent, so for my purposes (cheminformatics), I should be able to get a clear idea what H-count and valence actually mean.
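To make the proposal in #3 and #4 concrete, here is a minimal sketch in Python (not OpenBabel's C++ and not taken from typer.cpp). It assumes each ring atom has already been matched by exactly one SMARTS rule that assigns it a single electron count, and then applies the 4n+2 test to the ring total. The function name is_huckel_aromatic and the hard-coded per-atom counts are illustrative assumptions, not existing OpenBabel names or data.

```python
def is_huckel_aromatic(electron_counts):
    """Return True if the ring's pi-electron total equals 4n+2 for some n >= 0.

    electron_counts holds one exact integer per ring atom, i.e. the single
    value a rule would assign under the "no ranges" proposal.
    """
    total = sum(electron_counts)
    return total >= 2 and (total - 2) % 4 == 0


# Pyrrole, c1[nH]ccc1: each carbon contributes 1, the [nH] contributes 2.
pyrrole = [1, 2, 1, 1, 1]

# Benzene, c1ccccc1: six carbons contributing 1 each.
benzene = [1, 1, 1, 1, 1, 1]

# Cyclobutadiene: four carbons contributing 1 each (4 electrons, not 4n+2).
cyclobutadiene = [1, 1, 1, 1]

print(is_huckel_aromatic(pyrrole))         # True  (6 electrons)
print(is_huckel_aromatic(benzene))         # True  (6 electrons)
print(is_huckel_aromatic(cyclobutadiene))  # False (4 electrons)
```

With one number per atom, the ring test is a single sum and a modulus; there is no search over combinations of "1 or 2" choices, so there is nothing left for special-case code to clean up afterwards.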
My plan is to more-or-less rewrite the aromaticity code in typer.cpp from top to bottom, and rewrite all of the rules in aromatic.txt. There doesn't seem to be anything worth saving.

I've already made significant progress on kekule.cpp: It can now assign single/double bonds to C60 fullerene in less than 400 msec. The problems that remain are related to the SMARTS situation outlined above.

Craig