This is just a quick summary of the technical problems I've discovered.  I'm 
putting them here so that if I've overlooked something, or anyone has insights 
that I've missed, we can discuss them.

1. The "D" or "degree" in SMARTS

[#6D2] is supposed to mean a carbon atom with two bonds.  However, it treats 
explicit hydrogens as a bond, so in the SMILES c1[nH]ccc1, the carbons have 
degree 2, and the nitrogen degree 3, even though in reality they all have three 
bonds.  Worse, in [cH]1[nH][cH][cH][cH]1, none of the atoms are considered 
degree 2 so [#6D2] doesn't match, even though it's the exact same molecule as 
c1[nH]ccc1.

This makes it very hard to write sensible SMARTS that define aromaticity.  I'm 
planning to abandon the "D" notation, and instead use pairs of rules like this:

  [#6r](~...@*)(~...@*)~*       1       1
  [#6rH1](~...@*)~...@*         1       1

For OpenBabel 3, we should clarify the definition of "D" in SMARTS so that it 
either does or does not include bonds to H atoms, regardless of whether the H 
are represented implicitly or explicitly in the internal C++ structures.
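
To make the ambiguity concrete, here is a minimal Python sketch.  It is not 
OpenBabel code: the ToyAtom class and all three degree functions are 
hypothetical, and degree_as_described only encodes the behavior reported above 
(hydrogens written in brackets count toward D, implicit ones do not).  The 
other two functions are the two consistent definitions we could choose between.

```python
from dataclasses import dataclass


@dataclass
class ToyAtom:
    """Hypothetical stand-in for an atom; not the OpenBabel data model."""
    symbol: str
    heavy_neighbors: int   # bonds to non-hydrogen atoms
    bracket_h: int = 0     # hydrogens written out, e.g. [nH] or [cH]
    implicit_h: int = 0    # hydrogens inferred from valence


def degree_as_described(atom):
    # Behavior reported in this post: only written-out hydrogens count.
    return atom.heavy_neighbors + atom.bracket_h


def degree_heavy_only(atom):
    # Consistent option A: D never counts hydrogens.
    return atom.heavy_neighbors


def degree_all_bonds(atom):
    # Consistent option B: D always counts hydrogens, however stored.
    return atom.heavy_neighbors + atom.bracket_h + atom.implicit_h


# Pyrrole as c1[nH]ccc1: carbon hydrogens are implicit, the N-H is written.
pyrrole_a = [
    ToyAtom("C", 2, implicit_h=1),
    ToyAtom("N", 2, bracket_h=1),
    ToyAtom("C", 2, implicit_h=1),
    ToyAtom("C", 2, implicit_h=1),
    ToyAtom("C", 2, implicit_h=1),
]

# The same molecule as [cH]1[nH][cH][cH][cH]1: every hydrogen is written.
pyrrole_b = [
    ToyAtom("C", 2, bracket_h=1),
    ToyAtom("N", 2, bracket_h=1),
    ToyAtom("C", 2, bracket_h=1),
    ToyAtom("C", 2, bracket_h=1),
    ToyAtom("C", 2, bracket_h=1),
]

for name, fn in [("as described", degree_as_described),
                 ("heavy only", degree_heavy_only),
                 ("all bonds", degree_all_bonds)]:
    print(name, [fn(a) for a in pyrrole_a], [fn(a) for a in pyrrole_b])
```

Only the first function gives different answers for the two spellings of the 
same molecule; either of the other two would make "D" independent of how the 
hydrogens happen to be stored.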

2. aromatic.txt

This is a mess.  Because of other flaws in the system, Geoff (I think it was 
Geoff) was forced to add rules like these:

  [#7rD2]                 1       2
  [#7rD3]                 1       2
  [#7r](-...@*)-...@*           1       2

It fixed a bug (an aromatic ring that was missed), but it's absurd.  It's 
saying nitrogen with two bonds (single and/or double), OR with three bonds 
(single or double), OR with two single bonds (regardless of overall valence), 
can ALL deliver either one or two electrons to the aromatic system!  
Unfortunately, without these rules, OpenBabel misses some important aromaticity 
cases, but with these rules, it thinks some things are aromatic that plainly 
are not, like this:

  echo "c1[nH2]ccc1" | babel -i smi -o smi
  c1[nH2]ccc1
  1 molecule converted

A lot of this is workarounds for the other problems I'm outlining here; you 
just can't write sensible SMARTS that actually work.
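
Here is a rough Python sketch of why a (1, 2) range lets almost any ring 
nitrogen through.  It makes an assumption about the typer that I have not 
spelled out above: that the per-atom ranges are summed around the ring and the 
ring is accepted if ANY total inside the summed range fits 4n+2.  The function 
name is made up, and no SMARTS matching is done; the ranges are typed in by 
hand as if the rules had already matched.

```python
def huckel_totals(ranges):
    """Return every total inside the summed (low, high) ranges that is 4n+2.

    Assumption for illustration only: a ring is accepted when this list is
    non-empty.  This is not a copy of the logic in typer.cpp.
    """
    low = sum(lo for lo, _ in ranges)
    high = sum(hi for _, hi in ranges)
    return [total for total in range(low, high + 1) if total % 4 == 2]


# Four ring carbons contributing exactly one electron each.
carbons = [(1, 1)] * 4

# A ring nitrogen that matched any of the three rules above gets (1, 2),
# whether it was written as [nH] or as the nonsensical [nH2].
ring_with_range_rule = carbons + [(1, 2)]

# The same ring if the nitrogen rule were pinned to one electron.
ring_with_one_electron_n = carbons + [(1, 1)]

print(huckel_totals(ring_with_range_rule))
print(huckel_totals(ring_with_one_electron_n))
```

Under that assumption the range rule always finds the total of 6 for a 
five-membered ring with one nitrogen, which is why the rules cannot tell a 
real pyrrole nitrogen from [nH2].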

3. Inexact electron contribution

In aromatic.txt, each SMARTS pattern has a *range* of electrons it can 
contribute, as in the nitrogen examples in #2 above.  This just seems wrong to 
me.

For example, a nitrogen with three bonds ALWAYS contributes two electrons.  If 
there is a tautomeric situation, like c[nH]c(=O)c, then you simply need to 
write a pair of SMARTS accordingly, so that your rule applies to that exact 
situation, like this:

  [#7rH](~...@*)[#6r](=O)     2
  [#6r](=O)(~...@[#7rh])~...@*   0

I've asked around here, and none of the chemists can think of an example where 
the number of electrons contributed by an atom is ambiguous.  If you think, 
"it's 1 or 2", then you haven't defined the SMARTS well enough.

So I propose to discard the range rules from aromatic.txt, and have each SMARTS 
specify a single number, the specific number of electrons contributed by that 
atom.
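
As a sketch of what the stricter format could look like in practice, the 
Python below reads rule lines in the layout used in this post (a pattern 
followed by one or two integers) and separates exact rules from range rules.  
The function names are hypothetical, this is not the parser OpenBabel uses, and 
the patterns are copied verbatim from this message and treated as opaque text.

```python
def parse_rule(line):
    """Split one rule line into (pattern, low, high).

    A line with a single number is read as an exact count (low == high).
    """
    fields = line.split()
    if len(fields) not in (2, 3):
        raise ValueError(f"unrecognized rule line: {line!r}")
    pattern = fields[0]
    low = int(fields[1])
    high = int(fields[-1])
    return pattern, low, high


def split_rules(lines):
    """Return (exact, ranged) lists of parsed rules, skipping blank lines."""
    exact, ranged = [], []
    for line in lines:
        if not line.strip():
            continue
        pattern, low, high = parse_rule(line)
        (exact if low == high else ranged).append((pattern, low, high))
    return exact, ranged


# Rule lines copied from this post; the patterns are never interpreted here.
RULES = [
    "[#7rD2]                 1       2",
    "[#7rD3]                 1       2",
    "[#6r](~...@*)(~...@*)~*       1       1",
    "[#7rH](~...@*)[#6r](=O)     2",
]

exact, ranged = split_rules(RULES)
for pattern, low, high in ranged:
    print("range rule to rewrite:", pattern, low, high)
```

A check like this could run over the whole rule file to list every range rule 
that still needs to be split into exact cases.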

4. typer.cpp

To fix the problems in aromatic.txt, there are a number of functions in 
typer.cpp that try to identify the silly cases and remove them.  This means 
that, when looking at the rules in aromatic.txt, you can't tell for sure how 
or when they'll apply.  You can carefully craft a SMARTS that you think will 
fix some problem, and it has no effect, because some code in typer.cpp rejects 
that case.

I believe (but am not positive) that SMARTS should be sufficient to define all 
potentially aromatic atoms.  We shouldn't need any special-case code.  The 
SMARTS will define the electron count of each atom, then we apply Hueckel's 
4n+2 rule, and that's it.
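
With exact counts, the final test is short enough to sketch in a few lines of 
Python.  This is an illustration only, not the typer.cpp code: the function 
name is made up, and the per-atom electron counts are entered by hand in place 
of real SMARTS matches.

```python
def is_huckel_ring(electron_counts):
    """True if the ring's total electron count equals 4n+2 for some n >= 0."""
    total = sum(electron_counts)
    return total >= 2 and total % 4 == 2


# Per-atom contributions, entered by hand (assumed results of rule matching).
benzene = [1, 1, 1, 1, 1, 1]       # six carbons, one electron each
pyrrole = [1, 2, 1, 1, 1]          # the three-bonded nitrogen gives two
cyclobutadiene = [1, 1, 1, 1]      # four electrons, fails 4n+2

for name, ring in [("benzene", benzene),
                   ("pyrrole", pyrrole),
                   ("cyclobutadiene", cyclobutadiene)]:
    print(name, sum(ring), is_huckel_ring(ring))
```

Because every atom has exactly one count, there is a single total per ring and 
no special-case code is needed to decide which value in a range was "meant".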

5. Implicit/explicit valence

This is something we've discussed before.  The original design of Babel didn't 
define these terms carefully, and the result is that they're used 
inconsistently.

6. Implicit/explicit hydrogens

Like valence, the policy for implicit/explicit H wasn't defined clearly.  It 
should make no difference in ANY case whether an H atom is stored as an 
explicit C++ object or as an H count on the heavy atom.


I can't fix #5 and #6, but luckily, the SDF and SMILES parsers are pretty 
consistent, so for my purposes (cheminformatics), I should be able to get a 
clear idea what H-count and valence actually mean.  My plan is to more-or-less 
rewrite the aromaticity code in typer.cpp from top to bottom, and rewrite all 
of the rules in aromatic.txt.  There doesn't seem to be anything worth saving.

I've already made significant progress on kekule.cpp: It can now assign 
single/double bonds to C60 fullerene in less than 400 msec.  The problems that 
remain are related to the SMARTS situation outlined above.

Craig

_______________________________________________
OpenBabel-Devel mailing list
OpenBabel-Devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/openbabel-devel
