Hi Noel and Geoff,

I've been investigating some of the weird SMILES strings distributed by eMolecules
that can't be read into other cheminformatics packages.  A significant fraction appear
to be molecules with nonsense formal charges on aromatic atoms, which then fail to
Kekulize because of the mismatched valence states.

Two examples include:
c1ccc2c(c1)[n+2](c(CO)c(CO)[n+]2[O-])O

and

Fc1c(F)c(F)[c+7](c(c1F)F)[Ti]1234([C+6]5C=CC=C5)([C+6]5[C+7]3=[C+7]2[C+7]1=[C+7]45)[c+7]1c(F)c(F)c(c(c1F)F)F 44386258

The second example is complete nonsense because, as explained in my "can't break the
laws of physics" blog post, a carbon can't possibly have a formal charge of +7 with only
six protons.  Given how broken these atoms are, they shouldn't be marked as aromatic.
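
As a rough illustration of the sanity check implied here, the following Python sketch
scans the bracket atoms of a SMILES string and flags any whose formal charge exceeds the
element's atomic number.  The function name impossible_charges and the small
ATOMIC_NUMBER table are made up for this example, and the regular expression only
handles simple bracket atoms like the ones above (no isotopes, chirality or atom classes).

```python
import re

# Atomic numbers for the few elements appearing in the examples above
# (an illustrative subset, not a full periodic table).
ATOMIC_NUMBER = {"C": 6, "N": 7, "O": 8, "F": 9, "Ti": 22}

# Simplified bracket atom: symbol, optional H count, optional charge.
BRACKET_ATOM = re.compile(r"\[([A-Za-z][a-z]?)(?:H\d*)?([+-]\d*)?\]")

def impossible_charges(smiles):
    """Return (symbol, charge) pairs whose charge exceeds the proton count."""
    bad = []
    for symbol, charge in BRACKET_ATOM.findall(smiles):
        element = symbol.capitalize()      # aromatic "c" -> "C"
        if element not in ATOMIC_NUMBER or not charge:
            continue
        magnitude = int(charge[1:] or 1)   # a bare "+" or "-" means 1
        value = magnitude if charge[0] == "+" else -magnitude
        if value > ATOMIC_NUMBER[element]:
            bad.append((symbol, value))
    return bad
```

This only catches the physically impossible cases: "[C+6]" passes the test even though
it is hardly more sensible than "[C+7]", so a real filter would want a tighter bound.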

Digging deeper into where these atoms might be getting classified as aromatic led me
to Open Babel's aromatic.txt, and indeed the current trunk version of Open Babel will
blindly transform "c1ccc2c(c1)[N+](=C(C(=[N+2]2O)CO)CO)[O-]" into the first string
above.

The problem appears to be that the current SMARTS patterns in aromatic.txt are too forgiving:
a pattern that doesn't mention formal charge accepts any formal charge as aromatic.  I suspect
the patterns' author may have assumed that, as in SMILES, not specifying a formal charge implies
no charge.  Indeed, the OpenSMILES specification implicitly reinforces this assumption, by
listing the SMILES but not the SMARTS.
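
To make the SMILES/SMARTS difference concrete, here is a toy Python model of a single
SMARTS atom.  It is not Open Babel code: the Atom and AtomQuery classes are hypothetical
stand-ins, and only the atomic number, ring membership, degree and charge primitives
are modelled.  The point is simply that an omitted primitive places no constraint.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Atom:
    atomic_num: int
    in_ring: bool
    degree: int
    charge: int

@dataclass
class AtomQuery:
    """Toy stand-in for a SMARTS atom such as [#7rD3] or [#7rD3+0]."""
    atomic_num: int
    degree: int
    charge: Optional[int] = None   # None models "charge not mentioned"

    def matches(self, atom):
        if atom.atomic_num != self.atomic_num or not atom.in_ring:
            return False
        if atom.degree != self.degree:
            return False
        # An omitted SMARTS primitive places no constraint at all.
        return self.charge is None or atom.charge == self.charge

loose = AtomQuery(atomic_num=7, degree=3)             # like [#7rD3]
strict = AtomQuery(atomic_num=7, degree=3, charge=0)  # like [#7rD3+0]
n_plus2 = Atom(atomic_num=7, in_ring=True, degree=3, charge=2)
```

In this model loose.matches(n_plus2) is true while strict.matches(n_plus2) is false,
which is the behaviour the tightened patterns are meant to achieve.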

The attached patch resolves the issue by tightening these SMARTS patterns.  The guiding
principle is "first do no harm": a ring system shouldn't be considered aromatic unless we
can be certain we can correctly Kekulize it back at some point in the future.  "[n+2]", if it did
exist and were allowed (it isn't in Daylight SMILES, cf. daycgi/depict), should be isoelectronic
with boron: three-valent, potentially aromatic, but contributing zero pi-electrons.
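
The electron bookkeeping behind that last sentence can be sketched in a few lines of
Python.  The helper names are invented for this example, and the "left over after sigma
bonds" count is only meaningful for three-connected ring atoms; a two-connected atom
such as a pyridine nitrogen keeps an in-plane lone pair that this arithmetic ignores.

```python
# Valence electrons of the neutral atoms (second-row elements only).
VALENCE_ELECTRONS = {"B": 3, "C": 4, "N": 5, "O": 6}

def valence_electrons(element, charge):
    """Valence electrons remaining once the formal charge is accounted for."""
    return VALENCE_ELECTRONS[element] - charge

def left_for_pi(element, charge, sigma_bonds=3):
    """Electrons left after donating one to each sigma bond."""
    return valence_electrons(element, charge) - sigma_bonds
```

With these definitions N with charge +2 has three valence electrons, the same as neutral
boron, and nothing is left for the pi system once three sigma bonds are formed.  The same
arithmetic gives one electron for a three-connected N+ and two for a neutral one.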

I've also noticed that "genheaders.sh" hasn't been run since some of the most recent changes
to the data/*.txt files, meaning some of the data/*.h files are out of sync.  If this proposed patch
gets accepted, running genheader.sh to regenerate aromatic.h would also address this.

Please let me know what you think.

Roger
--
Roger Sayle, Ph.D.
CEO and founder
NextMove Software Limited
Registered in England No. 07588305
Registered Office: Innovation Centre (Unit 23), Cambridge Science Park, Cambridge CB4 0EY

##############################################################################
#                                                                            #
#                    Open Babel file: aromatic.txt                           #
#                                                                            #
#                                                                            #
#  Copyright (c) 1998-2001 by OpenEye Scientific Software, Inc.              #
#  Some portions Copyright (c) 2001-2005 Geoffrey R. Hutchison               #
#  Part of the Open Babel package, under the GNU General Public License (GPL)#
#                                                                            #
# SMARTS patterns with minimum and maximum pi-electrons contributed to an    #
#   aromatic system (used by typer.cpp:OBAromaticTyper)                      #
# The LAST PATTERN MATCHED is used to assign values, so that patterns should #
#   be ordered from more general to more specific                            #
#                                                                            #
##############################################################################

#PATTERN                MIN     MAX

#carbon patterns
[#6rD2+0]               1       1
# exo ketone or alcohol -- don't know which
[#6rD3+0]~!@[#8]        0       1
[#6rD2+,#6rD3+]         1       1
[#6r+0]=@*              1       1
[#6rD3+0]=!@*           1       1
# external double bonds to hetero atoms contribute no electrons to the 
# aromatic systems -- quinoid systems are non-aromatic, e.g. 1,4-benzoquinone
[#6rD3+0]=!@[!#6]       0       0
[#6rD3-]                2       2

#nitrogen patterns
[#7rD2+0]               1       2
[#7rD3+0]               1       2
[#7r+0](-@*)-@*         1       2
[#7rD2+0]=@*            1       1
[#7rD3+]                1       1
[#7rD3+0]=O             1       1
[#7rD2-]                2       2

#oxygen patterns
[#8r+0]                 2       2
[#8r+]                  1       1

#sulfur patterns
[#16rD2+0]              2       2
[#16rD2+]               1       1
[#16rD3+0]=!@O          2       2

#other misc patterns
# Accounts Chem Res 1978 11 p. 153
# phosphole, phosphabenzene (not v. aromatic)
[#15rD3+0]              2       2
# selenophene
[#34rD2+0]              2       2
# arsabenzene, etc. (*really* not v. aromatic)
#[#33rD3+0]             2       2
# tellurophene, etc. (*really* not v. aromatic)
#[#52rD2+0]             2       2
# stibabenzene, etc. (very little aromatic character)
#[#51rD3+0]             2       2
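
For readers unfamiliar with the table format, the following Python sketch parses lines
of this shape and shows one plausible way the MIN/MAX columns could be used.  It is an
assumption, not a transcription of typer.cpp, that the per-atom ranges are summed around
a ring and tested against Hückel's 4n+2 rule.  The names parse_table and huckel_possible
are hypothetical, and SAMPLE is a cut-down excerpt with a made-up comment line.

```python
SAMPLE = """\
#PATTERN                MIN     MAX
[#6r+0]=@*              1       1
# quinoid carbons contribute nothing
[#6rD3+0]=!@[!#6]       0       0
[#7rD2+0]=@*            1       1
"""

def parse_table(text):
    """Return (pattern, min, max) tuples, skipping blank and comment lines."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        pattern, lo, hi = line.split()
        entries.append((pattern, int(lo), int(hi)))
    return entries

def huckel_possible(ranges):
    """True if some electron count in [sum of mins, sum of maxes] is 4n+2."""
    lo = sum(r[0] for r in ranges)
    hi = sum(r[1] for r in ranges)
    return any(count % 4 == 2 for count in range(lo, hi + 1))
```

Under that assumption six ring atoms contributing (1, 1) each pass the test, whereas a
1,4-benzoquinone-like ring with two (0, 0) carbons and four (1, 1) carbons sums to four
electrons and does not.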

Attachment: aromatic.txt.patch
Description: Binary data


_______________________________________________
OpenBabel-Devel mailing list
OpenBabel-Devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/openbabel-devel
