On Dec 10, 2009, at 5:56 PM, Peter Murray-Rust wrote:
> The SMILES spec has been published though there are some gray areas - like
> how aromaticity and tautomers are calculated. The answer here is "whatever
> the current commercial program emits".
That is not correct. SMILES and the Daylight toolkit do not handle tautomers at
all upon input or normal processing. Hydrogens are presumed to be exactly as
they are specified on input.
A SMILES parser may upon input change the perceived aromaticity under the
presumption of a certain chemistry model, but that is post data exchange.
That is, "C1=CC=CC=C1" is different from "c1ccccc1" until the application of a
chemistry model. For that matter, "C1.C1" is different from "CC" even though
they are representations of the same molecule.
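
A minimal sketch of that distinction, in Python. The function name
parse_tiny_smiles and the tiny supported subset are assumptions made for
illustration only; this is not how Daylight or any other toolkit parses SMILES.
It reads a string exactly as written and applies no aromaticity perception:

```python
def parse_tiny_smiles(smiles):
    """Parse a tiny SMILES subset into (atoms, bonds), exactly as written.

    Supported: the atoms C, N, O (upper case) and c, n, o (lower case),
    the bond symbols - = # :, single-digit ring closures, and '.'.
    No branches, brackets, charges, or stereochemistry.
    """
    atoms = []        # list of (element, written_as_aromatic)
    bonds = set()     # set of (low_index, high_index, bond_symbol)
    prev = None       # index of the atom the next atom bonds to
    pending = None    # explicit bond symbol waiting for its second atom
    open_rings = {}   # ring-closure digit -> (atom index, bond symbol)

    def add_bond(a, b, symbol):
        if symbol is None:
            # An unwritten bond is aromatic between two lower-case atoms,
            # otherwise single.
            symbol = ":" if atoms[a][1] and atoms[b][1] else "-"
        bonds.add((min(a, b), max(a, b), symbol))

    for ch in smiles:
        if ch in "CNOcno":
            atoms.append((ch.upper(), ch.islower()))
            idx = len(atoms) - 1
            if prev is not None:
                add_bond(prev, idx, pending)
            prev, pending = idx, None
        elif ch in "-=#:":
            pending = ch
        elif ch.isdigit():
            if prev is None:
                raise ValueError("ring closure digit without an atom")
            if ch in open_rings:
                other, symbol = open_rings.pop(ch)
                add_bond(other, prev, pending or symbol)
            else:
                open_rings[ch] = (prev, pending)
            pending = None
        elif ch == ".":
            prev, pending = None, None
        else:
            raise ValueError("unsupported character: %r" % ch)
    if open_rings:
        raise ValueError("unclosed ring closure")
    return atoms, bonds


# Different strings, same graph once the ring-closure digits are resolved.
same = parse_tiny_smiles("C1.C1") == parse_tiny_smiles("CC")

# Kekule and aromatic benzene stay different graphs, because no chemistry
# model has been applied to either of them.
differ = parse_tiny_smiles("C1=CC=CC=C1") != parse_tiny_smiles("c1ccccc1")
```

Both `same` and `differ` come out true: resolving syntax is part of reading the
format, while deciding that the two benzene forms are equivalent needs a
chemistry model on top.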
> That makes the definition of aromaticity and tautomerism not open - it can
> only be determined by reading the code and that is not universally available.
But the definitions of aromaticity are also not universally agreed upon.
OpenBabel, CDK, Daylight (including different versions of Daylight), OpenEye,
and every other toolkit each have a different chemistry model. OpenEye even
implements several different models, to handle different ideas of chemistry.
I believe that CDK, RDKit, and OpenBabel and a few other toolkits are wrong, in
that they attempt to enforce a chemistry model upon input. This is what the
original SMILES paper stated, and the Daylight toolkit implements. I believe
OpenEye is one of the few toolkits to handle it correctly, which is to say that
SMILES is a representation for a certain valence-bond model of chemistry, and a
way to exchange information about that valence-bond model. Chemistry
perception, canonicalization, tautomer perception, etc. are downstream
questions which are not a part of the format specification.
> CanonicalSMILES is a major problem (if it wasn't then I suspect InChI would
> not have been developed).
That is possibly correct, but InChI has different goals:
- a multiple level scheme, which allows text searches
even for the molecular formula (not possible with SMILES)
- automatic identification and removal of salts
- formalization of a standard tautomer form (not part of SMILES)
- no need for atom-level charge assignment (not part of the chemistry
model used in SMILES)
- support for extensions to include new data values (eg, coordinates)
SMILES does not handle any of these, and the existence of a published form of
the Daylight canonicalization algorithm would not change that. In addition, the
authors of InChI used the algorithm of ... (I forget the name; a researcher in
either Australia or New Zealand) to generate the canonical form, and assert
that it is better than the Morgan-derived algorithm used by Daylight.
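
As a rough illustration of the layered text search mentioned in the list above,
here is a short Python sketch. The helper name split_inchi is hypothetical, and
the code assumes a simple identifier whose first layer after the version is the
molecular formula; it is plain string handling, not the InChI software:

```python
def split_inchi(inchi):
    """Split an InChI string into its version, formula, and remaining layers.

    Assumes the first layer after the version is the molecular formula,
    which holds for the simple examples below but not for every InChI.
    """
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")
    version, formula, *rest = inchi[len(prefix):].split("/")
    # Each remaining layer starts with a one-letter tag, e.g. 'c' or 'h'.
    layers = [(layer[0], layer[1:]) for layer in rest if layer]
    return {"version": version, "formula": formula, "layers": layers}


records = [
    "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",   # ethanol
    "InChI=1S/C2H6O/c1-3-2/h1-2H3",        # dimethyl ether
    "InChI=1S/CH4O/c1-2/h2H,1H3",          # methanol
]

# A formula search is a text comparison on one layer.
c2h6o = [r for r in records if split_inchi(r)["formula"] == "C2H6O"]
```

Here `c2h6o` holds the ethanol and dimethyl ether records. A SMILES string has
no separate formula field to compare against, so the same search would need a
parser first.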
In addition, InChI is *NOT* meant as a structure exchange format. It is a
unique identifier for a structure. The ability to pass an InChI as a structure
is a side-effect of the software implementation and not a primary goal.
That is, I presume, one reason why the mechanism for producing an InChI is not
documented, outside of the source code itself.
> CanonicalSMILES is therefore only usable by the small subet of Daylight
> subscribers.
And irrelevant for the point of data exchange.
> I do not know what Andrew's problem is with CML. It's published.
My problem is that there is no clear license for it.
> The schema is on Sourceforge. The JUMBO code which is meant to be a reference
> is Open.
The license for JUMBO is given as "Artistic License" when it's supposed to be
"2.0". It's not clear from reading the 2.0 license that it is compatible with
the LGPL. The README for JUMBO also says that the "CML Schema is distributed
under a Creative Commons license, allowing redistribution but NOT derivative
works", but there are two different CC licenses which meet that criterion.
Plus, to find that license for CML took a very involved search, since it's not
clear why someone like me, a Python programmer, should download a Java package
in order to find the license for an XSD file.
Cheers,
Andrew
[email protected]
_______________________________________________
Blueobelisk-discuss mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/blueobelisk-discuss