Re: [OpenBabel-Devel] weekend SMILES canonicalization errors

Craig A. James Mon, 11 Oct 2010 10:16:58 -0700

On 10/11/10 8:52 AM, Tim Vandermeersch wrote:
> I still need to figure out how to deal with metallocene compounds
> where there are 8 or more neighbors with the same symmetry class. I
> already have a hack to handle ferrocene but we might want to extend
> this. IIRC, this might also help kekulization?
>
> Metallocene: metal atom sandwiched between rings (4 or more atoms per ring)
> Normalization: Remove bonds connecting metal to ring atoms without
> increasing the number of disconnected fragments. Bonds will have to be
> sorted using symmetry classes to always remove the same bonds.
>
> This reduces the number of states for canonicalization dramatically.
> This also makes the smiles nicer since all the closure digits can be
> omitted.
>
> C12C3=C4[Fe]5678923(C1=C45)C1C6=C8C9=C71  -->   C1=CC(C=C1)[Fe]C1C=CC=C1
>
> Does this sound like a reasonable solution?


I think I'd vote for this, but there are some "philosophical" issues that it 
raises regarding normalizations.

On the practical side, I think this is an excellent idea. Metallocenes can
be an algorithmic quagmire that sucks good programmers into the mud. But if we
go down this path, we have to ask much harder questions.
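
To make Tim's proposal concrete, here is a minimal sketch of the bond-removal
step in plain Python. It is not Open Babel code: the function names, the
edge-list representation, and the `sym_class` input are all assumptions made
for illustration. Metal bonds are tried in order of symmetry class and removed
only when the fragment count stays the same.

```python
def count_fragments(n_atoms, bonds):
    """Count connected fragments with a small union-find."""
    parent = list(range(n_atoms))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for a, b in bonds:
        parent[find(a)] = find(b)
    return len({find(i) for i in range(n_atoms)})


def strip_metal_ring_bonds(n_atoms, bonds, metal, sym_class):
    """Drop metal-ring bonds without creating new fragments.

    bonds     -- list of (atom, atom) index pairs
    metal     -- index of the metal atom
    sym_class -- symmetry class per atom (hypothetical input, assumed
                 to come from an earlier symmetry-perception step)
    """
    baseline = count_fragments(n_atoms, bonds)
    kept = list(bonds)

    def sort_key(bond):
        other = bond[0] if bond[1] == metal else bond[1]
        # The atom index is only a placeholder tie-break; a real
        # implementation needs a canonical one for equal classes.
        return (sym_class[other], other)

    for bond in sorted((b for b in bonds if metal in b), key=sort_key):
        trial = [b for b in kept if b != bond]
        if count_fragments(n_atoms, trial) == baseline:
            kept = trial
    return kept


# Ferrocene-like graph: two 5-rings (atoms 0-4 and 6-10) around Fe (atom 5).
FE = 5
bonds = []
for ring in ([0, 1, 2, 3, 4], [6, 7, 8, 9, 10]):
    bonds += [(ring[i], ring[(i + 1) % 5]) for i in range(5)]
    bonds += [(FE, atom) for atom in ring]
sym_class = [1] * 11
sym_class[FE] = 2

kept = strip_metal_ring_bonds(11, bonds, FE, sym_class)
print(len(bonds), "->", len(kept))  # prints: 20 -> 12
```

In this toy case one Fe-C bond survives per ring, which matches the shape of
Tim's target SMILES. Note that the sketch sidesteps the hard part: when all
ring atoms share a symmetry class, which bond survives depends on the
tie-break, and that is exactly where a canonical choice is needed.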

The point of canonicalization is to generate a single SMILES for each molecule 
for database purposes.  But when are two molecules the same and when are they 
different?  That's a very hard question.

If we start normalizing metallocenes, why not normalize nitro groups,
phosphates, and sulfonates? (Sorry, I'm not a chemist, I hope I got these names
right.)  What about tautomers?

The Weiningers (Dave and Art) decided to put aromaticity in as part of the
definition of canonical SMILES because a Kekulé representation was worthless
for database use.  But their original database was only 25,000 compounds and
they were only concerned about cLogP calculations, so they left out these other
cases.  Those cases just didn't matter.

But in a modern cheminformatics system, they are equally problematic.  The 
Weiningers solved the aromaticity problem, but left all the others "as an 
exercise for the reader" (that would be us).

The people behind InChI decided to handle more problems.  But they were guided 
by their own internal requirements: to produce a consistent nomenclature for 
IUPAC.  They were NOT trying to provide a useful general-purpose solution for 
cheminformatics.

So now the OpenBabel project and OpenSMILES definition are facing a problem: 
How much normalization are we going to do?  Are we going to go just one step 
further and decide that metallocenes should be normalized, but not nitro groups 
or tautomers?  Or are we going to go all the way and define clear standards for 
normalizing all of these problem cases?

At Daylight, we came up with three levels of normalization:

   Absolute SMILES: Includes stereochemistry and isotopic markings
   Unique SMILES: Excludes stereochemistry and isotopes
   Graph SMILES: All atoms are C, all bonds are single

My colleague Rashmi Mistry (modgraph.co.uk) wrote the chemical registration 
systems for GSK and several other large pharma companies.  He came up with a 
whole set of rules for normalizations that includes all of these problem cases, 
plus another layer of normalization:

   Parent SMILES: Remove salts and solvates

I would argue that if we're going to start doing more normalizations for 
SMILES, we should be formal about it and establish three or four formal levels 
of canonicalization, much like Daylight's.
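
As a rough illustration of what formal levels could look like, the sketch
below treats each level as a function that discards information from a toy
molecule record. The `Atom`, `Bond`, and `Mol` types and the function names are
hypothetical and are not Daylight or Open Babel APIs. The Parent level is left
out because the rules for picking the parent fragment are not given here.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Atom:
    element: str
    isotope: int = 0     # 0 means "not specified"
    chirality: str = ""  # e.g. "@" or "@@"; empty if none


@dataclass(frozen=True)
class Bond:
    a: int
    b: int
    order: int = 1
    stereo: str = ""     # e.g. "/" or "\\"; empty if none


@dataclass(frozen=True)
class Mol:
    atoms: Tuple[Atom, ...]
    bonds: Tuple[Bond, ...]


def absolute(mol: Mol) -> Mol:
    """Absolute level: keep stereochemistry and isotopic markings."""
    return mol


def unique(mol: Mol) -> Mol:
    """Unique level: drop stereochemistry and isotopes."""
    atoms = tuple(replace(a, isotope=0, chirality="") for a in mol.atoms)
    bonds = tuple(replace(b, stereo="") for b in mol.bonds)
    return Mol(atoms, bonds)


def graph(mol: Mol) -> Mol:
    """Graph level: every atom becomes C, every bond becomes single."""
    atoms = tuple(Atom("C") for _ in mol.atoms)
    bonds = tuple(Bond(b.a, b.b) for b in mol.bonds)
    return Mol(atoms, bonds)


# Toy input: an alanine-like skeleton with one chiral centre and a 13C label.
mol = Mol(
    atoms=(Atom("N"), Atom("C", chirality="@"), Atom("C", isotope=13),
           Atom("C"), Atom("O"), Atom("O")),
    bonds=(Bond(0, 1), Bond(1, 2), Bond(1, 3), Bond(3, 4, order=2),
           Bond(3, 5)),
)

levels = {"absolute": absolute(mol), "unique": unique(mol),
          "graph": graph(mol)}
```

Each level would then be fed to the same canonicalizer, so two records match at
a given level exactly when their reduced forms produce the same string.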

Craig

P.S. I think I'll cross-post this to the Blue Obelisk mailing list.


