On 10/11/10 8:52 AM, Tim Vandermeersch wrote:
> I still need to figure out how to deal with metallocene compounds
> where there are 8 or more neighbors with the same symmetry class. I
> already have a hack to handle ferrocene but we might want to extend
> this. IIRC, this might also help kekulization?
>
> Metallocene: metal atom sandwiched between rings (4 or more atoms per ring)
> Normalization: Remove bonds connecting metal to ring atoms without
> increasing the number of disconnected fragments. Bonds will have to be
> sorted using symmetry classes to always remove the same bonds.
>
> This reduces the number of states for canonicalization dramatically.
> This also makes the smiles nicer since all the closure digits can be
> omitted.
>
> C12C3=C4[Fe]5678923(C1=C45)C1C6=C8C9=C71 --> C1=CC(C=C1)[Fe]C1C=CC=C1
>
> Does this sound like a reasonable solution?

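To make the proposed normalization concrete, here is a minimal Python sketch of the bond-removal step as I read it from the description quoted above. It is not Open Babel code: the function names (count_fragments, strip_metal_ring_bonds), the toy ferrocene graph, and the use of the atom index as a tie-breaker within a symmetry class are all assumptions made for illustration. A real implementation would need a canonical tie-break and would work on the toolkit's own molecule objects.

```python
from collections import defaultdict


def count_fragments(atoms, bonds):
    """Count connected components of a graph given as atom ids and bond pairs."""
    adjacency = defaultdict(set)
    for a, b in bonds:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen = set()
    fragments = 0
    for start in atoms:
        if start in seen:
            continue
        fragments += 1
        stack = [start]
        seen.add(start)
        while stack:  # depth-first walk over one fragment
            atom = stack.pop()
            for nbr in adjacency[atom]:
                if nbr not in seen:
                    seen.add(nbr)
                    stack.append(nbr)
    return fragments


def strip_metal_ring_bonds(atoms, bonds, metal, ring_atoms, sym_class):
    """Drop metal-ring bonds as long as the fragment count does not grow.

    Candidates are visited in (symmetry class, atom index) order so the
    same bonds are removed on every run. The index tie-break is an
    assumption of this sketch, not part of the original proposal.
    """
    bonds = {frozenset(b) for b in bonds}
    baseline = count_fragments(atoms, bonds)
    candidates = sorted(
        (a for a in ring_atoms if frozenset((metal, a)) in bonds),
        key=lambda a: (sym_class[a], a),
    )
    for atom in candidates:
        trial = bonds - {frozenset((metal, atom))}
        if count_fragments(atoms, trial) == baseline:
            bonds = trial  # safe to drop: nothing became disconnected
    return bonds


# Toy ferrocene: atom 0 is Fe, atoms 1-5 and 6-10 are the two rings.
atoms = list(range(11))
ring_a = [1, 2, 3, 4, 5]
ring_b = [6, 7, 8, 9, 10]
ring_bonds = [(r[i], r[(i + 1) % 5]) for r in (ring_a, ring_b) for i in range(5)]
metal_bonds = [(0, a) for a in ring_a + ring_b]
sym_class = {a: 1 for a in ring_a + ring_b}  # all ring carbons equivalent

result = strip_metal_ring_bonds(
    atoms, ring_bonds + metal_bonds, 0, ring_a + ring_b, sym_class
)
kept = sorted(next(iter(b - {0})) for b in result if 0 in b)
print(kept)
```

On this toy graph the metal ends up with exactly one bond to each ring, which matches the shape of the simplified SMILES in the quoted message. Recounting fragments after every trial removal is the simplest way to express the "do not increase the number of fragments" rule; it is quadratic in the worst case, but the number of metal-ring bonds is small.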
I think I'd vote for this, but it raises some "philosophical" issues about normalization.

On the practical side, I think this is an excellent idea. These metallocenes can be an algorithmic quagmire that sucks good programmers into the mud.

But if we go down this path, we have to ask much harder questions. The point of canonicalization is to generate a single SMILES for each molecule for database purposes. But when are two molecules the same and when are they different? That's a very hard question. If we start normalizing metallocenes, why not normalize nitro groups, phosphates and sulfonates? (Sorry, I'm not a chemist, I hope I got these names right.) What about tautomers?

The Weiningers (Dave and Art) decided to make aromaticity part of the definition of canonical SMILES because a Kekulé representation was worthless for database use. But the original database held only 25,000 compounds and the only concern was cLogP calculations, so these other cases were left out. They just didn't matter. In a modern cheminformatics system, though, they are equally problematic. The Weiningers solved the aromaticity problem, but left all the others "as an exercise for the reader" (that would be us).

The people behind InChI decided to handle more of these problems. But they were guided by their own internal requirements: to produce a consistent nomenclature for IUPAC. They were NOT trying to provide a useful general-purpose solution for cheminformatics.

So now the OpenBabel project and the OpenSMILES definition are facing a problem: How much normalization are we going to do? Are we going to go just one step further and decide that metallocenes should be normalized, but not nitro groups or tautomers? Or are we going to go all the way and define clear standards for normalizing all of these problem cases?

At Daylight, we came up with three levels of normalization:

  Absolute SMILES: Includes stereochemistry and isotopic markings
  Unique SMILES:   Excludes stereochemistry and isotopes
  Graph SMILES:    All atoms are C, all bonds are single

My colleague Rashmi Mistry (modgraph.co.uk) wrote the chemical registration systems for GSK and several other large pharma companies. He came up with a whole set of rules for normalization that covers all of these problem cases, plus another layer of normalization:

  Parent SMILES:   Remove salts and solvates

I would argue that if we're going to start doing more normalizations for SMILES, we should be formal about it and establish three or four formal levels of canonicalization, much like Daylight's.

Craig

P.S. I think I'll cross-post this to the Blue Obelisk mailing list.

_______________________________________________
OpenBabel-Devel mailing list
OpenBabel-Devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/openbabel-devel