On Mon, Oct 11, 2010 at 7:16 PM, Craig A. James <cja...@emolecules.com> wrote:
> On 10/11/10 8:52 AM, Tim Vandermeersch wrote:
>>
>> I still need to figure out how to deal with metallocene compounds
>> where there are 8 or more neighbors with the same symmetry class. I
>> already have a hack to handle ferrocene but we might want to extend
>> this. IIRC, this might also help kekulization?
>>
>> Metallocene: metal atom sandwiched between rings (4 or more atoms per
>> ring)
>> Normalization: Remove bonds connecting metal to ring atoms without
>> increasing the number of disconnected fragments. Bonds will have to be
>> sorted using symmetry classes to always remove the same bonds.
>>
>> This reduces the number of states for canonicalization dramatically.
>> This also makes the smiles nicer since all the closure digits can be
>> omitted.
>>
>> C12C3=C4[Fe]5678923(C1=C45)C1C6=C8C9=C71 --> C1=CC(C=C1)[Fe]C1C=CC=C1
>>
>> Does this sound like a reasonable solution?
>
> I think I'd vote for this, but there are some "philosophical" issues that it
> raises regarding normalizations.
>
> On a practical side, I think this is an excellent idea. These metallocenes
> can be an algorithmic quagmire that sucks good programmers into the mud.
> But if we go down this path, we have to ask much harder questions.
>
> The point of canonicalization is to generate a single SMILES for each
> molecule for database purposes. But when are two molecules the same and
> when are they different? That's a very hard question.
>
> If we start normalizing metallocenes, why not normalize nitro, phosphate and
> sulfonates? (Sorry, I'm not a chemist, I hope I got these names right.)
> What about tautomers?
>
> The Weiningers (Dave and Art) decided to put aromaticity in as part of the
> definition of canonical SMILES because a kekule representation was worthless
> for database use. But his original database was only 25,000 compounds and
> he was only concerned about cLogP calculations, so he left out these other
> cases. They just didn't matter.
>
> But in a modern cheminformatics system, they are equally problematic. The
> Weiningers solved the aromaticity problem, but left all the others "as an
> exercise for the reader" (that would be us).
>
> The people behind InChI decided to handle more problems. But they were
> guided by their own internal requirements: to produce a consistent
> nomenclature for IUPAC. They were NOT trying to provide a useful
> general-purpose solution for cheminformatics.
>
> So now the OpenBabel project and OpenSMILES definition are facing a problem:
> How much normalization are we going to do? Are we going to go just one step
> further and decide that metallocenes should be normalized, but not nitro
> groups or tautomers? Or are we going to go all the way and define clear
> standards for normalizing all of these problem cases?
>
> At Daylight, we came up with three levels of normalization:
>
> Absolute SMILES: Includes stereochemistry and isotopic markings
> Unique SMILES: Excludes stereochemistry and isotopes
> Graph SMILES: All atoms are C, all bonds are single
>
> My colleague Rashmi Mistry (modgraph.co.uk) wrote the chemical registration
> systems for GSK and several other large pharma companies. He came up with a
> whole set of rules for normalizations that includes all of these problem
> cases, plus another layer of normalization:
>
> Parent SMILES: Remove salts and solvates
>
> I would argue that if we're going to start doing more normalizations for
> SMILES, we should be formal about it and establish three or four formal
> levels of canonicalization, much like Daylight's.
>
> Craig
>
> P.S. I think I'll cross-post this to the Blue Obelisk mailing list.
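
For readers following the thread, the normalization quoted above (drop metal-to-ring bonds as long as the molecule does not fall apart, visiting bonds in symmetry-class order) can be sketched roughly as below. This is only an illustration under stated assumptions, not Open Babel code: `count_fragments` and `strip_metal_bonds` are hypothetical helper names, the molecule is a bare list of atom-index pairs, the symmetry classes are assumed to be supplied by the caller, and the sketch does not check that the metal's neighbor is actually a ring atom.

```python
def count_fragments(n_atoms, bonds):
    """Count connected components with a small union-find."""
    parent = list(range(n_atoms))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in bonds:
        parent[find(a)] = find(b)
    return len({find(i) for i in range(n_atoms)})


def strip_metal_bonds(n_atoms, bonds, metal, sym_class):
    """Remove bonds to `metal` unless removal would add a fragment.

    Bonds are visited in (symmetry class of the neighbor, bond) order so
    the same bonds are tried first on every run. The tie-break on the bond
    tuple is an assumption of this sketch; it is harmless only when atoms
    sharing a class are truly equivalent.
    """
    kept = list(bonds)
    baseline = count_fragments(n_atoms, kept)
    candidates = sorted(
        (b for b in bonds if metal in b),
        key=lambda b: (sym_class[b[0] if b[1] == metal else b[1]], b),
    )
    for bond in candidates:
        trial = [b for b in kept if b != bond]
        if count_fragments(n_atoms, trial) == baseline:
            kept = trial  # safe to drop: molecule is still in one piece
    return kept


# Ferrocene as a bare graph: atom 0 is Fe, atoms 1-5 and 6-10 are the rings.
ring_a = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 1)]
ring_b = [(6, 7), (7, 8), (8, 9), (9, 10), (10, 6)]
fe_bonds = [(0, i) for i in range(1, 11)]
sym = [0] + [1] * 10  # assumed classes: all ten carbons equivalent

kept = strip_metal_bonds(11, ring_a + ring_b + fe_bonds, 0, sym)
metal_bonds = [b for b in kept if 0 in b]  # [(0, 5), (0, 10)]
```

Fe ends up with one bond to each ring, which matches the shape of the shortened SMILES in the quoted proposal.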
Ignoring metallocene bonds in the canonical coding algorithm seems to work great, and it does not require modifying the molecule. Before making this change I got 17 errors in the first million molecules from the eMolecules db. These were all metallocenes, and ignoring the bonds solves the problem for them.

I'm shuffling the 1 million molecules again (20x each) and will commit if there are no errors. I also have one or three uncommitted fixes here.

Using 5 processes, it took about 1h15m to run the test. I used a process memory limit of 250MB, which was never exceeded.

Tim

_______________________________________________
OpenBabel-Devel mailing list
OpenBabel-Devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/openbabel-devel
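
The shuffle test described above (renumber each molecule's atoms at random 20 times and check that the canonical output never changes) can be sketched as follows. This is a minimal illustration, not the actual Open Babel test: `shuffle_molecule`, `shuffle_test` and `brute_force_code` are hypothetical names, molecules are bare lists of element symbols and atom-index pairs with bond orders ignored, and the brute-force stand-in for the canonicalizer is only usable on very small graphs.

```python
import random
from itertools import permutations


def shuffle_molecule(symbols, bonds, rng):
    """Return the same molecule with atoms renumbered in random order."""
    order = list(range(len(symbols)))
    rng.shuffle(order)  # order[new] = old
    new_index = {old: new for new, old in enumerate(order)}
    new_symbols = [symbols[old] for old in order]
    new_bonds = [(new_index[a], new_index[b]) for a, b in bonds]
    rng.shuffle(new_bonds)  # bond order in the input should not matter either
    return new_symbols, new_bonds


def shuffle_test(molecules, canonicalize, rounds=20, seed=0):
    """Return the names of molecules whose code changes under shuffling."""
    rng = random.Random(seed)
    failures = []
    for name, (symbols, bonds) in molecules.items():
        reference = canonicalize(symbols, bonds)
        for _ in range(rounds):
            shuffled = shuffle_molecule(symbols, bonds, rng)
            if canonicalize(*shuffled) != reference:
                failures.append(name)
                break
    return failures


def brute_force_code(symbols, bonds):
    """Toy canonical code: the smallest labeling over all renumberings."""
    n = len(symbols)
    best = None
    for perm in permutations(range(n)):  # perm[old] = new
        labels = [None] * n
        for old, new in enumerate(perm):
            labels[new] = symbols[old]
        edges = tuple(sorted(tuple(sorted((perm[a], perm[b]))) for a, b in bonds))
        code = (tuple(labels), edges)
        if best is None or code < best:
            best = code
    return best


# Heavy-atom graphs only; bond orders are left out of this sketch.
MOLECULES = {
    "ethanol": (["C", "C", "O"], [(0, 1), (1, 2)]),
    "acetic acid": (["C", "C", "O", "O"], [(0, 1), (1, 2), (1, 3)]),
}

failures = shuffle_test(MOLECULES, brute_force_code)
```

Any molecule name that shows up in `failures` points at an input whose canonical code depends on atom order, which is the kind of error the metallocenes were triggering.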