On Tue, Oct 12, 2010 at 3:14 AM, Tim Vandermeersch <tim.vandermeer...@gmail.com> wrote:
> On Mon, Oct 11, 2010 at 7:16 PM, Craig A. James <cja...@emolecules.com> wrote:
>> On 10/11/10 8:52 AM, Tim Vandermeersch wrote:
>>>
>>> I still need to figure out how to deal with metallocene compounds
>>> where there are 8 or more neighbors with the same symmetry class. I
>>> already have a hack to handle ferrocene but we might want to extend
>>> this. IIRC, this might also help kekulization?
>>>
>>> Metallocene: metal atom sandwiched between rings (4 or more atoms per
>>> ring)
>>> Normalization: Remove bonds connecting metal to ring atoms without
>>> increasing the number of disconnected fragments. Bonds will have to be
>>> sorted using symmetry classes to always remove the same bonds.
>>>
>>> This reduces the number of states for canonicalization dramatically.
>>> This also makes the smiles nicer since all the closure digits can be
>>> omitted.
>>>
>>> C12C3=C4[Fe]5678923(C1=C45)C1C6=C8C9=C71 --> C1=CC(C=C1)[Fe]C1C=CC=C1
>>>
>>> Does this sound like a reasonable solution?
>>
>> I think I'd vote for this, but there are some "philosophical" issues that it
>> raises regarding normalizations.
>>
>> On a practical side, I think this is an excellent idea. These metallocenes
>> can be an algorithmic quagmire that sucks good programmers into the mud.
>> But if we go down this path, we have to ask much harder questions.
>>
>> The point of canonicalization is to generate a single SMILES for each
>> molecule for database purposes. But when are two molecules the same and
>> when are they different? That's a very hard question.
>>
>> If we start normalizing metallocenes, why not normalize nitro, phosphate and
>> sulfonates? (Sorry, I'm not a chemist, I hope I got these names right.)
>> What about tautomers?
>>
>> The Weiningers (Dave and Art) decided to put aromaticity in as part of the
>> definition of canonical SMILES because a kekule representation was worthless
>> for database use.
>> But his original database was only 25,000 compounds and
>> he was only concerned about cLogP calculations, so he left out these other
>> cases. They just didn't matter.
>>
>> But in a modern cheminformatics system, they are equally problematic. The
>> Weiningers solved the aromaticity problem, but left all the others "as an
>> exercise for the reader" (that would be us).
>>
>> The people behind InChI decided to handle more problems. But they were
>> guided by their own internal requirements: to produce a consistent
>> nomenclature for IUPAC. They were NOT trying to provide a useful
>> general-purpose solution for cheminformatics.
>>
>> So now the OpenBabel project and OpenSMILES definition are facing a problem:
>> How much normalization are we going to do? Are we going to go just one step
>> further and decide that metallocenes should be normalized, but not nitro
>> groups or tautomers? Or are we going to go all the way and define clear
>> standards for normalizing all of these problem cases?
>>
>> At Daylight, we came up with three levels of normalization:
>>
>> Absolute SMILES: Includes stereochemistry and isotopic markings
>> Unique SMILES: Excludes stereochemistry and isotopes
>> Graph SMILES: All atoms are C, all bonds are single
>>
>> My colleague Rashmi Mistry (modgraph.co.uk) wrote the chemical registration
>> systems for GSK and several other large pharma companies. He came up with a
>> whole set of rules for normalizations that includes all of these problem
>> cases, plus another layer of normalization:
>>
>> Parent SMILES: Remove salts and solvates
>>
>> I would argue that if we're going to start doing more normalizations for
>> SMILES, we should be formal about it and establish three or four formal
>> levels of canonicalization, much like Daylight's.
>>
>> Craig
>>
>> P.S. I think I'll cross-post this to the Blue Obelisk mailing list.
>
> Ignoring metallocene bonds in the canonical coding algorithm seems to
> work great without modifying the molecule.
> Before making this change I
> got 17 errors in the first million molecules from the eMolecules db.
> These were all metallocenes and ignoring the bonds solves the problem
> for these. I'm shuffling the 1 million molecules again (20x each) and
> will commit if there are no errors. I also have one or three uncommitted
> fixes here.
>
> Using 5 processes it took about 1h15m to run the test. I used a
> process memory limit of 250MB which is never exceeded.

The changes are in svn trunk now. The first 1 million molecules from
eMolecules-2010-03-01.smi all pass the shuffle test. There is 1 timeout.
I'm now running a single process in the background for all 5 million
compounds.

Tim

> Tim
>

_______________________________________________
OpenBabel-Devel mailing list
OpenBabel-Devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/openbabel-devel
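
For readers unfamiliar with the shuffle test mentioned above, here is a
minimal sketch of the idea in Python. It is not the Open Babel test code:
canonical_form, shuffle_molecule and shuffle_test are hypothetical names, the
molecule is a toy labeled graph (a metal atom bonded to a four-membered
ring), and a brute-force minimum over all atom orderings stands in for a real
canonical SMILES writer. The only point is the invariant being checked:
renumbering the atoms must not change the canonical output.

```python
import itertools
import random


def canonical_form(labels, bonds):
    """Toy canonicalizer (assumption: stands in for a canonical SMILES writer).

    Tries every atom ordering and keeps the smallest encoding, so the result
    cannot depend on the input numbering. Only feasible for tiny graphs.
    """
    n = len(labels)
    best = None
    for perm in itertools.permutations(range(n)):
        # perm[old_index] gives the new index of each atom.
        new_labels = [None] * n
        for old, new in enumerate(perm):
            new_labels[new] = labels[old]
        new_bonds = sorted(tuple(sorted((perm[a], perm[b]))) for a, b in bonds)
        code = (tuple(new_labels), tuple(new_bonds))
        if best is None or code < best:
            best = code
    return best


def shuffle_molecule(labels, bonds, rng):
    """Return the same graph with randomly renumbered atoms and reordered bonds."""
    n = len(labels)
    perm = list(range(n))
    rng.shuffle(perm)
    new_labels = [None] * n
    for old, new in enumerate(perm):
        new_labels[new] = labels[old]
    new_bonds = [(perm[a], perm[b]) for a, b in bonds]
    rng.shuffle(new_bonds)
    return new_labels, new_bonds


def shuffle_test(labels, bonds, rounds=20, seed=0):
    """Count shuffles whose canonical form differs from the reference."""
    rng = random.Random(seed)
    reference = canonical_form(labels, bonds)
    errors = 0
    for _ in range(rounds):
        s_labels, s_bonds = shuffle_molecule(labels, bonds, rng)
        if canonical_form(s_labels, s_bonds) != reference:
            errors += 1
    return errors


# Toy graph: atom 0 is the metal, atoms 1-4 form a ring, metal bonded to all.
labels = ["Fe", "C", "C", "C", "C"]
bonds = [(1, 2), (2, 3), (3, 4), (4, 1), (0, 1), (0, 2), (0, 3), (0, 4)]

print(shuffle_test(labels, bonds, rounds=20))  # prints 0
```

A real canonicalizer cannot afford to enumerate every ordering, which is why
highly symmetric inputs such as metallocenes are the cases where a test like
this tends to find errors or timeouts.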