Re: [OpenBabel-Devel] weekend SMILES canonicalization errors

Tim Vandermeersch Mon, 11 Oct 2010 18:14:56 -0700

On Mon, Oct 11, 2010 at 7:16 PM, Craig A. James <[email protected]> wrote:
> On 10/11/10 8:52 AM, Tim Vandermeersch wrote:
>>
>> I still need to figure out how to deal with metallocene compounds
>> where there are 8 or more neighbors with the same symmetry class. I
>> already have a hack to handle ferrocene but we might want to extend
>> this. IIRC, this might also help kekulization?
>>
>> Metallocene: metal atom sandwiched between rings (4 or more atoms per
>> ring)
>> Normalization: Remove bonds connecting metal to ring atoms without
>> increasing the number of disconnected fragments. Bonds will have to be
>> sorted using symmetry classes to always remove the same bonds.
>>
>> This reduces the number of states for canonicalization dramatically.
>> This also makes the smiles nicer since all the closure digits can be
>> omitted.
>>
>> C12C3=C4[Fe]5678923(C1=C45)C1C6=C8C9=C71  -->   C1=CC(C=C1)[Fe]C1C=CC=C1
>>
>> Does this sound like a reasonable solution?
>
> I think I'd vote for this, but there are some "philosophical" issues that it
> raises regarding normalizations.
>
> On a practical side, I think this is an excellent idea. These metallocenes
> can be an algorithmic quagmire that sucks good programmers into the mud.
>  But if we go down this path, we have to ask much harder questions.
>
> The point of canonicalization is to generate a single SMILES for each
> molecule for database purposes.  But when are two molecules the same and
> when are they different?  That's a very hard question.
>
> If we start normalizing metallocenes, why not normalize nitro, phosphate and
> sulfonates? (Sorry, I'm not a chemist, I hope I got these names right.)
>  What about tautomers?
>
> The Weiningers (Dave and Art) decided to put aromaticity in as part of the
> definition of canonical SMILES because a kekule representation was worthless
> for database use.  But his original database was only 25,000 compounds and
> he was only concerned about cLogP calculations, so he left out these other
> cases.  They just didn't matter.
>
> But in a modern cheminformatics system, they are equally problematic.  The
> Weiningers solved the aromaticity problem, but left all the others "as an
> exercise for the reader" (that would be us).
>
> The people behind InChI decided to handle more problems.  But they were
> guided by their own internal requirements: to produce a consistent
> nomenclature for IUPAC.  They were NOT trying to provide a useful
> general-purpose solution for cheminformatics.
>
> So now the OpenBabel project and OpenSMILES definition are facing a problem:
> How much normalization are we going to do?  Are we going to go just one step
> further and decide that metallocenes should be normalized, but not nitro
> groups or tautomers?  Or are we going to go all the way and define clear
> standards for normalizing all of these problem cases?
>
> At Daylight, we came up with three levels of normalization:
>
>  Absolute SMILES: Includes stereochemistry and isotopic markings
>  Unique SMILES: Excludes stereochemistry and isotopes
>  Graph SMILES: All atoms are C, all bonds are single
>
> My colleague Rashmi Mistry (modgraph.co.uk) wrote the chemical registration
> systems for GSK and several other large pharma companies.  He came up with a
> whole set of rules for normalizations that includes all of these problem
> cases, plus another layer of normalization:
>
>  Parent SMILES: Remove salts and solvates
>
> I would argue that if we're going to start doing more normalizations for
> SMILES, we should be formal about it and establish three or four formal
> levels of canonicalization, much like Daylight's.
>
> Craig
>
> P.S. I think I'll cross-post this to the Blue Obelisk mailing list.


Ignoring metallocene bonds in the canonical coding algorithm seems to
work great, and it doesn't require modifying the molecule. Before
making this change I got 17 errors in the first million molecules from
the eMolecules db. These were all metallocenes, and ignoring the bonds
solves the problem for them. I'm shuffling the 1 million molecules
again (20x each) and will commit if there are no errors. I also have a
few uncommitted fixes here.
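
For reference, the bond selection from the quoted proposal (drop
metal-ring bonds as long as the number of fragments does not go up, in
an order fixed by the symmetry classes) could be sketched roughly as
below. This is only an illustration in Python, not the Open Babel code:
the function names, the plain edge-list molecule and the hand-assigned
symmetry classes are all assumptions made for the example.

```python
def count_fragments(n_atoms, bonds):
    """Count connected components with a small union-find."""
    parent = list(range(n_atoms))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in bonds:
        parent[find(a)] = find(b)
    return len({find(i) for i in range(n_atoms)})


def prune_metal_bonds(n_atoms, bonds, metal, sym_class):
    """Drop metal bonds that can go without splitting the molecule.

    sym_class[i] is a (hypothetical) precomputed symmetry class of atom i.
    """
    kept = list(bonds)
    base = count_fragments(n_atoms, kept)
    # Visit candidate bonds ordered by the class of the non-metal atom.
    # Ties keep input order here; real code needs a proper tie-break.
    candidates = sorted(
        (b for b in bonds if metal in b),
        key=lambda b: sym_class[b[0] if b[1] == metal else b[1]],
    )
    for bond in candidates:
        trial = [b for b in kept if b != bond]
        if count_fragments(n_atoms, trial) == base:
            kept = trial  # still one piece, so the bond can be ignored
    return kept


# Ferrocene as a toy graph: atom 0 is Fe, 1-5 and 6-10 are the two rings.
ring_a = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 1)]
ring_b = [(6, 7), (7, 8), (8, 9), (9, 10), (10, 6)]
fe_bonds = [(0, i) for i in range(1, 11)]
sym_class = [0] + [1] * 10  # all ring carbons share one class

kept = prune_metal_bonds(11, ring_a + ring_b + fe_bonds, 0, sym_class)
metal_bonds_left = [b for b in kept if 0 in b]
```

In this toy case one Fe-C bond per ring survives, which matches the
shape of the simplified SMILES in the quoted message. Because every
ring carbon has the same class, which bond survives depends on the
tie-break, so the choice is only well defined up to symmetry.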

Using 5 processes, the test took about 1h15m to run. I used a
process memory limit of 250MB, which was never exceeded.
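
The shuffle test itself is simple in outline: renumber the atoms at
random, canonicalize again, and count every case where the result
differs from the reference. A minimal Python sketch follows. The names
(relabel, toy_canonical, shuffle_test) and the brute-force canonical
form are assumptions for illustration only; the real test runs Open
Babel's canonical SMILES code on the eMolecules set.

```python
import random
from itertools import permutations


def relabel(labels, bonds, order):
    """Renumber atoms; order[i] is the old index of new atom i."""
    new_index = {old: new for new, old in enumerate(order)}
    new_labels = [labels[old] for old in order]
    new_bonds = [(new_index[a], new_index[b]) for a, b in bonds]
    return new_labels, new_bonds


def toy_canonical(labels, bonds):
    """Brute-force canonical form: the smallest (labels, bonds) key
    over all atom numberings. Only usable for tiny graphs."""
    best = None
    for order in permutations(range(len(labels))):
        l, b = relabel(labels, bonds, order)
        key = (tuple(l), tuple(sorted(tuple(sorted(e)) for e in b)))
        if best is None or key < best:
            best = key
    return best


def shuffle_test(labels, bonds, canon, rounds=20, seed=0):
    """Return how many shuffled copies canonicalize differently."""
    rng = random.Random(seed)
    reference = canon(labels, bonds)
    errors = 0
    for _ in range(rounds):
        order = list(range(len(labels)))
        rng.shuffle(order)
        l, b = relabel(labels, bonds, order)
        if canon(l, b) != reference:
            errors += 1
    return errors


# Heavy atoms of ethanol: C-C-O
labels = ["C", "C", "O"]
bonds = [(0, 1), (1, 2)]
errors = shuffle_test(labels, bonds, toy_canonical)
```

A correct canonicalizer should give zero errors no matter how the
input atoms are ordered; a function that just echoes its input order
fails almost immediately.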

Tim

