Re: [OpenBabel-Devel] weekend SMILES canonicalization errors

Tim Vandermeersch Tue, 12 Oct 2010 10:29:05 -0700

On Tue, Oct 12, 2010 at 3:14 AM, Tim Vandermeersch
<tim.vandermeer...@gmail.com> wrote:
> On Mon, Oct 11, 2010 at 7:16 PM, Craig A. James <cja...@emolecules.com> wrote:
>> On 10/11/10 8:52 AM, Tim Vandermeersch wrote:
>>>
>>> I still need to figure out how to deal with metallocene compounds
>>> where there are 8 or more neighbors with the same symmetry class. I
>>> already have a hack to handle ferrocene but we might want to extend
>>> this. IIRC, this might also help kekulization?
>>>
>>> Metallocene: metal atom sandwiched between rings (4 or more atoms per
>>> ring)
>>> Normalization: Remove bonds connecting metal to ring atoms without
>>> increasing the number of disconnected fragments. Bonds will have to be
>>> sorted using symmetry classes to always remove the same bonds.
>>>
>>> This reduces the number of states for canonicalization dramatically.
>>> This also makes the smiles nicer since all the closure digits can be
>>> omitted.
>>>
>>> C12C3=C4[Fe]5678923(C1=C45)C1C6=C8C9=C71  -->   C1=CC(C=C1)[Fe]C1C=CC=C1
>>>
>>> Does this sound like a reasonable solution?
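The normalization proposed above can be sketched on a plain graph. This is only an illustration, not the Open Babel implementation: the names `count_fragments` and `prune_metal_bonds` are hypothetical, the symmetry classes are assumed to be computed elsewhere, and the tie-break on atom index within a symmetry class is a simplification that a real canonical procedure would have to replace.

```python
from collections import defaultdict


def count_fragments(n_atoms, bonds):
    """Count connected components with an iterative depth-first search."""
    adj = defaultdict(set)
    for a, b in bonds:
        adj[a].add(b)
        adj[b].add(a)
    seen = set()
    fragments = 0
    for start in range(n_atoms):
        if start in seen:
            continue
        fragments += 1
        seen.add(start)
        stack = [start]
        while stack:
            atom = stack.pop()
            for nbr in adj[atom]:
                if nbr not in seen:
                    seen.add(nbr)
                    stack.append(nbr)
    return fragments


def prune_metal_bonds(n_atoms, bonds, metal, sym_class):
    """Drop metal-ring bonds as long as the fragment count does not grow.

    Bonds are visited in order of the ring atom's symmetry class.
    ASSUMPTION: ties inside a class are broken by atom index, which is
    input-order dependent and therefore not canonical by itself.
    """
    kept = {tuple(sorted(b)) for b in bonds}
    target = count_fragments(n_atoms, kept)

    def ring_atom(bond):
        return bond[0] if bond[1] == metal else bond[1]

    metal_bonds = sorted(
        (b for b in kept if metal in b),
        key=lambda b: (sym_class[ring_atom(b)], ring_atom(b)),
    )
    for bond in metal_bonds:
        trial = kept - {bond}
        if count_fragments(n_atoms, trial) == target:
            kept = trial  # removal did not disconnect anything
    return kept


# Ferrocene-like graph: atom 0 is Fe, atoms 1-5 and 6-10 are the two rings.
ring_a = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 1)]
ring_b = [(6, 7), (7, 8), (8, 9), (9, 10), (10, 6)]
fe_bonds = [(0, i) for i in range(1, 11)]
sym = [0] + [1] * 10  # all ring carbons share one symmetry class

pruned = prune_metal_bonds(11, ring_a + ring_b + fe_bonds, 0, sym)
```

In this toy case the metal keeps exactly one bond to each ring, which matches the shape of the shorter SMILES shown above.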
>>
>> I think I'd vote for this, but there are some "philosophical" issues that it
>> raises regarding normalizations.
>>
>> On a practical side, I think this is an excellent idea. These metallocenes
>> can be an algorithmic quagmire that sucks good programmers into the mud.
>>  But if we go down this path, we have to ask much harder questions.
>>
>> The point of canonicalization is to generate a single SMILES for each
>> molecule for database purposes.  But when are two molecules the same and
>> when are they different?  That's a very hard question.
>>
>> If we start normalizing metallocenes, why not normalize nitro, phosphate and
>> sulfonates? (Sorry, I'm not a chemist, I hope I got these names right.)
>>  What about tautomers?
>>
>> The Weiningers (Dave and Art) decided to put aromaticity in as part of the
>> definition of canonical SMILES because a kekule representation was worthless
>> for database use.  But Dave's original database was only 25,000 compounds and
>> he was only concerned about cLogP calculations, so he left out these other
>> cases.  They just didn't matter.
>>
>> But in a modern cheminformatics system, they are equally problematic.  The
>> Weiningers solved the aromaticity problem, but left all the others "as an
>> exercise for the reader" (that would be us).
>>
>> The people behind InChI decided to handle more problems.  But they were
>> guided by their own internal requirements: to produce a consistent
>> nomenclature for IUPAC.  They were NOT trying to provide a useful
>> general-purpose solution for cheminformatics.
>>
>> So now the OpenBabel project and OpenSMILES definition are facing a problem:
>> How much normalization are we going to do?  Are we going to go just one step
>> further and decide that metallocenes should be normalized, but not nitro
>> groups or tautomers?  Or are we going to go all the way and define clear
>> standards for normalizing all of these problem cases?
>>
>> At Daylight, we came up with three levels of normalization:
>>
>>  Absolute SMILES: Includes stereochemistry and isotopic markings
>>  Unique SMILES: Excludes stereochemistry and isotopes
>>  Graph SMILES: All atoms are C, all bonds are single
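The three levels listed above can be read as successive projections of one molecule record. The sketch below assumes a deliberately simple data model (the `Atom` and `Bond` classes and the `to_unique` / `to_graph` helpers are hypothetical, not Daylight or Open Babel APIs) and only shows which information each level discards.

```python
from dataclasses import dataclass, replace
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Atom:
    element: str
    isotope: Optional[int] = None
    stereo: Optional[str] = None  # e.g. "@" or "@@"


@dataclass(frozen=True)
class Bond:
    begin: int
    end: int
    order: int = 1
    stereo: Optional[str] = None  # e.g. "/" or "\\"


Molecule = Tuple[List[Atom], List[Bond]]


def to_unique(mol: Molecule) -> Molecule:
    """'Unique' level: drop stereochemistry and isotope labels."""
    atoms, bonds = mol
    return (
        [replace(a, isotope=None, stereo=None) for a in atoms],
        [replace(b, stereo=None) for b in bonds],
    )


def to_graph(mol: Molecule) -> Molecule:
    """'Graph' level: every atom becomes C, every bond becomes single."""
    atoms, bonds = to_unique(mol)
    return (
        [replace(a, element="C") for a in atoms],
        [replace(b, order=1) for b in bonds],
    )


# The 'absolute' level is the record as given, with all labels intact.
absolute: Molecule = (
    [Atom("C", isotope=13), Atom("C", stereo="@"), Atom("O")],
    [Bond(0, 1), Bond(1, 2, order=2)],
)
unique = to_unique(absolute)
graph = to_graph(absolute)
```

Each level is a function of the previous one, so two records that agree at the absolute level necessarily agree at the unique and graph levels as well.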
>>
>> My colleague Rashmi Mistry (modgraph.co.uk) wrote the chemical registration
>> systems for GSK and several other large pharma companies.  He came up with a
>> whole set of rules for normalizations that includes all of these problem
>> cases, plus another layer of normalization:
>>
>>  Parent SMILES: Remove salts and solvates
>>
>> I would argue that if we're going to start doing more normalizations for
>> SMILES, we should be formal about it and establish three or four formal
>> levels of canonicalization, much like Daylight's.
>>
>> Craig
>>
>> P.S. I think I'll cross-post this to the Blue Obelisk mailing list.
>
> Ignoring metallocene bonds in the canonical coding algorithm seems to
> work great without modifying the molecule. Before making this change I
> got 17 errors in the first million molecules from the eMolecules db.
> These were all metallocenes and ignoring the bonds solves the problem
> for these. I'm shuffling the 1 million molecules again (20x each) and
> will commit if there are no errors. I also have one or three uncommitted
> fixes here.
>
> Using 5 processes it took about 1h15m to run the test. I used a
> process memory limit of 250MB which was never exceeded.


The changes are in svn trunk now. The first 1 million molecules from
eMolecules-2010-03-01.smi all pass the shuffle test. There is 1
timeout. I'm now running a single process in the background for all 5
million compounds.
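For readers unfamiliar with the shuffle test, the idea is to renumber the atoms of each molecule at random several times and check that the canonical output never changes. The sketch below is not the test harness used here: `relabel`, `shuffle_test`, and the brute-force `brute_force_canonical` are hypothetical stand-ins that work on a bare element list plus bond list, and the brute-force canonicalizer is only usable for very small graphs.

```python
import random
from itertools import permutations


def relabel(elements, bonds, perm):
    """Renumber atoms so that old index i becomes perm[i]."""
    new_elements = [None] * len(elements)
    for old, new in enumerate(perm):
        new_elements[new] = elements[old]
    new_bonds = sorted(tuple(sorted((perm[a], perm[b]))) for a, b in bonds)
    return new_elements, new_bonds


def brute_force_canonical(elements, bonds):
    """Toy canonical form: the smallest labeled graph over all orderings."""
    best = None
    for perm in permutations(range(len(elements))):
        new_elements, new_bonds = relabel(elements, bonds, perm)
        key = (tuple(new_elements), tuple(new_bonds))
        if best is None or key < best:
            best = key
    return best


def shuffle_test(elements, bonds, canonicalize, rounds=20, seed=0):
    """Return how many random renumberings change the canonical output."""
    rng = random.Random(seed)
    reference = canonicalize(elements, bonds)
    failures = 0
    for _ in range(rounds):
        perm = list(range(len(elements)))
        rng.shuffle(perm)
        shuffled_elements, shuffled_bonds = relabel(elements, bonds, perm)
        if canonicalize(shuffled_elements, shuffled_bonds) != reference:
            failures += 1
    return failures


# Heavy-atom skeleton of acetic acid (bond orders ignored in this toy model).
elements = ["C", "C", "O", "O"]
bonds = [(0, 1), (1, 2), (1, 3)]
failures = shuffle_test(elements, bonds, brute_force_canonical)
```

A canonicalizer that depends on input atom order would report a nonzero failure count here, which is the kind of error the runs above are designed to catch.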

Tim

> Tim
>
