Re: [OpenBabel-Devel] [BlueObelisk-SMILES] SMILES normalizations: metallocenes and other problems (from OpenBabel-devel)

Tim Vandermeersch Mon, 11 Oct 2010 11:31:48 -0700

On Mon, Oct 11, 2010 at 7:20 PM, Craig James <craig_ja...@emolecules.com> wrote:
> This is a cross-post from the OpenBabel-devel mailing list.
>
> On 10/11/10 8:52 AM, Tim Vandermeersch wrote:
>> I still need to figure out how to deal with metallocene compounds
>> where there are 8 or more neighbors with the same symmetry class. I
>> already have a hack to handle ferrocene but we might want to extend
>> this. IIRC, this might also help kekulization?
>>
>> Metallocene: metal atom sandwiched between rings (4 or more atoms per ring)
>> Normalization: Remove bonds connecting metal to ring atoms without
>> increasing the number of disconnected fragments. Bonds will have to be
>> sorted using symmetry classes to always remove the same bonds.
>>
>> This reduces the number of states for canonicalization dramatically.
>> This also makes the smiles nicer since all the closure digits can be
>> omitted.
>>
>> C12C3=C4[Fe]5678923(C1=C45)C1C6=C8C9=C71  -->   C1=CC(C=C1)[Fe]C1C=CC=C1
>>
>> Does this sound like a reasonable solution?
>
> I think I'd vote for this, but there are some "philosophical" issues that it 
> raises regarding normalizations.


> On a practical side, I think this is an excellent idea. These metallocenes 
> can be an algorithmic quagmire that sucks good programmers into the mud.  But 
> if we go down this path, we have to ask much harder questions.

I can probably find another way to deal with these structures but we
want to get OB released at some point :-)
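
For reference, the bond-removal idea quoted above (drop metal-ring bonds as long as the number of fragments does not increase, visiting bonds in symmetry-class order) could be sketched roughly as below. This is only an illustration in plain Python, not the Open Babel code: `count_fragments`, `strip_metal_bonds` and the `sym_class` mapping are hypothetical names, and the atom-index tie-breaker is an assumption that a real implementation would have to replace with something canonical.

```python
def count_fragments(n_atoms, bonds):
    """Count connected components with a small union-find."""
    parent = list(range(n_atoms))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for a, b in bonds:
        parent[find(a)] = find(b)
    return len({find(i) for i in range(n_atoms)})


def strip_metal_bonds(n_atoms, bonds, metal, sym_class):
    """Remove metal bonds that can go without splitting the molecule."""
    kept = set(bonds)
    fragments = count_fragments(n_atoms, kept)
    # Visit bonds in a fixed order: symmetry class of the ring atom first,
    # then the bond itself as a tie-breaker (assumption, not canonical).
    candidates = sorted(
        (b for b in kept if metal in b),
        key=lambda b: (sym_class[b[0] if b[1] == metal else b[1]], b),
    )
    for bond in candidates:
        trial = kept - {bond}
        if count_fragments(n_atoms, trial) == fragments:
            kept = trial  # still one piece, so the bond is redundant
    return kept


# Ferrocene drawn with 10 Fe-C bonds: atom 0 is Fe, 1-5 and 6-10 are the rings.
FE = 0
ring_a = [1, 2, 3, 4, 5]
ring_b = [6, 7, 8, 9, 10]
bonds = set()
for ring in (ring_a, ring_b):
    for i, atom in enumerate(ring):
        bonds.add((atom, ring[(i + 1) % 5]))  # ring bond
        bonds.add((FE, atom))                 # metal-ring bond
sym_class = {FE: 1, **{a: 2 for a in ring_a + ring_b}}

result = strip_metal_bonds(11, bonds, FE, sym_class)
metal_bonds = sorted(b for b in result if FE in b)
```

With this input, one Fe bond per ring survives, which matches the shape of the shortened SMILES in the example.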

> The point of canonicalization is to generate a single SMILES for each 
> molecule for database purposes.  But when are two molecules the same and when 
> are they different?  That's a very hard question.

Yes, for example: "do I have this in stock?" There are many ways to
draw ferrocene: 2 single bonds to ring atoms, 2 single bonds to ring
centroids (not sure how we handle this currently), 10 bonds, ... All
of these would need to be normalized.

> If we start normalizing metallocenes, why not normalize nitro, phosphate and 
> sulfonates? (Sorry, I'm not a chemist, I hope I got these names right.)  What 
> about tautomers?

All of these should be done if the user requests this. However, for
the OpenBabel 2.3 release, we canonicalize the structure without
additional normalization. For future versions, I think we should have
normalization plugins that can be enabled etc. I would like to add
good support for tautomers but this alone is already a reasonably big
task.

> The Weiningers (Dave and Art, and father Joseph contributed too) decided to 
> put aromaticity in as part of the definition of canonical SMILES because a 
> kekule representation was worthless for database use.  But his original 
> database was only 25,000 compounds and he was only concerned about cLogP 
> calculations, so he left out these other cases.  They just didn't matter.
>
> But in a modern cheminformatics system, they are equally problematic.  The 
> Weiningers solved the aromaticity problem, but left all the others "as an 
> exercise for the reader" (that would be us).
>
> The InChI team decided to handle more problems.  But they were guided by 
> their own internal requirements: to produce a consistent nomenclature for 
> IUPAC.  They were NOT trying to provide a useful general-purpose solution for 
> cheminformatics.

Yes, InChI has extensive normalization. This would be a good starting point.

> So now the OpenBabel project and OpenSMILES definition are facing a problem: 
> How much normalization are we going to do?  Are we going to go just one step 
> further and decide that metallocenes should be normalized, but not nitro 
> groups or tautomers?  Or are we going to go all the way and define clear 
> standards for normalizing all of these problem cases?
>
> At Daylight, we came up with three levels of normalization:
>
>   Absolute SMILES: Includes stereochemistry and isotopic markings
>   Unique SMILES: Excludes stereochemistry and isotopes
>   Graph SMILES: All atoms are C, all bonds are single

This is similar to the InChI layers, and our canonical coding also does
this, although it's not an option yet. See below.

> My colleague Rashmi Mistry (modgraph.co.uk) wrote the chemical registration 
> systems for GSK and several other large pharma companies.  He came up with a 
> whole set of rules for normalizations that includes all of these problem 
> cases, plus another layer of normalization:
>
>   Parent SMILES: Remove salts and solvates

If we have normalization plugins, it would be easy to do all this.
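
A very rough sketch of what I mean by plugins, in plain Python rather than the Open Babel plugin API: each normalization is a named function that the user can switch on. The names `PLUGINS`, `plugin`, `normalize` and `keep_largest_fragment` are all made up for illustration, and picking the longest dot-separated SMILES component is only a crude stand-in for real salt/solvate stripping.

```python
PLUGINS = {}  # hypothetical registry: plugin name -> function(smiles) -> smiles


def plugin(name):
    """Decorator that registers a normalization step under a name."""
    def register(func):
        PLUGINS[name] = func
        return func
    return register


@plugin("parent")
def keep_largest_fragment(smiles):
    # Crude assumption: the longest '.'-separated component is the parent.
    # A real implementation would count atoms on the parsed molecule.
    return max(smiles.split("."), key=len)


def normalize(smiles, enabled):
    """Apply the enabled plugins in the order the caller lists them."""
    for name in enabled:
        smiles = PLUGINS[name](smiles)
    return smiles


parent = normalize("CC(=O)[O-].[Na+]", ["parent"])
untouched = normalize("CC(=O)[O-].[Na+]", [])
```

With no plugins enabled the input passes through unchanged, which corresponds to the 2.3 behaviour of canonicalizing without additional normalization.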

> I would argue that if we're going to start doing more normalizations for 
> SMILES, we should be formal about it and establish three or four formal 
> levels of canonicalization, much like Daylight's.

The canonical code we produce is a list of numbers which is just a set
of joined smaller lists.

Bonds are encoded by a FROM and a CLOSURE list. This is the topology of
the molecule as a graph. It depends on symmetry classes and is not
the same as all carbon/single bonds, but this should be an option in
OBGraphSym.
Atom and bond types are the next layers (ATOM-TYPES & BOND-TYPES).
A CHARGES layer is added if needed.
The next layers could be made optional: ISOTOPES, STEREO.

I'll add a function parameter to allow for these layers to be set
using bit-ORed flags and take care of dependencies. This can be done
for 2.3.
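
Something along these lines, sketched in Python for brevity (the real function would be C++). The flag names follow the layers above, but the `Layer` enum, the `REQUIRES` dependency table and both functions are hypothetical; in particular, which layer depends on which is my assumption here and not settled.

```python
from enum import IntFlag


class Layer(IntFlag):
    TOPOLOGY = 1     # FROM and CLOSURE lists
    ATOM_TYPES = 2
    BOND_TYPES = 4
    CHARGES = 8
    ISOTOPES = 16    # optional
    STEREO = 32      # optional


# Assumed dependencies: a layer only makes sense on top of these layers.
REQUIRES = {
    Layer.ATOM_TYPES: Layer.TOPOLOGY,
    Layer.BOND_TYPES: Layer.TOPOLOGY,
    Layer.CHARGES: Layer.ATOM_TYPES,
    Layer.ISOTOPES: Layer.ATOM_TYPES,
    Layer.STEREO: Layer.ATOM_TYPES | Layer.BOND_TYPES,
}


def resolve(flags):
    """Add every layer that the requested layers depend on."""
    changed = True
    while changed:
        changed = False
        for layer, needed in REQUIRES.items():
            if flags & layer and (flags & needed) != needed:
                flags |= needed
                changed = True
    return flags


def canonical_code(layers, flags):
    """Join the per-layer lists selected by flags into one list of numbers."""
    code = []
    for layer in Layer:  # definition order: topology first, stereo last
        if flags & layer:
            code.extend(layers.get(layer, []))
    return code


wanted = resolve(Layer.STEREO)
example_layers = {
    Layer.TOPOLOGY: [1, 1, 2],
    Layer.ATOM_TYPES: [6, 6, 8],
    Layer.ISOTOPES: [13],
}
code = canonical_code(example_layers, resolve(Layer.ATOM_TYPES))
```

The caller ORs together the layers it wants, and `resolve` pulls in whatever those layers need, so asking for STEREO alone still yields a code that starts with the topology.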

To conclude, I totally agree with all of this but it is beyond the
scope of OB 2.3.

Tim

> Craig
>
> P.S. I think I'll cross-post this to the Blue Obelisk mailing list.
>
> ------------------------------------------------------------------------------
> Beautiful is writing same markup. Internet Explorer 9 supports
> standards for HTML5, CSS3, SVG 1.1,  ECMAScript5, and DOM L2 & L3.
> Spend less time writing and  rewriting code and more time creating great
> experiences on the web. Be a part of the beta today.
> http://p.sf.net/sfu/beautyoftheweb
> _______________________________________________
> Blueobelisk-SMILES mailing list
> blueobelisk-smi...@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/blueobelisk-smiles
>

_______________________________________________
OpenBabel-Devel mailing list
OpenBabel-Devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/openbabel-devel
