Re: [OpenBabel-Devel] Bad canonicalization

Craig A. James Tue, 05 Oct 2010 09:59:25 -0700

On 10/5/10 1:06 AM, Noel O'Boyle wrote:
> On 5 October 2010 02:58, Geoffrey Hutchison<ge...@geoffhutchison.net>  wrote:
>>
>> On Oct 4, 2010, at 5:15 PM, Craig A. James wrote:
>>
>>> Something has gone badly wrong with the canonicalizer -- these are awful
>>> SMILES.  The earlier version had rules that made for "nice looking" SMILES.
>>> These are a mess.  I realize that the whole canonicalizer was rewritten
>>> for good reason, but we've now lost critical functionality.
>>
>> Do you think you can grab a list of some of the "nice looking" SMILES?
>
> I know what Craig is talking about - I noticed this myself. I
> understood that the change in canonical numbering was introduced to
> overcome canonicalisation problems (just a few weeks ago actually).
> The question is, can we have the best of both worlds - correct
> canonicalisation and nice SMILES? To be honest, I'm happy with correct
> canonicalisation but I guess there's no harm having both if we can.


It's actually pretty easy to achieve.  There are just a few rules you need.

When assigning the initial graph invariants:

1. Favor terminal atoms.  The old algorithm used a "longest path to the edge" 
measure: from this atom, how far away is the farthest atom?  The ones with the 
longest paths are favored.  This tends to make SMILES that start on long 
chains, such as "CCCCc1ccccc1" rather than starting on a ring atom.

2. Favor fewer bonds, since this will also put terminal atoms first.

3. Favor low atomic number, which will tend to put N, O and C first rather than 
metals, halogens and so forth.

These rules alone would solve most of the canonicalization problems.
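
As a rough illustration of the three rules above, here is a minimal Python sketch. It is not the canon.cpp code: the function names, the adjacency-list representation, and the order in which the rules are combined into one sort key (path length first, then bond count, then atomic number) are all assumptions made for the example.

```python
from collections import deque

def eccentricity(adj, start):
    """Rule 1 measure: bonds from `start` to the farthest atom (BFS)."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        a = queue.popleft()
        for b in adj[a]:
            if b not in dist:
                dist[b] = dist[a] + 1
                queue.append(b)
    return max(dist.values())

def initial_order(atomic_nums, adj):
    """Return atom indices from most to least favored.

    Assumed priority: longest path (rule 1), then fewest bonds (rule 2),
    then lowest atomic number (rule 3). Python's sort is stable, so
    remaining ties keep their input order.
    """
    def key(i):
        return (-eccentricity(adj, i), len(adj[i]), atomic_nums[i])
    return sorted(range(len(atomic_nums)), key=key)

# Butylbenzene heavy atoms: 0-3 are the chain, 4-9 are the ring.
bonds = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6),
         (6, 7), (7, 8), (8, 9), (9, 4)]
adj = {i: [] for i in range(10)}
for a, b in bonds:
    adj[a].append(b)
    adj[b].append(a)

order = initial_order([6] * 10, adj)
print(order[0])  # the terminal methyl carbon is favored as the start atom
```

In this example the terminal carbon and the para ring carbon are equally far from their farthest atom, so the bond-count rule is what breaks the tie in favor of the chain end.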

The rest of the "make it nice looking" rules are in the SMILES writer itself 
and don't depend on the canonical numbering.

For example, you can canonicalize the fragments separately, then write them out 
longest-to-shortest (which puts salts and solvates at the end), and if two 
fragments have the same number of characters, write them out alphabetically.  
You get a nice canonical SMILES, but the ordering is not part of the canonical 
labeling.
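
The fragment ordering just described can be sketched in a few lines of Python. The function name is hypothetical, the input is assumed to be a list of already-canonical fragment SMILES, and "alphabetically" is taken here to mean plain string comparison, which may not match what the SMILES writer actually does.

```python
def order_fragments(fragment_smiles):
    """Join fragments longest first; equal lengths fall back to string order."""
    ordered = sorted(fragment_smiles, key=lambda s: (-len(s), s))
    return ".".join(ordered)

# Sodium acetate: the counterion ends up last.
print(order_fragments(["[Na+]", "CC(=O)[O-]"]))

# Two fragments of equal length are ordered by string comparison.
print(order_fragments(["O", "[Cl-]", "C[NH3+]", "[Br-]"]))
```

Because the sort runs on finished fragment strings, it needs nothing from the canonical labeling beyond each fragment's own SMILES.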

Similarly, the decision to write cis as '/...\' versus '\.../' is made by the 
SMILES writer, and the decision whether to write "c1ccccc1c1ccccc1" versus 
"c1ccccc1c2ccccc2" is in the SMILES writer code.
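
To illustrate the ring-closure choice, here is a small Python sketch of a digit allocator that either reuses a closed ring's digit or always takes a fresh one. The class and function names are made up for this example, the biphenyl string is assembled by hand rather than by walking a molecule, and two-digit closures (the %nn form) are ignored.

```python
import heapq

class RingDigits:
    """Hands out ring-closure digits, optionally recycling closed ones."""

    def __init__(self, reuse=True):
        self.reuse = reuse
        self.free = []      # min-heap of digits released by closed rings
        self.next_new = 1   # next never-used digit

    def open(self):
        if self.reuse and self.free:
            return heapq.heappop(self.free)  # smallest available digit
        digit = self.next_new
        self.next_new += 1
        return digit

    def close(self, digit):
        heapq.heappush(self.free, digit)

def biphenyl(reuse):
    """Write two benzene rings in sequence, closing each before the next opens."""
    digits = RingDigits(reuse)
    parts = []
    for _ in range(2):
        d = digits.open()
        parts.append(f"c{d}ccccc{d}")
        digits.close(d)
    return "".join(parts)

print(biphenyl(reuse=True))
print(biphenyl(reuse=False))
```

Both outputs describe the same molecule, so the choice only affects how the string looks, not the canonical atom order.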

Most of the "prettiness" comes from smilesformat.cpp, but it starts with a few 
rules in canon.cpp.

Craig

