I decided to check this out a bit more. I have a test set of 18000 3D
structures I've discussed previously on this list. In short, I compare
the canonical smiles from a 3D structure to that for the structure
obtained when roundtripped through smiles. This highlights a variety
of errors, one of which is kekulization problems (typically manifested
as nH vs n).
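
The round-trip check described above can be sketched roughly as follows. This is a minimal sketch, not the actual test script: `roundtrip_mismatches`, `nh_changed`, and the `canonicalize` callable are hypothetical names, and the stand-in converter simply replays one before/after pair reported further down in this thread instead of calling Open Babel.

```python
from typing import Callable, Iterable, List, Tuple


def roundtrip_mismatches(
    canonical_smiles: Iterable[str],
    canonicalize: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Return (before, after) pairs whose canonical SMILES changes on re-reading."""
    mismatches = []
    for before in canonical_smiles:
        # Read the SMILES back in and write it out again as canonical SMILES.
        after = canonicalize(before)
        if after != before:
            mismatches.append((before, after))
    return mismatches


def nh_changed(before: str, after: str) -> bool:
    """Flag the 'nH vs n' symptom: the number of [nH] atoms differs."""
    return before.count("[nH]") != after.count("[nH]")


# Stand-in converter (assumption): replays one failure quoted in this thread.
# With Open Babel's Python bindings one could presumably pass something like
#   lambda s: pybel.readstring("smi", s).write("can").split()[0]
# instead; that call is not exercised here.
reported = {"[s]1n[s]n1": "S1NSN1"}
fake_canonicalize = lambda s: reported.get(s, s)

errors = roundtrip_mismatches(["c1ccccc1", "[s]1n[s]n1"], fake_canonicalize)
print(len(errors), errors)
```

Passing the converter in as a callable keeps the comparison logic independent of the toolkit version, so the same loop can be pointed at 22x or trunk.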

On 22x, we have 190 errors. On trunk 136. On trunk with 22x's
kekulize, we have 334.

So, I was wrong. Trunk is doing better than 22x.

- Noel

2009/10/3 Noel O'Boyle <baoille...@gmail.com>:
> I'm finding the same, that kekulization is worse on the trunk than in
> 2.2.3. I'll probably revert locally to using the code on the branch.
>
> - Noel
>
> 2009/10/2 Nick England <nickengl...@gmail.com>:
>> I've been converting a load of mol files to smiles. The mol files are
>> all without hydrogens, and turn into smiles. However, I then converted
>> these smiles back into smiles (simply to aggregate a number of files
>> into one large one) and 2 of the smiles become wrong.
>>
>> The culprits are:
>> Before:
>> c12c3c(C(=O)c1cc(cc2)C=C)cccc3 2-Vinyl-9H-fluoren-9-one
>> [s]1n[s]n1 1H,3H-1,3,2,4-Dithiadiazete
>>
>> After:
>> C12=C3C(=CC=CC3)C(=O)C1C=C(C=C2)C=C 2-Vinyl-9H-fluoren-9-one
>> S1NSN1 1H,3H-1,3,2,4-Dithiadiazete
>>
>> This is using version 2.2.99.
>>
>> As you can see, the first smiles has had 2 H added to it. The second
>> smiles isn't even valid according to depict. Once it is loaded back
>> into babel and out again as a smiles, it seems to have swapped the
>> hydrogens around from what was intended in the original mol file (or
>> possibly the mol file is wrong; I'm not sure about the numbering of
>> Dithiadiazete compounds).
>>
>> The mol files these came from are given below. The first problem
>> molecule is probably related to the aromatic smiles bug, the second
>> problem molecule is outputting an invalid Smiles according to daylight
>> depict.
>>
>> I'll have a quick root around and see if I can find the bug.
>>
>> Nick England
>>
>> 00798725K
>> csCF900/09280922592D
>>
>>  16 18  0  0  0  0  0  0  0  0999 V2000
>>    4.4563   -0.0666    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
>>    3.6473    0.5212    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
>>    5.2653    0.5212    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
>>    4.5608   -1.0611    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
>>    3.9563    1.4723    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
>>    2.6690    0.3133    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
>>    4.9563    1.4723    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
>>    6.1788    0.1145    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
>>    5.4744   -1.4678    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
>>    3.2872    2.2153    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
>>    2.0000    1.0564    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
>>    5.5441    2.2813    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
>>    6.2834   -0.8800    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
>>    2.3090    2.0075    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
>>    7.1969   -1.2868    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
>>    7.3015   -2.2813    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
>>  1  2  1  0  0  0  0
>>  1  3  2  0  0  0  0
>>  1  4  1  0  0  0  0
>>  2  5  2  0  0  0  0
>>  2  6  1  0  0  0  0
>>  3  7  1  0  0  0  0
>>  3  8  1  0  0  0  0
>>  4  9  2  0  0  0  0
>>  5  7  1  0  0  0  0
>>  5 10  1  0  0  0  0
>>  6 11  2  0  0  0  0
>>  7 12  2  0  0  0  0
>>  8 13  2  0  0  0  0
>>  9 13  1  0  0  0  0
>>  10 14  2  0  0  0  0
>>  11 14  1  0  0  0  0
>>  13 15  1  0  0  0  0
>>  15 16  2  0  0  0  0
>> M  END
>>
>> 01513195K
>> csCF900/09290901162D
>>
>>  4  4  0  0  0  0  0  0  0  0999 V2000
>>    2.7071   -0.7070    0.0000 S   0  0  0  0  0  4  0  0  0  0  0  0
>>    2.0000    0.0000    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
>>    3.4142    0.0000    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
>>    2.7071    0.7070    0.0000 S   0  0  0  0  0  4  0  0  0  0  0  0
>>  1  2  1  0  0  0  0
>>  1  3  2  0  0  0  0
>>  2  4  2  0  0  0  0
>>  3  4  1  0  0  0  0
>> M  END
>>
>>
>>
>>
>> 2009/7/21 Yongjin Xu <yongjin...@gmail.com>:
>>> One more example may be related:
>>> C1=S=NOC1=O   -->  c1snoc1=O
>>> The output SMILES is wrong according to the Daylight parser. Actually, the
>>> output is ambiguous, since there may be several ways to place the double
>>> bonds and to assign the oxidation state of S. This information just gets
>>> removed during the conversion.
>>> By the way, does anyone know why I am getting a segmentation error when I
>>> try to use Separate() in OBMol, when the molecule I pass to it is something
>>> like:  c1ccccc1.Cc1cccnc1
>>> Thanks
>>> Yongjin
>>>
>>> On Tue, Jul 21, 2009 at 7:32 AM, Craig A. James <cja...@emolecules.com>
>>> wrote:
>>>>
>>>> Noel O'Boyle wrote:
>>>> > I would say that the evidence could just as well point to a
>>>> > kekulization bug. It should be easy though to rule in/out a smiles
>>>> > parser error, right?
>>>>
>>>> Well, it's not parentheses. Another beautiful theory shot down by ugly
>>>> data:
>>>>
>>>> c1c2nonc2ccc1  ==>  c1ccc2nonc2c1
>>>> c1cc2nonc2cc1  ==>  C1CCC2NONC2C1
>>>>
>>>> This is the simplest example yet.  I'll keep digging.
>>>>
>>>> I'm now going on the theory that ring-closure parsing sets the internal
>>>> state variables (_order, _aromNH, and so forth) slightly differently than
>>>> normal bond parsing, resulting in missing information that the Kekule code
>>>> needs.  I don't believe it's the Kekule code itself since these molecules
>>>> are identical in every respect.  It's just about got to be in the SMILES
>>>> parser.
>>>>
>>>> I have to say, now that I'm looking at the SMILES parser in some detail,
>>>> the aromaticity detection is very confusing.  It's trying to give tentative
>>>> aromaticity assignment to bonds as it parses; for example, at line 850, it
>>>> decides that if this atom is aromatic and the previous one was aromatic,
>>>> then the bond is assigned order 5 ("potential aromatic"), which is bogus.
>>>>
>>>> This may just be a matter of misleading comments.  If bond order 5 is
>>>> considered "unspecified bond type" rather than the misleading "potential
>>>> aromatic," the semantics of the code would be more correct.
>>>>
>>>> Slogging onward...
>>>>
>>>> Craig
>>>>
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> OpenBabel-Devel mailing list
>>>> OpenBabel-Devel@lists.sourceforge.net
>>>> https://lists.sourceforge.net/lists/listinfo/openbabel-devel
>>>
>>>
>>>
>>>
>>
>
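
Craig's point that the two inputs c1c2nonc2ccc1 and c1cc2nonc2cc1 describe the same molecule can be checked without any toolkit. The following is a minimal sketch under stated assumptions: `parse_simple_smiles` and `same_graph` are hypothetical helpers (not Open Babel functions), and the toy parser handles only one-letter atoms and single-digit ring closures, with no branches, brackets, or bond symbols.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Graph = Tuple[List[str], Set[FrozenSet[int]]]


def parse_simple_smiles(smiles: str) -> Graph:
    """Build (atom labels, bonds) from a branch-free SMILES string."""
    atoms: List[str] = []
    bonds: Set[FrozenSet[int]] = set()
    open_rings: Dict[str, int] = {}  # ring digit -> atom that opened it
    prev = None
    for ch in smiles:
        if ch.isalpha():
            atoms.append(ch)
            cur = len(atoms) - 1
            if prev is not None:
                bonds.add(frozenset((prev, cur)))  # chain bond
            prev = cur
        elif ch.isdigit() and prev is not None:
            if ch in open_rings:
                bonds.add(frozenset((open_rings.pop(ch), prev)))  # ring closure
            else:
                open_rings[ch] = prev
        else:
            raise ValueError(f"unsupported SMILES character: {ch!r}")
    if open_rings:
        raise ValueError("unclosed ring bond")
    return atoms, bonds


def same_graph(a: Graph, b: Graph) -> bool:
    """Backtracking test for a label-preserving graph isomorphism."""
    atoms_a, bonds_a = a
    atoms_b, bonds_b = b
    if sorted(atoms_a) != sorted(atoms_b) or len(bonds_a) != len(bonds_b):
        return False
    n = len(atoms_a)
    mapping: List[int] = []  # mapping[i] = atom of b matched to atom i of a

    def extend() -> bool:
        i = len(mapping)
        if i == n:
            return True
        for j in range(n):
            if j in mapping or atoms_a[i] != atoms_b[j]:
                continue
            # Bonds to already-matched atoms must agree in both graphs.
            if all(
                (frozenset((i, k)) in bonds_a)
                == (frozenset((j, mapping[k])) in bonds_b)
                for k in range(i)
            ):
                mapping.append(j)
                if extend():
                    return True
                mapping.pop()
        return False

    return extend()


first = parse_simple_smiles("c1c2nonc2ccc1")
second = parse_simple_smiles("c1cc2nonc2cc1")
print(same_graph(first, second))
```

Because the two strings differ only in where the ring-closure digits fall, a check like this supports looking at how the parser handles ring closures rather than at the molecule itself.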
