On Fri, May 11, 2012 at 2:32 AM, Andrew Dalke <[email protected]> wrote:
>
> I've posted a number of them, which look like parser errors, to the
> bug tracker. The ones I'm not so certain about are:

Those are mostly fixed now. I still need to think about 3525673 a bit more.

> - 304 records contain an R group in the SMILES output, like
>
> [R1]C([R2])(O)O CHEBI:63734
>
> How do I tell the SMILES reader to accept R groups, or the SMILES writer to 
> filter those out?

The handling of R groups in mol blocks, and their translation into
something sensible on SMILES output, needs to be fixed. It's a good
bug to report.
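
Until that's fixed, a possible workaround is to filter those records
out of the writer's output yourself. This is only a rough sketch in
plain Python, not an RDKit feature: the helper name has_r_group is
made up for illustration, and it assumes the R groups always show up
as bracket atoms of the form [R] or [R<digits>], as in your example.

```python
import re

# Assumption: R groups appear as bracket atoms such as [R], [R1], [R12].
# Real elements like [Ru] or [Rb+] do not match, because only digits may
# sit between the "R" and the closing bracket.
R_GROUP = re.compile(r"\[R\d*\]")

def has_r_group(smiles):
    """Return True if the SMILES string contains an R-group placeholder."""
    return R_GROUP.search(smiles) is not None

records = ["[R1]C([R2])(O)O", "CCO", "[Ru]"]
clean = [smi for smi in records if not has_r_group(smi)]
```

Here clean keeps only the strings without an R-group placeholder.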

> - Two records contain atom maps:
>
> [*:0]C1C(=O)N2C1(OC)SCC([*:0])=C2C(=O)O CHEBI:55429
> [*:0]C(=O)NC1C(=O)N2C1CCC([*:0])=C2C(O)=O CHEBI:55504
>
> How do I tell the SMILES reader to accept atom maps, or the SMILES writer to 
> filter those out?

The SMILES reader does accept atom maps. Unfortunately it only accepts
positive integer atom maps. It should be accepting non-negative atom
maps. I'll fix this.
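
In the meantime, if the zero-valued maps carry no information you
need, one stopgap is to strip them from the string before parsing.
This is a sketch in plain Python rather than anything in the RDKit;
strip_zero_maps is a made-up helper name, and it assumes a map number
is always the last thing inside a bracket atom.

```python
import re

# Assumption: an atom map is written as ":<number>" immediately before the
# closing bracket of a bracket atom, so ":0]" can only be a zero-valued map.
ZERO_MAP = re.compile(r":0\]")

def strip_zero_maps(smiles):
    """Remove atom maps whose value is exactly 0; leave other maps alone."""
    return ZERO_MAP.sub("]", smiles)

print(strip_zero_maps("[*:0]C(=O)NC1C(=O)N2C1CCC([*:0])=C2C(O)=O"))
```

Maps such as [CH3:10] are left untouched because the pattern requires
the colon to sit directly before the 0.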

> - some SMILES contain a '?' in them
>
> For example, CHEBI:15431 generates the SMILES
>
> Cc1c2n3c(c1C=C)C=c1c(C)c(C=C)c4n1?[Mg]31n3c(c(C)c(CCC(O)=O)c3=Cc3c(CCC(O)=O)c(C)c(n3?1)=C2)=C4
>
>
> while the reference SMILES from
>  http://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:15431
> is
>  CC1=C(CCC(O)=O)C2=N\C\1=C/c1c(C)c(C=C)c3\C=C4/N=C(C=c5c(C)c(CCC(O)=O)c(=C2)n5[Mg]n13)C(C=C)=C/4C
>
> The two '?'s likely come from the bonds in the SD record of type '8'.
>
> Is this the expected behavior? Shouldn't it generate an exception message 
> somewhere earlier than during reading the structure back in?

It's expected (in that it's how the code is written), but it's easy to
argue that it's not correct.
Here's what's going on:
In CTABs, bond type 8 is an "Any" bond; it's supposed to only be used
in queries. When a CTAB containing one is parsed, a query bond is
added to the molecule.
The SMILES writer has logic that inserts a "?" into the SMILES if it
sees a bond type it doesn't know how to handle; the "Any bond" query
bond falls into that category.
The SMILES parser, as you've found, doesn't know what to do with that "?".
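
If you want to catch these records before they get anywhere near the
SMILES writer, one option is to look for type 8 bonds in the mol block
directly. The following is a rough sketch that assumes V2000 mol
blocks with the usual fixed-width columns; the function name
any_bond_pairs and the toy mol block are made up for illustration.

```python
def any_bond_pairs(mol_block):
    """Return (begin, end) atom numbers for every type 8 bond in a V2000 mol block."""
    lines = mol_block.splitlines()
    counts = lines[3]                 # counts line follows the 3 header lines
    n_atoms = int(counts[0:3])
    n_bonds = int(counts[3:6])
    first_bond = 4 + n_atoms          # bond block starts after the atom block
    hits = []
    for line in lines[first_bond:first_bond + n_bonds]:
        begin, end, bond_type = int(line[0:3]), int(line[3:6]), int(line[6:9])
        if bond_type == 8:            # 8 is the "Any" query bond
            hits.append((begin, end))
    return hits

# Hypothetical two-atom record, not taken from ChEBI.
EXAMPLE = "\n".join([
    "toy",
    "  hypothetical",
    "",
    "  2  1  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 N   0  0",
    "    1.5000    0.0000    0.0000 Mg  0  0",
    "  1  2  8  0",
    "M  END",
])
print(any_bond_pairs(EXAMPLE))
```

Any record for which this returns a non-empty list is one where the
SMILES writer would currently emit a "?".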

I see a number of possible solutions:
1) modify the SMILES parser to recognize "?" as a bond with
unspecified order. I don't like introducing completely new syntax like
this.
2) modify the SMILES writer to write "~" in these cases and the SMILES
parser to recognize that. I like this better because "~" at least has
meaning in SMARTS.
3) modify the SMILES writer to fail when it sees query features like this.

I would lean towards 2), but I'm happy to hear counter-arguments to that.
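
To make the difference between 2) and 3) concrete, here is a toy
model of the writer's bond-symbol lookup. It is not the RDKit code:
the bond-type names, the bond_symbol function, and the on_query switch
are all hypothetical, and the final "?" fallback just mimics the
current behavior described above.

```python
# Toy model only: names and the on_query switch are hypothetical.
BOND_SYMBOLS = {"single": "-", "double": "=", "triple": "#", "aromatic": ":"}

def bond_symbol(bond_type, on_query="tilde"):
    """Return the symbol a writer would emit for the given bond type."""
    if bond_type in BOND_SYMBOLS:
        return BOND_SYMBOLS[bond_type]
    if bond_type == "any":
        if on_query == "tilde":
            return "~"    # option 2: reuse the SMARTS "any bond" symbol
        if on_query == "fail":
            # option 3: refuse to write query features as SMILES
            raise ValueError("query bond cannot be written as SMILES")
    return "?"            # current behavior for anything unrecognized

print(bond_symbol("any"))
```

With option 2 the parser would also need to accept "~", which this
sketch does not cover.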

-greg

_______________________________________________
Rdkit-discuss mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss