On Fri, May 11, 2012 at 2:32 AM, Andrew Dalke <[email protected]> wrote:
>
> I've posted a number of them, which look like parser errors, to the
> bug tracker. The ones I'm not so certain about are:

Those are mostly fixed now. I still need to think about 3525673 a bit more.

> - 304 records contain an R group in the SMILES output, like
>
>     [R1]C([R2])(O)O CHEBI:63734
>
> How do I tell the SMILES reader to accept R groups, or the SMILES writer to
> filter those out?

The handling of R groups in mol blocks and their translation into something
sensible on SMILES output is something that needs to be fixed. It's a good bug
to report.

> - Two records contain atom maps:
>
>     [*:0]C1C(=O)N2C1(OC)SCC([*:0])=C2C(=O)O CHEBI:55429
>     [*:0]C(=O)NC1C(=O)N2C1CCC([*:0])=C2C(O)=O CHEBI:55504
>
> How do I tell the SMILES reader to accept atom maps, or the SMILES writer to
> filter those out?

The SMILES reader does accept atom maps. Unfortunately it only accepts
positive integer atom maps. It should be accepting non-negative atom maps.
I'll fix this.

> - some SMILES contain a '?' in them
>
> For example, CHEBI:15431 generates the SMILES
>
>     Cc1c2n3c(c1C=C)C=c1c(C)c(C=C)c4n1?[Mg]31n3c(c(C)c(CCC(O)=O)c3=Cc3c(CCC(O)=O)c(C)c(n3?1)=C2)=C4
>
> while the reference SMILES from
> http://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:15431
> is
>
>     CC1=C(CCC(O)=O)C2=N\C\1=C/c1c(C)c(C=C)c3\C=C4/N=C(C=c5c(C)c(CCC(O)=O)c(=C2)n5[Mg]n13)C(C=C)=C/4C
>
> The two '?'s likely come from the bonds in the SD record of type '8'.
>
> Is this the expected behavior? Shouldn't it generate an exception message
> somewhere earlier than during reading the structure back in?

It's expected (in that it's how the code is written), but it's easy to argue
that it's not correct. Here's what's going on:

In CTABs, bond type 8 is an "Any" bond; it's supposed to only be used in
queries. When a CTAB containing one is parsed, a query bond is added to the
molecule. The SMILES writer has logic that inserts a "?" into the SMILES if it
sees a bond type it doesn't know how to handle; the "Any bond" query bond
falls into that category. The SMILES parser, as you've found, doesn't know
what to do with that "?".

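Until these are fixed, one way to catch the affected records before they are
read back in is to screen the generated SMILES strings as plain text. The
following is only a rough sketch under that assumption: `screen_smiles` and
its labels are hypothetical names, the patterns are based solely on the
examples quoted above, and none of it is part of the RDKit.

```python
import re

# Hypothetical helper, not an RDKit function. The patterns below only cover
# the forms that appear in the examples quoted in this thread.
_R_GROUP = re.compile(r"\[R\d*\]")             # e.g. [R1], [R2]
_ATOM_MAP_ZERO = re.compile(r"\[[^\]]*:0\]")   # e.g. [*:0]


def screen_smiles(smiles):
    """Return labels for the problem features found in a SMILES string."""
    problems = []
    if "?" in smiles:
        # "?" is what the writer emits for a bond type it cannot handle.
        problems.append("unknown-bond")
    if _R_GROUP.search(smiles):
        problems.append("r-group")
    if _ATOM_MAP_ZERO.search(smiles):
        # Atom map 0 is currently rejected by the reader.
        problems.append("atom-map-0")
    return problems


if __name__ == "__main__":
    for smi in ("[R1]C([R2])(O)O", "[*:0]C(=O)NC1C(=O)N2C1CCC([*:0])=C2C(O)=O"):
        print(smi, screen_smiles(smi))
```

A record with an empty result list can go on to the SMILES reader; the others
can be set aside along with the reason.
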
I see a number of possible solutions:

1) modify the SMILES parser to recognize "?" as a bond with unspecified
   order. I don't like introducing completely new syntax like this.
2) modify the SMILES writer to write "~" in these cases and the SMILES parser
   to recognize that. I like this better because "~" at least has meaning in
   SMARTS.
3) modify the SMILES writer to fail when it sees query features like this.

I would lean towards 2), but I'm happy to hear counter-arguments to that.

-greg

_______________________________________________
Rdkit-discuss mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

