I used RDKit to process the ChEBI data set. I found a number of oddities beyond the all-too-common
  [02:05:37] Explicit valence for atom # 12 N, 4, is greater than permitted
  [02:05:37] ERROR: Could not sanitize molecule ending on line 16312

I've posted a number of them, which look like parser errors, to the
bug tracker. The ones I'm not so certain about are:

 - 304 records contain an R group in the SMILES output, like

     [R1]C([R2])(O)O CHEBI:63734

   How do I tell the SMILES reader to accept R groups, or the SMILES
   writer to filter those out?

 - Two records contain atom maps:

     [*:0]C1C(=O)N2C1(OC)SCC([*:0])=C2C(=O)O CHEBI:55429
     [*:0]C(=O)NC1C(=O)N2C1CCC([*:0])=C2C(O)=O CHEBI:55504

   How do I tell the SMILES reader to accept atom maps, or the SMILES
   writer to filter those out?

 - Some SMILES contain a '?' in them. For example, CHEBI:15431
   generates the SMILES

     Cc1c2n3c(c1C=C)C=c1c(C)c(C=C)c4n1?[Mg]31n3c(c(C)c(CCC(O)=O)c3=Cc3c(CCC(O)=O)c(C)c(n3?1)=C2)=C4

   while the reference SMILES from
   http://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:15431 is

     CC1=C(CCC(O)=O)C2=N\C\1=C/c1c(C)c(C=C)c3\C=C4/N=C(C=c5c(C)c(CCC(O)=O)c(=C2)n5[Mg]n13)C(C=C)=C/4C

   The two '?'s likely come from the bonds of type '8' in the SD
   record. Is this the expected behavior? Shouldn't it generate an
   exception message somewhere earlier than when the structure is
   read back in?

				Andrew
				[email protected]
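The three kinds of records listed above can be flagged with plain string matching before any SMILES is read back in. The following is only a minimal sketch, assuming one generated SMILES string per record; the function name `classify_smiles` and the category labels are hypothetical, and it uses no RDKit API at all, so it says nothing about how the reader or writer should be configured.

```python
import re

# Bracket atoms such as [R], [R1] or [R2]. Real elements like [Ru] or [Rb]
# do not match, because only digits may follow the "R".
R_GROUP = re.compile(r"\[R\d*\]")

# Bracket atoms that carry an atom map, e.g. [*:0] or [CH3:2].
ATOM_MAP = re.compile(r"\[[^\]]*:\d+\]")


def classify_smiles(smiles):
    """Return the set of oddity labels found in one SMILES string.

    The labels are hypothetical names chosen for this sketch:
    "r_group", "atom_map" and "unknown_bond" (a literal '?').
    """
    found = set()
    if R_GROUP.search(smiles):
        found.add("r_group")
    if ATOM_MAP.search(smiles):
        found.add("atom_map")
    if "?" in smiles:
        found.add("unknown_bond")
    return found


if __name__ == "__main__":
    examples = [
        ("CHEBI:63734", "[R1]C([R2])(O)O"),
        ("CHEBI:55429", "[*:0]C1C(=O)N2C1(OC)SCC([*:0])=C2C(=O)O"),
    ]
    for chebi_id, smiles in examples:
        print(chebi_id, sorted(classify_smiles(smiles)))
```

Run over the writer's output, a filter like this would at least report the questionable records at write time rather than when the SMILES fails to parse later.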

