I used RDKit to process the ChEBI data set.

I found a number of oddities beyond the all-too-common sanitization failures:

[02:05:37] Explicit valence for atom # 12 N, 4, is greater than permitted
[02:05:37] ERROR: Could not sanitize molecule ending on line 16312


I've posted a number of them, which look like parser errors, to the
bug tracker. The ones I'm not so certain about are:

- 304 records contain an R group in the SMILES output, like:

[R1]C([R2])(O)O CHEBI:63734

How do I tell the SMILES reader to accept R groups, or the SMILES writer to 
filter those out?

- Two records contain atom maps:

[*:0]C1C(=O)N2C1(OC)SCC([*:0])=C2C(=O)O CHEBI:55429
[*:0]C(=O)NC1C(=O)N2C1CCC([*:0])=C2C(O)=O CHEBI:55504

How do I tell the SMILES reader to accept atom maps, or the SMILES writer to 
filter those out?
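
Until the reader or writer handles these cases, one text-level workaround
is to flag such records after the SMILES have been written. The sketch
below uses only Python's `re` module, not any RDKit call. The names
`classify_smiles` and `strip_atom_maps` are hypothetical helpers, and the
patterns assume that R groups appear as `[R]` or `[R<digits>]` and atom maps
as `:<digits>` inside a bracket atom, as in the records above.

```python
import re

# Assumption: R groups are written as a bracket atom "R" plus optional digits,
# e.g. [R], [R1], [R12]. Real elements such as [Rb] or [Ru] do not match.
R_GROUP = re.compile(r"\[R\d*\]")

# A bracket atom that ends with an atom map, e.g. [*:0] or [CH3:2].
ATOM_MAP = re.compile(r"\[[^\]]*:\d+\]")

# Same idea, but capturing everything before the ":<digits>" so it can be kept.
ATOM_MAP_SUFFIX = re.compile(r"(\[[^\]:]*):\d+\]")


def classify_smiles(smiles):
    """Return the set of oddities ("r_group", "atom_map") found in a SMILES string."""
    found = set()
    if R_GROUP.search(smiles):
        found.add("r_group")
    if ATOM_MAP.search(smiles):
        found.add("atom_map")
    return found


def strip_atom_maps(smiles):
    """Drop the ":<digits>" map from every bracket atom, leaving the atom itself."""
    return ATOM_MAP_SUFFIX.sub(r"\1]", smiles)


if __name__ == "__main__":
    for smi in ("[R1]C([R2])(O)O", "[*:0]C1C(=O)N2C1(OC)SCC([*:0])=C2C(=O)O"):
        print(smi, sorted(classify_smiles(smi)), strip_atom_maps(smi))
```

This only inspects the text of the SMILES. It does not change how the
toolkit parses or writes anything, so it is a filter applied after the
fact, not an answer to the question above.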

- Some SMILES contain a '?'.

For example, CHEBI:15431 generates the SMILES

Cc1c2n3c(c1C=C)C=c1c(C)c(C=C)c4n1?[Mg]31n3c(c(C)c(CCC(O)=O)c3=Cc3c(CCC(O)=O)c(C)c(n3?1)=C2)=C4


while the reference SMILES from
  http://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:15431 
is
  
CC1=C(CCC(O)=O)C2=N\C\1=C/c1c(C)c(C=C)c3\C=C4/N=C(C=c5c(C)c(CCC(O)=O)c(=C2)n5[Mg]n13)C(C=C)=C/4C

The two '?'s likely come from the bonds of type '8' in the SD record.

Is this the expected behavior? Shouldn't it raise an error earlier, rather
than only when the structure is read back in?
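
To check the bond-type guess independently of the toolkit, the bond block
of each record can be scanned before conversion. The sketch below is plain
Python, not an RDKit API. It assumes a V2000 mol block with the usual
fixed-width layout (counts on the fourth line, three-character fields for
the two atom indices and the bond type) and does not handle V3000 records.
`find_any_bonds` is a hypothetical helper name.

```python
def find_any_bonds(molblock):
    """Return (atom1, atom2) pairs for V2000 bonds whose type field is 8 ("any")."""
    lines = molblock.splitlines()
    counts = lines[3]                 # counts line follows the 3 header lines
    num_atoms = int(counts[0:3])      # columns 1-3: number of atoms
    num_bonds = int(counts[3:6])      # columns 4-6: number of bonds
    first_bond = 4 + num_atoms        # bond block starts right after the atom block
    hits = []
    for line in lines[first_bond:first_bond + num_bonds]:
        atom1 = int(line[0:3])
        atom2 = int(line[3:6])
        bond_type = int(line[6:9])
        if bond_type == 8:
            hits.append((atom1, atom2))
    return hits


if __name__ == "__main__":
    # Made-up two-atom record, only to show the layout being parsed.
    example = "\n".join([
        "",
        "  example",
        "",
        "  2  1  0  0  0  0  0  0  0  0999 V2000",
        "    0.0000    0.0000    0.0000 N   0  0",
        "    1.0000    0.0000    0.0000 Mg  0  0",
        "  1  2  8  0",
        "M  END",
    ])
    print(find_any_bonds(example))
```

Records for which this returns a non-empty list could be logged or skipped
before SMILES generation, which gives a report at the point where the input
is read instead of when the output is parsed back in.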



                                Andrew
                                [email protected]



_______________________________________________
Rdkit-discuss mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
