Hi Lewis,

Structures in SureChEMBL are currently generated automatically, so this kind of 
error unfortunately happens…

My colleagues will address this issue as soon as possible.

Cheers,
Nicolas
-----------------------------------------------
Dr Nicolas Bosc
Data Mining and Analysis Scientist
ChEMBL group
EMBL-EBI
Wellcome Genome Campus
Hinxton, Cambridge, CB10 1SD
United Kingdom

nb...@ebi.ac.uk
+44 1223 492519


> On 15 Dec 2021, at 20:42, Lewis Martin <lewis.marti...@gmail.com> wrote:
> 
> Thanks a lot Greg! That is indeed very helpful. 
> 
> Just to know that the molecule is odd is helpful too. The mol blocks appear 
> to be V2000 format and have names like "Mrv0541 03021215572D" which says 
> ChemAxon Marvin to me, but I'm still unsure why SureChEMBL would use such a 
> representation (it doesn't look like a faithful transcription from the source 
> patent). Off-topic, but if anyone happens to have an insight or connection 
> with SureChEMBL, please do reach out!
> 
> Cheers
> Lewis
> 
>  
> 
> 
> On Wed, Dec 15, 2021 at 4:24 PM Greg Landrum <greg.land...@gmail.com> wrote:
> Hi Lewis,
> 
> Dealing with all the strange chemical representations that show up "in the 
> wild" is an ongoing struggle.
> 
> Your first example is pretty clearly intended to be an azide and we can 
> certainly add a rule to normalize that one to what the RDKit expects it to be 
> (there is already a rule for C-N=N#N, but that doesn't help here). That 
> won't happen before the next feature release though.
> 
> I'm not really sure what the intent was for the two four-coordinate neutral 
> Ns in the second molecule, so I think it's unlikely that we'd add a standard 
> cleanup for that one.
> 
> However! The good news is that there's a pretty easy (and efficient) way to 
> fix this yourself. We added a new method to chemical reactions in the 2021.09 
> release which allows you to modify a molecule in place (subject to some 
> constraints). This is ideal for doing cleanup transformations like these.
> 
> This gist shows how to write reaction rules for your cases (I guessed at 
> what the Ns are supposed to be) and then use them:
> https://gist.github.com/greglandrum/8fd229bc6bf6c734d1c21da7f2bebebb
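
For readers without the gist to hand, here is a minimal sketch of the in-place
approach, covering the azide case only. It assumes RDKit 2021.09 or later and
that the method referred to is ChemicalReaction.RunReactantInPlace. The
reaction SMARTS and the helper name fix_azide are my own illustrative
assumptions and are not copied from the gist.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Assumed rule (not taken from the gist): rewrite the over-valent drawing
# C-[N+]#N=[N-] as the conventional azide form C-N=[N+]=[N-].
AZIDE_FIX = AllChem.ReactionFromSmarts(
    "[N+:1]#[N+0:2]=[N-:3]>>[N+0:1]=[N+:2]=[N-:3]"
)
AZIDE_FIX.Initialize()


def fix_azide(smiles):
    """Return a cleaned-up SMILES, or None if the SMILES syntax is invalid.

    Raises an RDKit sanitization error if the molecule is still invalid
    after the rule has been applied (or if the rule did not match).
    """
    # Skip sanitization here: the raw structure has an impossible valence.
    mol = Chem.MolFromSmiles(smiles, sanitize=False)
    if mol is None:
        return None
    # Edits mol directly instead of building new product molecules.
    AZIDE_FIX.RunReactantInPlace(mol)
    # Sanitize only after the repair, so valence checks see the fixed form.
    Chem.SanitizeMol(mol)
    return Chem.MolToSmiles(mol)


if __name__ == "__main__":
    print(fix_azide("COC(=O)C1=C(C=CC=C1)C1=CC=C(C[N+]#[N]=[N-])C=C1"))
```

The key design point is the ordering: parse with sanitize=False, apply the
rule, and only then sanitize.
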
> 
> Hope this helps,
> -greg
> 
> 
> On Wed, Dec 15, 2021 at 12:21 AM Lewis Martin <lewis.marti...@gmail.com> wrote:
> Hi All, 
> Reading molecules from a bulk download of SureChEMBL, I come across a fair 
> few molecules that fail to parse. Not sure whether they SHOULD parse or not. 
> 
> Here is an example: https://www.surechembl.org/chemical/SCHEMBL386
> with SMILES code: COC(=O)C1=C(C=CC=C1)C1=CC=C(C[N+]#[N]=[N-])C=C1
> 
> Even reading the SMILES code one can see that there are too many bonds in 
> there: a neutral nitrogen with both a triple and a double bond to other atoms. 
> 
> Another example: https://www.surechembl.org/chemical/SCHEMBL33957
> smiles: NC(N)=[NH]C1=NC(CSCC[NH]=CNS(=O)(=O)C2=CC=C(Br)C=C2)=CS1
> 
> Again, valence for a nitrogen is off. 
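
One way to see exactly which atoms RDKit objects to in structures like these is
to parse without sanitization and then ask Chem.DetectChemistryProblems for
the list of issues. This is only a sketch: the helper name chemistry_problems
is made up for illustration.

```python
from rdkit import Chem


def chemistry_problems(smiles):
    """Return (problem type, message) pairs for a SMILES string.

    An empty list means RDKit found nothing to complain about.
    """
    # sanitize=False keeps the molecule even if its valences are invalid.
    mol = Chem.MolFromSmiles(smiles, sanitize=False)
    if mol is None:
        # Only reached when the SMILES string itself cannot be parsed.
        return [("ParseError", "could not parse SMILES syntax")]
    return [(p.GetType(), p.Message()) for p in Chem.DetectChemistryProblems(mol)]


if __name__ == "__main__":
    examples = [
        "COC(=O)C1=C(C=CC=C1)C1=CC=C(C[N+]#[N]=[N-])C=C1",
        "NC(N)=[NH]C1=NC(CSCC[NH]=CNS(=O)(=O)C2=CC=C(Br)C=C2)=CS1",
    ]
    for smi in examples:
        for kind, message in chemistry_problems(smi):
            print(kind, message)
```

Running this over a bulk download would give a rough tally of which kinds of
problems dominate before deciding which cleanup rules are worth writing.
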
> 
> Should I expect to parse these with RDKit? Might there be some way around 
> this? It's a significant fraction of the molecules in SureChEMBL. 
> 
> Thanks team!
> Lewis 
> _______________________________________________
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss




