Hi Lewis,

Currently structures are generated automatically in SureChEMBL so this kind of error unfortunately happens…
My colleagues will address this issue as soon as possible.

Cheers,
Nicolas

-----------------------------------------------
Dr Nicolas Bosc
Data Mining and Analysis Scientist
ChEMBL group
EMBL-EBI
Wellcome Genome Campus
Hinxton, Cambridge, CB10 1SD
United Kingdom
nb...@ebi.ac.uk
+44 1223 492519

> On 15 Dec 2021, at 20:42, Lewis Martin <lewis.marti...@gmail.com> wrote:
>
> Thanks a lot Greg! That is indeed very helpful.
>
> Just to know that the molecule is odd is helpful too. The mol blocks appear
> to be V2000 format and have names like "Mrv0541 03021215572D" which says
> ChemAxon Marvin to me, but I'm still unsure why SureChEMBL would use such a
> representation (it doesn't look like a faithful transcription from the source
> patent). Off-topic, but if anyone happens to have an insight or connection
> with SureChEMBL, please do reach out!
>
> Cheers
> Lewis
>
> On Wed, Dec 15, 2021 at 4:24 PM Greg Landrum <greg.land...@gmail.com> wrote:
> Hi Lewis,
>
> Dealing with all the strange chemical representations that show up "in the
> wild" is an ongoing struggle.
>
> Your first example is pretty clearly intended to be an azide and we can
> certainly add a rule to normalize that one to what the RDKit expects it to be
> (there already is a rule for C-N=N#N, but that doesn't help here.). That
> won't happen before the next feature release though.
>
> I'm not really sure what the intent was for the two four-coordinate neutral
> Ns in the second molecule, so I think it's unlikely that we'd add a standard
> cleanup for one.
>
> However! The good news is that there's a pretty easy (and efficient) way to
> fix this yourself. We added a new method to chemical reactions in the 2021.09
> release which allows you to modify a molecule in place (subject to some
> constraints). This is ideal for doing cleanup transformations like these.
>
> This gist shows how to write reaction rules for your cases (I guessed for
> what the Ns are supposed to be) and then use them:
> https://gist.github.com/greglandrum/8fd229bc6bf6c734d1c21da7f2bebebb
>
> Hope this helps,
> -greg
>
> On Wed, Dec 15, 2021 at 12:21 AM Lewis Martin <lewis.marti...@gmail.com> wrote:
> Hi All,
> Reading molecules from a bulk download of SureChEMBL, I come across a fair
> few molecules that fail to parse. Not sure whether they SHOULD parse or not.
>
> Here is an example: https://www.surechembl.org/chemical/SCHEMBL386
> with SMILES code: COC(=O)C1=C(C=CC=C1)C1=CC=C(C[N+]#[N]=[N-])C=C1
>
> Even reading the SMILES code one can see that there are too many bonds in
> there - a nitrogen triply bonded and doubly bonded to other atoms.
>
> Another example: https://www.surechembl.org/chemical/SCHEMBL33957
> smiles: NC(N)=[NH]C1=NC(CSCC[NH]=CNS(=O)(=O)C2=CC=C(Br)C=C2)=CS1
>
> Again, valence for a nitrogen is off.
>
> Should I expect to parse these with RDKit? Might there be some way around
> this? It's a significant fraction of the molecules in SureChEMBL.
>
> Thanks team!
> Lewis
> _______________________________________________
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
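
For anyone reading this thread later, here is a minimal sketch of the in-place cleanup Greg describes, applied to the azide example only. It assumes RDKit 2021.09 or newer. The reaction SMARTS and the helper name clean_azide are my own illustration and are not copied from the gist, so treat the rule as an assumption about what the structure was meant to be. I have not attempted a rule for the second molecule, since the intent of its four-coordinate Ns is unclear.

```python
from rdkit import Chem
from rdkit.Chem import rdChemReactions

# SMILES from the thread (SCHEMBL386): the middle N carries a triple and a double bond.
smi = "COC(=O)C1=C(C=CC=C1)C1=CC=C(C[N+]#[N]=[N-])C=C1"

# Assumed cleanup rule (illustrative, not taken from the gist):
# rewrite C-[N+]#N=[N-] as the usual azide form C-N=[N+]=[N-].
azide_fix = rdChemReactions.ReactionFromSmarts(
    "[N+:1]#[N+0:2]=[N-:3]>>[N+0:1]=[N+:2]=[N-:3]"
)
azide_fix.Initialize()


def clean_azide(smiles):
    """Hypothetical helper: parse without sanitization, fix the azide, then sanitize."""
    mol = Chem.MolFromSmiles(smiles, sanitize=False)
    if mol is None:  # SMILES syntax error rather than a valence problem
        return None
    # Prepare the unsanitized molecule for substructure matching.
    mol.UpdatePropertyCache(strict=False)
    Chem.FastFindRings(mol)
    # Modify the molecule in place; returns whether anything was changed.
    azide_fix.RunReactantInPlace(mol)
    # Raises an exception if the molecule still has chemistry problems.
    Chem.SanitizeMol(mol)
    return mol


print(Chem.MolFromSmiles(smi) is None)  # default parsing rejects the SMILES
fixed = clean_azide(smi)
print(Chem.MolToSmiles(fixed))
```

Parsing with sanitize=False is what lets the odd structure into RDKit in the first place; sanitization is then run only after the rule has been applied, so anything the rule does not cover still fails loudly.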