Re: [Rdkit-discuss] list of failed chembl ids
Thank you Andrew for the explanation. I was just commenting to my summer intern that you might weigh in.

Brian

From: Andrew Dalke [mailto:da...@dalkescientific.com]
Sent: Tuesday, August 08, 2017 15:21
To: RDKit Discuss (rdkit-discuss@lists.sourceforge.net) <rdkit-discuss@lists.sourceforge.net>
Subject: Re: [Rdkit-discuss] list of failed chembl ids
Re: [Rdkit-discuss] list of failed chembl ids
On Aug 8, 2017, at 22:20, Peter S. Shenkin wrote:
> But I would be curious to see the 51 CHEMBL SMILES that RDKit could not parse.

As of ChEMBL 23, the following files are available:

  - the sdf.gz file
  - pre-computed RDKit Morgan fingerprints in fps.gz format
  - the database available as an SQLite file

I downloaded those three files, de-tar-gz'ed the SQLite database, and did the following:

  1) get the ids from the .sdf.gz file
  2) get the ids from the .fps.gz file
  3) Find the ids which are only in the .sdf.gz file
  4) For each id, find its canonical SMILES in the SQLite file
  5) Print the list of ids

(I also checked that there were no ids in the FPS file which weren't in the SDF.)

Here are the SMILES for the 54 structures that method found (Note: this isn't 51. I know the SD and FPS files are not guaranteed to be perfectly synchronized, so perhaps that's the source of the difference?)

Only in .fps: 0 ids
Only in .sdf: 54 ids
CHEMBL1198593 COc1cc(ccc1N2=N(N=C(N2)c3ccc(cc3)[N+](=O)[O-])c4ccc(cc4)[N+](=O)[O-])c5ccc(c(OC)c5)N6=N(NC(=N6)c7ccc(cc7)[N+](=O)[O-])c8ccc(cc8)[N+](=O)[O-]
CHEMBL1201364 O[C@H]1[C@@H](O)[C@@H](O[C@@H]1COP(=O)(O)O)N2=CNc3c(S)ncnc23
CHEMBL1684167 [Te](Cl)(Cl)c1c1COC
CHEMBL1684168 [Te](Cl)(Cl)c1c1[C@H](C)OC
CHEMBL1684169 [Te](Cl)(Cl)c1c1[C@@H](C)OC
CHEMBL1684170 [Te](Br)(Br)c1c1COC
CHEMBL1684171 [Te](Br)(Br)c1c1[C@H](C)OC
CHEMBL1684172 [Te](Br)(Br)c1c1[C@@H](C)OC
CHEMBL178180 COc1ccc(cc1)[Te](Cl)(Cl)\C(=C\Cl)\C(C)(C)O
CHEMBL179159 COc1ccc(cc1)[Te]2(Cl)OC3(CC3)/C/2=C\Cl
CHEMBL180156 COc1ccc(cc1)[Te](Cl)(Cl)\C=C(/Cl)\c2c2
CHEMBL180355 COc1c1C(=O)\C=C(\c2c2OC)/[Te](Cl)(Cl)Cl
CHEMBL180844 COc1ccc(cc1)[Te]2(Cl)OC3(C3)/C/2=C\Cl
CHEMBL181211 OC(C\C(=C/Cl)\[Te](Cl)(Cl)Cl)c1c1
CHEMBL181880 F[As-](F)(F)(F)(F)F
CHEMBL1972162 CC(C)(C)c1cc2c3c(c1)C(O[Te]3(C)OC2(C(F)(F)F)C(F)(F)F)(C(F)(F)F)C(F)(F)F
CHEMBL1977677 CC(Br)C(=O)N=N1=C2C(=Nc3c13)c452c45
CHEMBL1992123 CC1(O)C(C)(O)C2(C)O[Te]3(OC4(C)C(C)(O)C(C)(O)C4(C)O3)OC12C
CHEMBL1992520 CCN1\C(=C\C#C\C(=C/c2sc3c3[n+]2CC)\C)\Sc4c14.[F-][PH2+5]([F-])([F-])([F-])([F-])[F-]
CHEMBL1998318 CC12O[Te]34OC(C)(C1(C)O3)C2(C)O4
CHEMBL2097021 O[Te](=O)(=O)O
CHEMBL2146197 [Cl-].CC[N+](CC)(CC)Cc1c1.ClC2=C[Te](Cl)(Cl)OC2
CHEMBL2146209 [Cl-].Cl[Te]1(Cl)OCCO1
CHEMBL2146259 N.[Cl-].[Cl-].[Cl-].C1C[O-][Te+4][O-]1
CHEMBL2146289 N.[Cl-].[Cl-].[Cl-].C1C[O-][Te+4][O-]1
CHEMBL2146290 N.[Cl-].[Cl-].[Cl-].CCC1C[O-][Te+4][O-]1
CHEMBL2299271 CN1C=NNC1(=S)c2sc3nnc(c4c4)c(c5c5)c3c2O
CHEMBL3182693 [NH4+].[NH4+].F[Si-2](F)(F)(F)(F)F
CHEMBL3184182 [Na+].[Na+].F[Si-2](F)(F)(F)(F)F
CHEMBL3187332 CC(=O)OCC(NC(=O)C(CC1=C2=CC=CC=C2N=C1)NC(=O)OC(C)(C)C)C3OC(C(OC(=O)C)C3OC(=O)C)N4C=C(C)C(=O)NC4=O
CHEMBL3187972 CNc1ccc(cc1)C(=O)Oc2cc(ON=[N](O)N(C)C)c(cc2C#N)[N+](=O)[O-]
CHEMBL3188868 CN(C)[N](=NOc1cc(ON=[N+]([O-])N2CCN(CC2)C(=O)c3cc(CC4=NNC(=O)c5c45)ccc3F)c(cc1[N+](=O)[O-])[N+](=O)[O-])O
CHEMBL3211150 CCC1N1C(=O)N2=NC(=CN2)C(O)(c3c3)c4c4
CHEMBL3348969 CSCC[C@H](NC(=O)[C@H](CC1=CN=C2=CC=CC=C12)NC(=O)CCNC(=O)OC(C)(C)C)C(=O)N[C@@H](CC(=O)O)C(=O)N[C@@H](Cc3c3)C(=O)N
CHEMBL3349005 C[C@@H](O)[C@@H](CO)NC(=O)[C@@H]1CSSC[C@H](NC(=O)[C@H](N)Cc2c2)C(=O)N[C@@H](Cc3c3)C(=O)N[C@H](CC4=CN=C5=CC=CC=C45)C(=O)N[C@@H](N)C(=O)N[C@@H](C(C)O)C(=O)N1
CHEMBL3392104 [NH4+].[Cl-].Cl[Te]1(Cl)OCCO1
CHEMBL3397072 FC1=Fc2c(C=C1)[nH]cc2C3CCN(N4C(=O)N5C=CC=CC5=C(C4=O)c6ccc(F)cc6)CC3
CHEMBL3544677 CN(Cc1cnc2nc(N)nc(N)c2n1)c3ccc(cc3)C(=O)N[C@@H](CCC(=O)N[C@@H](CCC(=O)O)C(=O)O)C(=O)O
CHEMBL3546168 Cl[Te]1(Cl)OCCO1
CHEMBL3558859 C1C[O-][Te+4][O-]1
CHEMBL3558860 C1C[O-][Te+4][O-]1
CHEMBL3558861 CCC1C[O-][Te+4][O-]1
CHEMBL3559384 CC[N+](CC)(CC)Cc1c1.ClC2=C[Te](Cl)(Cl)OC2
CHEMBL3561635 O.O.O.O.O=C1O[Mg]2(OC(=O)c3c3O2)Oc4c14
CHEMBL3580437 O=C1O[Mg]2(OC(=O)c3c3O2)Oc4c14
CHEMBL3593577 CN1C(=O)NC2=CN3(=C4NC=CC4=C12)C(C3)N5C(=O)Nc6cnc7[nH]ccc7c56
CHEMBL3594279 C[C@H]1O[C@H](C[C@H](O)[C@@H]1O)O[C@H]2[C@@H](O)C[C@H](O[C@H]3[C@@H](O)C[C@H](O[C@@H](C)C[C@H]4CC[C@@H]5[C@H](C[C@@H](O)[C@]6(C)[C@H](CC[C@]56O)C7=CC(=O)O/C/7=C\c8ccc(cc8)N(C)C)[C@@H]4C)O[C@@H]3C)O[C@@H]2C
CHEMBL361437 COc1ccc(cc1)[Te]2(Cl)OC3(C3)/C/2=C\Br
CHEMBL3832892 O.CCN1C(=O)c23c(ccc(C1=O)c23)N4C=C5CN67CCCN8CCN9%10CCCN(CC6)[Cu]789(N%11=NN(C=C%11C%10)c%12ccc%13C(=O)N(CC)C(=O)c%14%12c%13%14)N5=N4.[O-]Cl(=O)(=O)=O.[O-]Cl(=O)(=O)=O
CHEMBL3832893 O.CCN1C(=O)c23c(ccc(C1=O)c23)N4C=C5CN67CCCN8CCN9%10CCCN(CC6)[Zn]789(N%11=NN(C=C%11C%10)c%12ccc%13C(=O)N(CC)C(=O)c%14%12c%13%14)N5=N4.[O-]Cl(=O)(=O)=O.[O-]Cl(=O)(=O)=O
CHEMBL3832897 CCN1C(=O)c23c(ccc(C1=O)c23)N4C=C5CN67CCCN8CCN9%10CCCN(CC6)[Zn]789(N%11=NN(C=C%11C%10)c%12ccc%13C(=O)N(CC)C(=O)c%14%12c%13%14)N5=N4
CHEMBL3833021
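
For anyone who wants to repeat the comparison, the five steps listed in this message could be sketched roughly as follows. This is not the script that produced the output above, only a minimal illustration using the Python standard library. It assumes that each SD record carries its id in a "> <chembl_id>" data field, that FPS data lines are "fingerprint<TAB>id", and that the SQLite file has the ChEMBL tables molecule_dictionary (molregno, chembl_id) and compound_structures (molregno, canonical_smiles). The function names are made up for this sketch.

```python
import gzip
import sqlite3


def sdf_ids(path, tag="chembl_id"):
    """Step 1: collect one SD data field from every record of a gzipped SD file."""
    marker = "<" + tag + ">"
    ids = set()
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            # Assumed layout: a header line "> <chembl_id>", then the value.
            if line.startswith(">") and marker in line:
                value = next(f, "").strip()
                if value:
                    ids.add(value)
    return ids


def fps_ids(path):
    """Step 2: collect the ids from a gzipped FPS file."""
    ids = set()
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.startswith("#") or not line.strip():
                continue  # header or blank line
            # Data lines are tab-separated: hex fingerprint, then the id.
            ids.add(line.rstrip("\n").split("\t")[1])
    return ids


def smiles_for(db_path, chembl_ids):
    """Step 4: look up the canonical SMILES for each id (None if missing)."""
    query = (
        "SELECT cs.canonical_smiles "
        "FROM molecule_dictionary md "
        "JOIN compound_structures cs ON cs.molregno = md.molregno "
        "WHERE md.chembl_id = ?"
    )
    conn = sqlite3.connect(db_path)
    try:
        result = {}
        for chembl_id in sorted(chembl_ids):  # plain string sort
            row = conn.execute(query, (chembl_id,)).fetchone()
            result[chembl_id] = row[0] if row else None
        return result
    finally:
        conn.close()


def report(sdf_path, fps_path, db_path):
    """Steps 3 and 5: diff the two id sets and format the report lines."""
    in_sdf = sdf_ids(sdf_path)
    in_fps = fps_ids(fps_path)
    only_fps = in_fps - in_sdf
    only_sdf = in_sdf - in_fps
    lines = [
        "Only in .fps: %d ids" % len(only_fps),
        "Only in .sdf: %d ids" % len(only_sdf),
    ]
    for chembl_id, smiles in smiles_for(db_path, only_sdf).items():
        lines.append("%s %s" % (chembl_id, smiles))
    return lines
```

Calling print("\n".join(report(sdf_path, fps_path, db_path))) with the paths of the three downloaded files (the names depend on the release you fetched) would print a report in the same shape as the one above. Whether the counts match will depend on how well the SD and FPS files are synchronized in that release.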
Re: [Rdkit-discuss] list of failed chembl ids
I looked up a bunch of these. The ones I saw are ChEMBL activity records, not molecule records, so they do not contain structural data.

But I would be curious to see the 51 CHEMBL SMILES that RDKit could not parse.

-P.

On Tue, Aug 8, 2017 at 3:00 PM, Bennion, Brian wrote:
> Hello,
>
> If anyone is interested, the list of chembl ids for compounds that had
> such crazy 2D sd files are listed below. Several are just different
> formulations of the same parent compound.
>
> 181880
> 450200
> 1198593
> 1201364
> 1977677
> 1992520
> 2146259
> 2146289
> 2146290
> 2299271
> 3182693
> 3184182
> 3187332
> 3188868
> 3187972
> 3211150
> 3349005
> 3348969
> 3833021
> 3397072
> 3544677
> 3561635
> 3593577
> 3594279
> 3580437
> 3558859
> 3558860
> 3558861
> 3832893
> 3832892
> 3832897
>
> ___
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss