Re: [Rdkit-discuss] list of failed chembl ids
Thank you Andrew for the explanation. I was just commenting to my summer intern that you might weigh in.

Brian

From: Andrew Dalke [mailto:da...@dalkescientific.com]
Sent: Tuesday, August 08, 2017 15:21
To: RDKit Discuss (rdkit-discuss@lists.sourceforge.net)
Subject: Re: [Rdkit-discuss] list of failed chembl ids
Re: [Rdkit-discuss] list of failed chembl ids
On Aug 8, 2017, at 22:20, Peter S. Shenkin wrote:
> But I would be curious to see the 51 CHEMBL SMILES that RDKit could not parse.

As of ChEMBL 23, the following files are available:

  - the sdf.gz file
  - pre-computed RDKit Morgan fingerprints in fps.gz format
  - the database available as an SQLite file

I downloaded those three files, de-tar-gz'ed the SQLite database, and did the following:

  1) get the ids from the .sdf.gz file
  2) get the ids from the .fps.gz file
  3) Find the ids which are only in the .sdf.gz file
  4) For each id, find its canonical SMILES in the SQLite file
  5) Print the list of ids

(I also checked that there were no ids in the FPS file which weren't in the SDF.)

Here are the SMILES for the 54 structures that method found. (Note: this isn't 51. I know the SD and FPS files are not guaranteed to be perfectly synchronized, so perhaps that's the source of the difference?)

Only in .fps: 0 ids
Only in .sdf: 54 ids
CHEMBL1198593 COc1cc(ccc1N2=N(N=C(N2)c3ccc(cc3)[N+](=O)[O-])c4ccc(cc4)[N+](=O)[O-])c5ccc(c(OC)c5)N6=N(NC(=N6)c7ccc(cc7)[N+](=O)[O-])c8ccc(cc8)[N+](=O)[O-]
CHEMBL1201364 O[C@H]1[C@@H](O)[C@@H](O[C@@H]1COP(=O)(O)O)N2=CNc3c(S)ncnc23
CHEMBL1684167 [Te](Cl)(Cl)c1c1COC
CHEMBL1684168 [Te](Cl)(Cl)c1c1[C@H](C)OC
CHEMBL1684169 [Te](Cl)(Cl)c1c1[C@@H](C)OC
CHEMBL1684170 [Te](Br)(Br)c1c1COC
CHEMBL1684171 [Te](Br)(Br)c1c1[C@H](C)OC
CHEMBL1684172 [Te](Br)(Br)c1c1[C@@H](C)OC
CHEMBL178180 COc1ccc(cc1)[Te](Cl)(Cl)\C(=C\Cl)\C(C)(C)O
CHEMBL179159 COc1ccc(cc1)[Te]2(Cl)OC3(CC3)/C/2=C\Cl
CHEMBL180156 COc1ccc(cc1)[Te](Cl)(Cl)\C=C(/Cl)\c2c2
CHEMBL180355 COc1c1C(=O)\C=C(\c2c2OC)/[Te](Cl)(Cl)Cl
CHEMBL180844 COc1ccc(cc1)[Te]2(Cl)OC3(C3)/C/2=C\Cl
CHEMBL181211 OC(C\C(=C/Cl)\[Te](Cl)(Cl)Cl)c1c1
CHEMBL181880 F[As-](F)(F)(F)(F)F
CHEMBL1972162 CC(C)(C)c1cc2c3c(c1)C(O[Te]3(C)OC2(C(F)(F)F)C(F)(F)F)(C(F)(F)F)C(F)(F)F
CHEMBL1977677 CC(Br)C(=O)N=N1=C2C(=Nc3c13)c452c45
CHEMBL1992123 CC1(O)C(C)(O)C2(C)O[Te]3(OC4(C)C(C)(O)C(C)(O)C4(C)O3)OC12C
CHEMBL1992520 CCN1\C(=C\C#C\C(=C/c2sc3c3[n+]2CC)\C)\Sc4c14.[F-][PH2+5]([F-])([F-])([F-])([F-])[F-]
CHEMBL1998318 CC12O[Te]34OC(C)(C1(C)O3)C2(C)O4
CHEMBL2097021 O[Te](=O)(=O)O
CHEMBL2146197 [Cl-].CC[N+](CC)(CC)Cc1c1.ClC2=C[Te](Cl)(Cl)OC2
CHEMBL2146209 [Cl-].Cl[Te]1(Cl)OCCO1
CHEMBL2146259 N.[Cl-].[Cl-].[Cl-].C1C[O-][Te+4][O-]1
CHEMBL2146289 N.[Cl-].[Cl-].[Cl-].C1C[O-][Te+4][O-]1
CHEMBL2146290 N.[Cl-].[Cl-].[Cl-].CCC1C[O-][Te+4][O-]1
CHEMBL2299271 CN1C=NNC1(=S)c2sc3nnc(c4c4)c(c5c5)c3c2O
CHEMBL3182693 [NH4+].[NH4+].F[Si-2](F)(F)(F)(F)F
CHEMBL3184182 [Na+].[Na+].F[Si-2](F)(F)(F)(F)F
CHEMBL3187332 CC(=O)OCC(NC(=O)C(CC1=C2=CC=CC=C2N=C1)NC(=O)OC(C)(C)C)C3OC(C(OC(=O)C)C3OC(=O)C)N4C=C(C)C(=O)NC4=O
CHEMBL3187972 CNc1ccc(cc1)C(=O)Oc2cc(ON=[N](O)N(C)C)c(cc2C#N)[N+](=O)[O-]
CHEMBL3188868 CN(C)[N](=NOc1cc(ON=[N+]([O-])N2CCN(CC2)C(=O)c3cc(CC4=NNC(=O)c5c45)ccc3F)c(cc1[N+](=O)[O-])[N+](=O)[O-])O
CHEMBL3211150 CCC1N1C(=O)N2=NC(=CN2)C(O)(c3c3)c4c4
CHEMBL3348969 CSCC[C@H](NC(=O)[C@H](CC1=CN=C2=CC=CC=C12)NC(=O)CCNC(=O)OC(C)(C)C)C(=O)N[C@@H](CC(=O)O)C(=O)N[C@@H](Cc3c3)C(=O)N
CHEMBL3349005 C[C@@H](O)[C@@H](CO)NC(=O)[C@@H]1CSSC[C@H](NC(=O)[C@H](N)Cc2c2)C(=O)N[C@@H](Cc3c3)C(=O)N[C@H](CC4=CN=C5=CC=CC=C45)C(=O)N[C@@H](N)C(=O)N[C@@H](C(C)O)C(=O)N1
CHEMBL3392104 [NH4+].[Cl-].Cl[Te]1(Cl)OCCO1
CHEMBL3397072 FC1=Fc2c(C=C1)[nH]cc2C3CCN(N4C(=O)N5C=CC=CC5=C(C4=O)c6ccc(F)cc6)CC3
CHEMBL3544677 CN(Cc1cnc2nc(N)nc(N)c2n1)c3ccc(cc3)C(=O)N[C@@H](CCC(=O)N[C@@H](CCC(=O)O)C(=O)O)C(=O)O
CHEMBL3546168 Cl[Te]1(Cl)OCCO1
CHEMBL3558859 C1C[O-][Te+4][O-]1
CHEMBL3558860 C1C[O-][Te+4][O-]1
CHEMBL3558861 CCC1C[O-][Te+4][O-]1
CHEMBL3559384 CC[N+](CC)(CC)Cc1c1.ClC2=C[Te](Cl)(Cl)OC2
CHEMBL3561635 O.O.O.O.O=C1O[Mg]2(OC(=O)c3c3O2)Oc4c14
CHEMBL3580437 O=C1O[Mg]2(OC(=O)c3c3O2)Oc4c14
CHEMBL3593577 CN1C(=O)NC2=CN3(=C4NC=CC4=C12)C(C3)N5C(=O)Nc6cnc7[nH]ccc7c56
CHEMBL3594279 C[C@H]1O[C@H](C[C@H](O)[C@@H]1O)O[C@H]2[C@@H](O)C[C@H](O[C@H]3[C@@H](O)C[C@H](O[C@@H](C)C[C@H]4CC[C@@H]5[C@H](C[C@@H](O)[C@]6(C)[C@H](CC[C@]56O)C7=CC(=O)O/C/7=C\c8ccc(cc8)N(C)C)[C@@H]4C)O[C@@H]3C)O[C@@H]2C
CHEMBL361437 COc1ccc(cc1)[Te]2(Cl)OC3(C3)/C/2=C\Br
CHEMBL3832892 O.CCN1C(=O)c23c(ccc(C1=O)c23)N4C=C5CN67CCCN8CCN9%10CCCN(CC6)[Cu]789(N%11=NN(C=C%11C%10)c%12ccc%13C(=O)N(CC)C(=O)c%14%12c%13%14)N5=N4.[O-]Cl(=O)(=O)=O.[O-]Cl(=O)(=O)=O
CHEMBL3832893 O.CCN1C(=O)c23c(ccc(C1=O)c23)N4C=C5CN67CCCN8CCN9%10CCCN(CC6)[Zn]789(N%11=NN(C=C%11C%10)c%12ccc%13C(=O)N(CC)C(=O)c%14%12c%13%14)N5=N4.[O-]Cl(=O)(=O)=O.[O-]Cl(=O)(=O)=O
CHEMBL3832897 CCN1C(=O)c23c(ccc(C1=O)c23)N4C=C5CN67CCCN8CCN9%10CCCN(CC6)[Zn]789(N%11=NN(C=C%11C%10)c%12ccc%13C(=O)N(CC)C(=O)c%14%12c%13%14)N5=N4
CHEMBL3833021 CCN1C(=O)c23c(ccc(C1=O)c23)
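
The five steps described in the message above could be sketched roughly as follows. This is not the script that was actually used, which was not posted. It assumes that each SD record carries its id in a `> <chembl_id>` data item, that the FPS file follows the usual layout of `#` header lines followed by tab-separated fingerprint and id, and that the SQLite file has `molecule_dictionary` (`chembl_id`, `molregno`) and `compound_structures` (`molregno`, `canonical_smiles`) tables. The function names are made up for illustration.

```python
import gzip
import sqlite3


def ids_from_sdf(path):
    """Step 1: collect ids from a gzipped SD file.

    Assumption: each record stores its id on the line after a
    '> <chembl_id>' data header.
    """
    ids = set()
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        want_value = False
        for line in f:
            if want_value:
                ids.add(line.strip())
                want_value = False
            elif line.startswith(">") and "<chembl_id>" in line:
                want_value = True
    return ids


def ids_from_fps(path):
    """Step 2: collect ids from a gzipped FPS file.

    Header lines start with '#'; data lines are 'hex_fingerprint<TAB>id'.
    """
    ids = set()
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.startswith("#") or not line.strip():
                continue
            ids.add(line.rstrip("\n").split("\t")[1])
    return ids


def smiles_for(db_path, chembl_ids):
    """Step 4: look up the canonical SMILES for each id (None if absent).

    Assumption: table and column names as described in the lead-in.
    """
    query = (
        "SELECT cs.canonical_smiles "
        "FROM molecule_dictionary AS md "
        "JOIN compound_structures AS cs ON cs.molregno = md.molregno "
        "WHERE md.chembl_id = ?"
    )
    conn = sqlite3.connect(db_path)
    try:
        found = {}
        for chembl_id in chembl_ids:
            row = conn.execute(query, (chembl_id,)).fetchone()
            found[chembl_id] = row[0] if row else None
        return found
    finally:
        conn.close()


def report(sdf_path, fps_path, db_path):
    """Steps 3 and 5: diff the two id sets and build the report lines."""
    sdf_ids = ids_from_sdf(sdf_path)
    fps_ids = ids_from_fps(fps_path)
    only_fps = fps_ids - sdf_ids
    only_sdf = sdf_ids - fps_ids
    lines = [
        f"Only in .fps: {len(only_fps)} ids",
        f"Only in .sdf: {len(only_sdf)} ids",
    ]
    smiles = smiles_for(db_path, only_sdf)
    # Plain string sort, which appears to be the order of the listing above.
    for chembl_id in sorted(only_sdf):
        lines.append(f"{chembl_id} {smiles[chembl_id]}")
    return lines
```

Calling `report()` with the paths of the three downloaded files and printing each returned line would give output in the same shape as the listing above. Whether it reproduces the same 54 ids depends on the assumptions about the file layouts holding for the ChEMBL 23 release.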
Re: [Rdkit-discuss] list of failed chembl ids
I looked up a bunch of these. The ones I saw are ChEMBL activity records, not molecule records, so they do not contain structural data.

But I would be curious to see the 51 CHEMBL SMILES that RDKit could not parse.

-P.

On Tue, Aug 8, 2017 at 3:00 PM, Bennion, Brian wrote:

> Hello,
>
> If anyone is interested, the chembl ids for compounds that had
> such crazy 2D sd files are listed below. Several are just different
> formulations of the same parent compound.
>
> 181880
> 450200
> 1198593
> 1201364
> 1977677
> 1992520
> 2146259
> 2146289
> 2146290
> 2299271
> 3182693
> 3184182
> 3187332
> 3188868
> 3187972
> 3211150
> 3349005
> 3348969
> 3833021
> 3397072
> 3544677
> 3561635
> 3593577
> 3594279
> 3580437
> 3558859
> 3558860
> 3558861
> 3832893
> 3832892
> 3832897

--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
[Rdkit-discuss] list of failed chembl ids
Hello,

If anyone is interested, the chembl ids for compounds that had such crazy 2D sd files are listed below. Several are just different formulations of the same parent compound.

181880
450200
1198593
1201364
1977677
1992520
2146259
2146289
2146290
2299271
3182693
3184182
3187332
3188868
3187972
3211150
3349005
3348969
3833021
3397072
3544677
3561635
3593577
3594279
3580437
3558859
3558860
3558861
3832893
3832892
3832897