Re: [Rdkit-discuss] list of failed chembl ids

2017-08-08 Thread Bennion, Brian
Thank you Andrew for the explanation.  I was just commenting to my summer 
intern that you might weigh in.
Brian

From: Andrew Dalke [mailto:da...@dalkescientific.com]
Sent: Tuesday, August 08, 2017 15:21
To: RDKit Discuss (rdkit-discuss@lists.sourceforge.net) 
<rdkit-discuss@lists.sourceforge.net>
Subject: Re: [Rdkit-discuss] list of failed chembl ids


Re: [Rdkit-discuss] list of failed chembl ids

2017-08-08 Thread Andrew Dalke
On Aug 8, 2017, at 22:20, Peter S. Shenkin wrote:
> But I would be curious to see the 51 CHEMBL SMILES that RDKit could not parse.

As of ChEMBL 23, the following files are available:
  - the sdf.gz file
  - pre-computed RDKit Morgan fingerprints in fps.gz format
  - the database available as an SQLite file

I downloaded those three files, de-tar-gz'ed the SQLite database, and did the 
following:

 1) Get the ids from the .sdf.gz file
 2) Get the ids from the .fps.gz file
 3) Find the ids which are only in the .sdf.gz file
 4) For each id, find its canonical SMILES in the SQLite file
 5) Print the list of ids and their SMILES
(I also checked that there were no ids in the FPS file which weren't in the 
SDF.)
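
For anyone who wants to repeat this, the steps above can be sketched roughly as 
follows. This is a minimal sketch and not the script actually used. It assumes 
that each SD record carries its id in a "> <chembl_id>" data item, that FPS 
data lines are "hex fingerprint, tab, id", and that the SQLite file has 
molecule_dictionary and compound_structures tables joined on molregno. The 
function names (sdf_ids, fps_ids, lookup_smiles, report, main) are made up for 
this example, and the file paths are whatever you downloaded.

```python
import gzip
import sqlite3


def sdf_ids(lines, tag="chembl_id"):
    """Yield the value of the `> <tag>` data item from each SD record.

    Assumption: the id is stored in a data item with this tag name.
    """
    want_value = False
    for line in lines:
        line = line.rstrip("\r\n")
        if want_value:
            yield line.strip()
            want_value = False
        elif line.startswith(">") and "<%s>" % tag in line:
            want_value = True


def fps_ids(lines):
    """Yield the id column from FPS data lines (hex fingerprint, tab, id)."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        yield line.rstrip("\r\n").split("\t")[1]


def lookup_smiles(conn, chembl_id):
    """Return the canonical SMILES for one id, or None if none is stored.

    Assumption: table and column names follow the ChEMBL schema.
    """
    row = conn.execute(
        "SELECT cs.canonical_smiles"
        "  FROM molecule_dictionary md"
        "  JOIN compound_structures cs ON cs.molregno = md.molregno"
        " WHERE md.chembl_id = ?", (chembl_id,)).fetchone()
    return row[0] if row else None


def report(sdf_lines, fps_lines, conn):
    """Return (ids only in FPS, [(id, smiles), ...] for ids only in SDF)."""
    in_sdf = set(sdf_ids(sdf_lines))
    in_fps = set(fps_ids(fps_lines))
    only_sdf = sorted(in_sdf - in_fps)
    only_fps = sorted(in_fps - in_sdf)
    return only_fps, [(id_, lookup_smiles(conn, id_)) for id_ in only_sdf]


def main(sdf_path, fps_path, sqlite_path):
    # The paths are placeholders for the three downloaded files.
    with gzip.open(sdf_path, "rt") as sdf, gzip.open(fps_path, "rt") as fps:
        conn = sqlite3.connect(sqlite_path)
        only_fps, only_sdf = report(sdf, fps, conn)
        conn.close()
    print("Only in .fps: %d ids" % len(only_fps))
    print("Only in .sdf: %d ids" % len(only_sdf))
    for id_, smiles in only_sdf:
        print("  ", id_, smiles)
```

The ids are sorted as strings, which is why CHEMBL178180 comes after 
CHEMBL1684172 in a listing like the one below.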

Here are the SMILES for the 54 structures that method found (Note: this isn't 
51. I know the SD and FPS files are not guaranteed to be perfectly 
synchronized, so perhaps that's the source of the difference?)

Only in .fps: 0 ids
Only in .sdf: 54 ids
   CHEMBL1198593 
COc1cc(ccc1N2=N(N=C(N2)c3ccc(cc3)[N+](=O)[O-])c4ccc(cc4)[N+](=O)[O-])c5ccc(c(OC)c5)N6=N(NC(=N6)c7ccc(cc7)[N+](=O)[O-])c8ccc(cc8)[N+](=O)[O-]
   CHEMBL1201364 O[C@H]1[C@@H](O)[C@@H](O[C@@H]1COP(=O)(O)O)N2=CNc3c(S)ncnc23
   CHEMBL1684167 [Te](Cl)(Cl)c1c1COC
   CHEMBL1684168 [Te](Cl)(Cl)c1c1[C@H](C)OC
   CHEMBL1684169 [Te](Cl)(Cl)c1c1[C@@H](C)OC
   CHEMBL1684170 [Te](Br)(Br)c1c1COC
   CHEMBL1684171 [Te](Br)(Br)c1c1[C@H](C)OC
   CHEMBL1684172 [Te](Br)(Br)c1c1[C@@H](C)OC
   CHEMBL178180 COc1ccc(cc1)[Te](Cl)(Cl)\C(=C\Cl)\C(C)(C)O
   CHEMBL179159 COc1ccc(cc1)[Te]2(Cl)OC3(CC3)/C/2=C\Cl
   CHEMBL180156 COc1ccc(cc1)[Te](Cl)(Cl)\C=C(/Cl)\c2c2
   CHEMBL180355 COc1c1C(=O)\C=C(\c2c2OC)/[Te](Cl)(Cl)Cl
   CHEMBL180844 COc1ccc(cc1)[Te]2(Cl)OC3(C3)/C/2=C\Cl
   CHEMBL181211 OC(C\C(=C/Cl)\[Te](Cl)(Cl)Cl)c1c1
   CHEMBL181880 F[As-](F)(F)(F)(F)F
   CHEMBL1972162 
CC(C)(C)c1cc2c3c(c1)C(O[Te]3(C)OC2(C(F)(F)F)C(F)(F)F)(C(F)(F)F)C(F)(F)F
   CHEMBL1977677 CC(Br)C(=O)N=N1=C2C(=Nc3c13)c452c45
   CHEMBL1992123 CC1(O)C(C)(O)C2(C)O[Te]3(OC4(C)C(C)(O)C(C)(O)C4(C)O3)OC12C
   CHEMBL1992520 
CCN1\C(=C\C#C\C(=C/c2sc3c3[n+]2CC)\C)\Sc4c14.[F-][PH2+5]([F-])([F-])([F-])([F-])[F-]
   CHEMBL1998318 CC12O[Te]34OC(C)(C1(C)O3)C2(C)O4
   CHEMBL2097021 O[Te](=O)(=O)O
   CHEMBL2146197 [Cl-].CC[N+](CC)(CC)Cc1c1.ClC2=C[Te](Cl)(Cl)OC2
   CHEMBL2146209 [Cl-].Cl[Te]1(Cl)OCCO1
   CHEMBL2146259 N.[Cl-].[Cl-].[Cl-].C1C[O-][Te+4][O-]1
   CHEMBL2146289 N.[Cl-].[Cl-].[Cl-].C1C[O-][Te+4][O-]1
   CHEMBL2146290 N.[Cl-].[Cl-].[Cl-].CCC1C[O-][Te+4][O-]1
   CHEMBL2299271 CN1C=NNC1(=S)c2sc3nnc(c4c4)c(c5c5)c3c2O
   CHEMBL3182693 [NH4+].[NH4+].F[Si-2](F)(F)(F)(F)F
   CHEMBL3184182 [Na+].[Na+].F[Si-2](F)(F)(F)(F)F
   CHEMBL3187332 
CC(=O)OCC(NC(=O)C(CC1=C2=CC=CC=C2N=C1)NC(=O)OC(C)(C)C)C3OC(C(OC(=O)C)C3OC(=O)C)N4C=C(C)C(=O)NC4=O
   CHEMBL3187972 CNc1ccc(cc1)C(=O)Oc2cc(ON=[N](O)N(C)C)c(cc2C#N)[N+](=O)[O-]
   CHEMBL3188868 
CN(C)[N](=NOc1cc(ON=[N+]([O-])N2CCN(CC2)C(=O)c3cc(CC4=NNC(=O)c5c45)ccc3F)c(cc1[N+](=O)[O-])[N+](=O)[O-])O
   CHEMBL3211150 CCC1N1C(=O)N2=NC(=CN2)C(O)(c3c3)c4c4
   CHEMBL3348969 
CSCC[C@H](NC(=O)[C@H](CC1=CN=C2=CC=CC=C12)NC(=O)CCNC(=O)OC(C)(C)C)C(=O)N[C@@H](CC(=O)O)C(=O)N[C@@H](Cc3c3)C(=O)N
   CHEMBL3349005 
C[C@@H](O)[C@@H](CO)NC(=O)[C@@H]1CSSC[C@H](NC(=O)[C@H](N)Cc2c2)C(=O)N[C@@H](Cc3c3)C(=O)N[C@H](CC4=CN=C5=CC=CC=C45)C(=O)N[C@@H](N)C(=O)N[C@@H](C(C)O)C(=O)N1
   CHEMBL3392104 [NH4+].[Cl-].Cl[Te]1(Cl)OCCO1
   CHEMBL3397072 
FC1=Fc2c(C=C1)[nH]cc2C3CCN(N4C(=O)N5C=CC=CC5=C(C4=O)c6ccc(F)cc6)CC3
   CHEMBL3544677 
CN(Cc1cnc2nc(N)nc(N)c2n1)c3ccc(cc3)C(=O)N[C@@H](CCC(=O)N[C@@H](CCC(=O)O)C(=O)O)C(=O)O
   CHEMBL3546168 Cl[Te]1(Cl)OCCO1
   CHEMBL3558859 C1C[O-][Te+4][O-]1
   CHEMBL3558860 C1C[O-][Te+4][O-]1
   CHEMBL3558861 CCC1C[O-][Te+4][O-]1
   CHEMBL3559384 CC[N+](CC)(CC)Cc1c1.ClC2=C[Te](Cl)(Cl)OC2
   CHEMBL3561635 O.O.O.O.O=C1O[Mg]2(OC(=O)c3c3O2)Oc4c14
   CHEMBL3580437 O=C1O[Mg]2(OC(=O)c3c3O2)Oc4c14
   CHEMBL3593577 
CN1C(=O)NC2=CN3(=C4NC=CC4=C12)C(C3)N5C(=O)Nc6cnc7[nH]ccc7c56
   CHEMBL3594279 
C[C@H]1O[C@H](C[C@H](O)[C@@H]1O)O[C@H]2[C@@H](O)C[C@H](O[C@H]3[C@@H](O)C[C@H](O[C@@H](C)C[C@H]4CC[C@@H]5[C@H](C[C@@H](O)[C@]6(C)[C@H](CC[C@]56O)C7=CC(=O)O/C/7=C\c8ccc(cc8)N(C)C)[C@@H]4C)O[C@@H]3C)O[C@@H]2C
   CHEMBL361437 COc1ccc(cc1)[Te]2(Cl)OC3(C3)/C/2=C\Br
   CHEMBL3832892 
O.CCN1C(=O)c23c(ccc(C1=O)c23)N4C=C5CN67CCCN8CCN9%10CCCN(CC6)[Cu]789(N%11=NN(C=C%11C%10)c%12ccc%13C(=O)N(CC)C(=O)c%14%12c%13%14)N5=N4.[O-]Cl(=O)(=O)=O.[O-]Cl(=O)(=O)=O
   CHEMBL3832893 
O.CCN1C(=O)c23c(ccc(C1=O)c23)N4C=C5CN67CCCN8CCN9%10CCCN(CC6)[Zn]789(N%11=NN(C=C%11C%10)c%12ccc%13C(=O)N(CC)C(=O)c%14%12c%13%14)N5=N4.[O-]Cl(=O)(=O)=O.[O-]Cl(=O)(=O)=O
   CHEMBL3832897 
CCN1C(=O)c23c(ccc(C1=O)c23)N4C=C5CN67CCCN8CCN9%10CCCN(CC6)[Zn]789(N%11=NN(C=C%11C%10)c%12ccc%13C(=O)N(CC)C(=O)c%14%12c%13%14)N5=N4
   CHEMBL3833021 

Re: [Rdkit-discuss] list of failed chembl ids

2017-08-08 Thread Peter S. Shenkin
I looked up a bunch of these. The ones I saw are ChEMBL activity records,
not molecule records, so they do not contain structural data.

But I would be curious to see the 51 CHEMBL SMILES that RDKit could not
parse.

-P.



On Tue, Aug 8, 2017 at 3:00 PM, Bennion, Brian wrote:

> Hello,
>
> If anyone is interested, the ChEMBL ids for compounds that had such
> crazy 2D SD files are listed below. Several are just different
> formulations of the same parent compound.
>
> 181880
> 450200
> 1198593
> 1201364
> 1977677
> 1992520
> 2146259
> 2146289
> 2146290
> 2299271
> 3182693
> 3184182
> 3187332
> 3188868
> 3187972
> 3211150
> 3349005
> 3348969
> 3833021
> 3397072
> 3544677
> 3561635
> 3593577
> 3594279
> 3580437
> 3558859
> 3558860
> 3558861
> 3832893
> 3832892
> 3832897
> 
> --
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
> ___
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
>