On Aug 8, 2017, at 22:20, Peter S. Shenkin <shen...@gmail.com> wrote:
> But I would be curious to see the 51 CHEMBL SMILES that RDKit could not parse.

As of ChEMBL 23, the following files are available:
  - the sdf.gz file
  - pre-computed RDKit Morgan fingerprints in fps.gz format
  - the database available as an SQLite file

I downloaded those three files, de-tar-gz'ed the SQLite database, and did the 
following:

 1) Get the ids from the .sdf.gz file
 2) Get the ids from the .fps.gz file
 3) Find the ids which are only in the .sdf.gz file
 4) For each id, find its canonical SMILES in the SQLite file
 5) Print each id with its SMILES
(I also checked that there were no ids in the FPS file which weren't in the 
SDF.)
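
Steps 3 and 4 can be tried out on toy data with something like the 
following. This is only a sketch: only_in_sdf, lookup_smiles and make_toy_db 
are hypothetical helper names, the in-memory tables carry just the columns 
the query needs, and the molregno values and the TOY_SHARED id are made up.

```python
import sqlite3

def only_in_sdf(sdf_ids, fps_ids):
    # Step 3: ids present in the SD file but missing from the FPS file.
    return sorted(set(sdf_ids) - set(fps_ids))

def lookup_smiles(db, chembl_id):
    # Step 4: join the two ChEMBL tables on molregno.
    # Returns None when the id has no structure row.
    row = db.execute(
        "SELECT cs.canonical_smiles "
        "FROM molecule_dictionary AS md "
        "JOIN compound_structures AS cs ON md.molregno = cs.molregno "
        "WHERE md.chembl_id = ?",
        (chembl_id,)).fetchone()
    return None if row is None else row[0]

def make_toy_db():
    # Toy stand-in for the ChEMBL SQLite file (molregno values are invented).
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE molecule_dictionary "
               "(molregno INTEGER, chembl_id TEXT)")
    db.execute("CREATE TABLE compound_structures "
               "(molregno INTEGER, canonical_smiles TEXT)")
    db.executemany("INSERT INTO molecule_dictionary VALUES (?, ?)",
                   [(1, "CHEMBL2097021"), (2, "CHEMBL181880")])
    db.executemany("INSERT INTO compound_structures VALUES (?, ?)",
                   [(1, "O[Te](=O)(=O)O"), (2, "F[As-](F)(F)(F)(F)F")])
    return db

db = make_toy_db()
sdf_ids = {"CHEMBL2097021", "CHEMBL181880", "TOY_SHARED"}
fps_ids = {"TOY_SHARED"}
for chembl_id in only_in_sdf(sdf_ids, fps_ids):
    print("  ", chembl_id, lookup_smiles(db, chembl_id))
```

The reverse check (ids only in the FPS file) is the same set difference with 
the arguments swapped.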

Here are the SMILES for the 54 structures this method found. (Note: this 
isn't 51. The SD and FPS files are not guaranteed to be perfectly 
synchronized, so perhaps that's the source of the difference.)

Only in .fps: 0 ids
Only in .sdf: 54 ids
   CHEMBL1198593 
COc1cc(ccc1N2=N(N=C(N2)c3ccc(cc3)[N+](=O)[O-])c4ccc(cc4)[N+](=O)[O-])c5ccc(c(OC)c5)N6=N(NC(=N6)c7ccc(cc7)[N+](=O)[O-])c8ccc(cc8)[N+](=O)[O-]
   CHEMBL1201364 O[C@H]1[C@@H](O)[C@@H](O[C@@H]1COP(=O)(O)O)N2=CNc3c(S)ncnc23
   CHEMBL1684167 CCCC[Te](Cl)(Cl)c1ccccc1COC
   CHEMBL1684168 CCCC[Te](Cl)(Cl)c1ccccc1[C@H](C)OC
   CHEMBL1684169 CCCC[Te](Cl)(Cl)c1ccccc1[C@@H](C)OC
   CHEMBL1684170 CCCC[Te](Br)(Br)c1ccccc1COC
   CHEMBL1684171 CCCC[Te](Br)(Br)c1ccccc1[C@H](C)OC
   CHEMBL1684172 CCCC[Te](Br)(Br)c1ccccc1[C@@H](C)OC
   CHEMBL178180 COc1ccc(cc1)[Te](Cl)(Cl)\C(=C\Cl)\C(C)(C)O
   CHEMBL179159 COc1ccc(cc1)[Te]2(Cl)OC3(CCCCCC3)/C/2=C\Cl
   CHEMBL180156 COc1ccc(cc1)[Te](Cl)(Cl)\C=C(/Cl)\c2ccccc2
   CHEMBL180355 COc1ccccc1C(=O)\C=C(\c2ccccc2OC)/[Te](Cl)(Cl)Cl
   CHEMBL180844 COc1ccc(cc1)[Te]2(Cl)OC3(CCCCC3)/C/2=C\Cl
   CHEMBL181211 OC(C\C(=C/Cl)\[Te](Cl)(Cl)Cl)c1ccccc1
   CHEMBL181880 F[As-](F)(F)(F)(F)F
   CHEMBL1972162 
CC(C)(C)c1cc2c3c(c1)C(O[Te]3(C)OC2(C(F)(F)F)C(F)(F)F)(C(F)(F)F)C(F)(F)F
   CHEMBL1977677 CC(Br)C(=O)N=N1=C2C(=Nc3ccccc13)c4cccc5cccc2c45
   CHEMBL1992123 CC1(O)C(C)(O)C2(C)O[Te]3(OC4(C)C(C)(O)C(C)(O)C4(C)O3)OC12C
   CHEMBL1992520 
CCN1\C(=C\C#C\C(=C/c2sc3ccccc3[n+]2CC)\C)\Sc4ccccc14.[F-][PH2+5]([F-])([F-])([F-])([F-])[F-]
   CHEMBL1998318 CC12O[Te]34OC(C)(C1(C)O3)C2(C)O4
   CHEMBL2097021 O[Te](=O)(=O)O
   CHEMBL2146197 [Cl-].CC[N+](CC)(CC)Cc1ccccc1.ClC2=C[Te](Cl)(Cl)OC2
   CHEMBL2146209 [Cl-].Cl[Te]1(Cl)OCCO1
   CHEMBL2146259 N.[Cl-].[Cl-].[Cl-].C1C[O-][Te+4][O-]1
   CHEMBL2146289 N.[Cl-].[Cl-].[Cl-].CCCCC1C[O-][Te+4][O-]1
   CHEMBL2146290 N.[Cl-].[Cl-].[Cl-].CCCCCCC1C[O-][Te+4][O-]1
   CHEMBL2299271 CN1C=NNC1(=S)c2sc3nnc(c4ccccc4)c(c5ccccc5)c3c2O
   CHEMBL3182693 [NH4+].[NH4+].F[Si-2](F)(F)(F)(F)F
   CHEMBL3184182 [Na+].[Na+].F[Si-2](F)(F)(F)(F)F
   CHEMBL3187332 
CC(=O)OCC(NC(=O)C(CC1=C2=CC=CC=C2N=C1)NC(=O)OC(C)(C)C)C3OC(C(OC(=O)C)C3OC(=O)C)N4C=C(C)C(=O)NC4=O
   CHEMBL3187972 CNc1ccc(cc1)C(=O)Oc2cc(ON=[N](O)N(C)C)c(cc2C#N)[N+](=O)[O-]
   CHEMBL3188868 
CN(C)[N](=NOc1cc(ON=[N+]([O-])N2CCN(CC2)C(=O)c3cc(CC4=NNC(=O)c5ccccc45)ccc3F)c(cc1[N+](=O)[O-])[N+](=O)[O-])O
   CHEMBL3211150 CCC1CCCCN1C(=O)N2=NC(=CN2)C(O)(c3ccccc3)c4ccccc4
   CHEMBL3348969 
CSCC[C@H](NC(=O)[C@H](CC1=CN=C2=CC=CC=C12)NC(=O)CCNC(=O)OC(C)(C)C)C(=O)N[C@@H](CC(=O)O)C(=O)N[C@@H](Cc3ccccc3)C(=O)N
   CHEMBL3349005 
C[C@@H](O)[C@@H](CO)NC(=O)[C@@H]1CSSC[C@H](NC(=O)[C@H](N)Cc2ccccc2)C(=O)N[C@@H](Cc3ccccc3)C(=O)N[C@H](CC4=CN=C5=CC=CC=C45)C(=O)N[C@@H](CCCCN)C(=O)N[C@@H](C(C)O)C(=O)N1
   CHEMBL3392104 [NH4+].[Cl-].Cl[Te]1(Cl)OCCO1
   CHEMBL3397072 
FC1=Fc2c(C=C1)[nH]cc2C3CCN(CCCCN4C(=O)N5C=CC=CC5=C(C4=O)c6ccc(F)cc6)CC3
   CHEMBL3544677 
CN(Cc1cnc2nc(N)nc(N)c2n1)c3ccc(cc3)C(=O)N[C@@H](CCC(=O)N[C@@H](CCC(=O)O)C(=O)O)C(=O)O
   CHEMBL3546168 Cl[Te]1(Cl)OCCO1
   CHEMBL3558859 C1C[O-][Te+4][O-]1
   CHEMBL3558860 CCCCC1C[O-][Te+4][O-]1
   CHEMBL3558861 CCCCCCC1C[O-][Te+4][O-]1
   CHEMBL3559384 CC[N+](CC)(CC)Cc1ccccc1.ClC2=C[Te](Cl)(Cl)OC2
   CHEMBL3561635 O.O.O.O.O=C1O[Mg]2(OC(=O)c3ccccc3O2)Oc4ccccc14
   CHEMBL3580437 O=C1O[Mg]2(OC(=O)c3ccccc3O2)Oc4ccccc14
   CHEMBL3593577 
CN1C(=O)NC2=CN3(=C4NC=CC4=C12)CCCCC(C3)N5C(=O)Nc6cnc7[nH]ccc7c56
   CHEMBL3594279 
C[C@H]1O[C@H](C[C@H](O)[C@@H]1O)O[C@H]2[C@@H](O)C[C@H](O[C@H]3[C@@H](O)C[C@H](O[C@@H](C)C[C@H]4CC[C@@H]5[C@H](C[C@@H](O)[C@]6(C)[C@H](CC[C@]56O)C7=CC(=O)O/C/7=C\c8ccc(cc8)N(C)C)[C@@H]4C)O[C@@H]3C)O[C@@H]2C
   CHEMBL361437 COc1ccc(cc1)[Te]2(Cl)OC3(CCCCC3)/C/2=C\Br
   CHEMBL3832892 
O.CCN1C(=O)c2cccc3c(ccc(C1=O)c23)N4C=C5CN67CCCN8CCN9%10CCCN(CC6)[Cu]789(N%11=NN(C=C%11C%10)c%12ccc%13C(=O)N(CC)C(=O)c%14cccc%12c%13%14)N5=N4.[O-]Cl(=O)(=O)=O.[O-]Cl(=O)(=O)=O
   CHEMBL3832893 
O.CCN1C(=O)c2cccc3c(ccc(C1=O)c23)N4C=C5CN67CCCN8CCN9%10CCCN(CC6)[Zn]789(N%11=NN(C=C%11C%10)c%12ccc%13C(=O)N(CC)C(=O)c%14cccc%12c%13%14)N5=N4.[O-]Cl(=O)(=O)=O.[O-]Cl(=O)(=O)=O
   CHEMBL3832897 
CCN1C(=O)c2cccc3c(ccc(C1=O)c23)N4C=C5CN67CCCN8CCN9%10CCCN(CC6)[Zn]789(N%11=NN(C=C%11C%10)c%12ccc%13C(=O)N(CC)C(=O)c%14cccc%12c%13%14)N5=N4
   CHEMBL3833021 
CCN1C(=O)c2cccc3c(ccc(C1=O)c23)N4C=C5CN67CCCN8CCN9%10CCCN(CC6)[Cu]789(N%11=NN(C=C%11C%10)c%12ccc%13C(=O)N(CC)C(=O)c%14cccc%12c%13%14)N5=N4
   CHEMBL450200 
CC1=NC2=N(=NC=N2)C(=C1)OCCCN(CC(c3ccccc3)c4ccccc4)Cc5cccc(c5Cl)C(F)(F)F
   CHEMBL471869 [Na+].[Na+].[O-][Te](=O)(=O)[O-]

Code is attached.

However, my copy of the RDKit will parse some of these SMILES and the 
corresponding record in the .sdf.gz file, without a problem. For example, 
consider:

   CHEMBL2097021 O[Te](=O)(=O)O

The corresponding record in the .sdf.gz file is:


CHEMBL2097021
  SciTegic12231509382D

  5  4  0  0  0  0            999 V2000
    5.1833   -7.8333    0.0000 Te  0  0
    5.1833   -7.0083    0.0000 O   0  0
    6.0125   -7.8333    0.0000 O   0  0
    5.1833   -8.6583    0.0000 O   0  0
    4.3625   -7.8333    0.0000 O   0  0
  2  1  1  0
  3  1  1  0
  4  1  2  0
  5  1  2  0
M  END
> <chembl_id>
CHEMBL2097021

$$$$
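
One way to see what is unusual about this record is to sum the bond orders 
on each atom. The sketch below does that with plain string slicing on the 
V2000 fixed-width columns. explicit_valences is a hypothetical helper, not an 
RDKit function, and it treats the bond type code as an integer bond order, 
which only holds for single, double and triple bonds.

```python
MOLBLOCK = """\
CHEMBL2097021
  SciTegic12231509382D

  5  4  0  0  0  0            999 V2000
    5.1833   -7.8333    0.0000 Te  0  0
    5.1833   -7.0083    0.0000 O   0  0
    6.0125   -7.8333    0.0000 O   0  0
    5.1833   -8.6583    0.0000 O   0  0
    4.3625   -7.8333    0.0000 O   0  0
  2  1  1  0
  3  1  1  0
  4  1  2  0
  5  1  2  0
M  END
"""

def explicit_valences(molblock):
    # Return [(element symbol, sum of bond orders), ...] in atom order.
    lines = molblock.splitlines()
    counts = lines[3]                 # three header lines, then the counts line
    num_atoms = int(counts[0:3])
    num_bonds = int(counts[3:6])
    atom_lines = lines[4:4 + num_atoms]
    bond_lines = lines[4 + num_atoms:4 + num_atoms + num_bonds]
    # The element symbol is the fourth whitespace-separated field.
    symbols = [line.split()[3] for line in atom_lines]
    totals = [0] * num_atoms
    for line in bond_lines:
        first = int(line[0:3])        # 1-based atom indices
        second = int(line[3:6])
        order = int(line[6:9])        # assumed to be 1, 2 or 3
        totals[first - 1] += order
        totals[second - 1] += order
    return list(zip(symbols, totals))

for symbol, total in explicit_valences(MOLBLOCK):
    print(symbol, total)
```

For this record the tellurium atom comes out with a total bond order of 6 
(two single and two double bonds).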

What's happened is that ChEMBL is using RDKit 2016.03.4 to generate the .fps.gz 
file, and the supported valence states for metals have changed in RDKit since 
then to support this chemistry. See https://github.com/rdkit/rdkit/issues/1403 
for some discussion.
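
To see which of these SMILES a locally installed RDKit accepts, something 
like the following sketch may help. find_unparseable is a hypothetical 
helper; it relies only on Chem.MolFromSmiles returning None when parsing 
fails. The parser is passed in as an argument so the helper can be exercised 
without RDKit, and the two records are copied from the list above.

```python
def find_unparseable(records, parse):
    # Return the (id, smiles) pairs for which parse(smiles) gives None.
    failures = []
    for chembl_id, smiles in records:
        if parse(smiles) is None:
            failures.append((chembl_id, smiles))
    return failures

RECORDS = [
    ("CHEMBL2097021", "O[Te](=O)(=O)O"),
    ("CHEMBL181880", "F[As-](F)(F)(F)(F)F"),
]

if __name__ == "__main__":
    try:
        from rdkit import Chem, rdBase
    except ImportError:
        print("RDKit is not installed; nothing to check")
    else:
        # The result depends on the RDKit version, so none is claimed here.
        print("RDKit version:", rdBase.rdkitVersion)
        for chembl_id, smiles in find_unparseable(RECORDS, Chem.MolFromSmiles):
            print("cannot parse:", chembl_id, smiles)
```

Running the same check under two different RDKit releases would show which 
records are affected by the valence changes.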

Cheers,


                                Andrew
                                da...@dalkescientific.com

from __future__ import print_function

import sys
import gzip
import subprocess
import os
import sqlite3

sdf_filename = "/Users/dalke/cvses/chemfp_benchmark/source_datasets/chembl_23.sdf.gz"
fps_filename = "/Users/dalke/cvses/chemfp_benchmark/source_datasets/chembl_23.fps.gz"
sql_filename = "/Users/dalke/databases/chembl_23/chembl_23_sqlite/chembl_23.db"

_erase = ""
def status(*terms):
    msg = " ".join(map(str, terms))
    global _erase
    if _erase:
        sys.stderr.write(_erase)
    sys.stderr.write(msg)
    _erase = "\r" + (" "*len(msg)) + "\r"
    sys.stderr.flush()

def get_chembl_ids(sdf_filename):
    seen_ids = set()
    p = subprocess.Popen(["zgrep", "--after", "1", "> <chembl_id>", sdf_filename],
                         stdout=subprocess.PIPE)
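    for line in p.stdout:
        # zgrep prints the "> <chembl_id>" tag line, the id on the next line,
        # and a "--" separator between matches; keep only the id lines.
        if line[:1] in (b"-", b">"):
            continue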
        id = line.strip().decode("utf8")
        seen_ids.add(id)
        n = len(seen_ids)
        if n % 10000 == 0:
            status("Read", n, "SDF record ids")
    p.wait()
    
    status("")
    return seen_ids

def get_chembl_fp_ids(fps_filename):
    seen_ids = set()
    with gzip.open(fps_filename) as reader:
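        for line in reader:
            # Skip the FPS header lines, which start with "#".
            if line[:1] == b"#":
                continue
            # Each data line is the hex fingerprint, a tab, then the id.
            fields = line[:-1].split(b"\t")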
            id = fields[1].decode("utf8")
            seen_ids.add(id)
            
            n = len(seen_ids)
            if n % 50000 == 0:
                status("Read", n, "FPS record ids")

    status("")
    return seen_ids

def report(label, ids, db):
    print(label, len(ids), "ids")
    for id in sorted(ids):
        c = db.execute("""
SELECT canonical_smiles FROM molecule_dictionary, compound_structures
  WHERE chembl_id = ?
    AND molecule_dictionary.molregno = compound_structures.molregno""",
    (id,))
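        row = c.fetchone()
        # Guard against an id with no matching structure row in the database.
        canonical_smiles = row[0] if row is not None else None
        print("  ", id, canonical_smiles)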
        
def main():
    if not os.path.exists(sql_filename):
        sys.exit("File %r does not exist" % (sql_filename,))
    db = sqlite3.connect(sql_filename)
    
    fps_ids = get_chembl_fp_ids(fps_filename)
    sdf_ids = get_chembl_ids(sdf_filename)

    report("Only in .fps:", fps_ids - sdf_ids, db)
    report("Only in .sdf:", sdf_ids - fps_ids, db)
               
if __name__ == "__main__":
    main()