On Aug 8, 2017, at 22:20, Peter S. Shenkin <shen...@gmail.com> wrote:
> But I would be curious to see the 51 CHEMBL SMILES that RDKit could not parse.
As of ChEMBL 23, the following files are available:
- the sdf.gz file
- pre-computed RDKit Morgan fingerprints in fps.gz format
- the database available as an SQLite file
I downloaded those three files, de-tar-gz'ed the SQLite database, and did the
following:
1) get the ids from the .sdf.gz file
2) get the ids from the .fps.gz file
3) Find the ids which are only in the .sdf.gz file
4) For each id, find its canonical SMILES in the SQLite file
5) Print each id along with its SMILES
(I also checked that there were no ids in the FPS file which weren't in the
SDF.)
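Steps 3 and 4 come down to a set difference followed by a two-table join. Here is a minimal sketch against an in-memory toy database. The table and column names are the ones used in the attached script, the two Te-containing ids and SMILES are taken from the list in this message, and everything else (the molregno values, the CHEMBL_TOY_1 id, the lookup_smiles helper) is made up for illustration.

```python
import sqlite3

# Toy stand-in for the ChEMBL SQLite file: only the columns the lookup needs.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE molecule_dictionary (molregno INTEGER PRIMARY KEY, chembl_id TEXT);
    CREATE TABLE compound_structures (molregno INTEGER PRIMARY KEY, canonical_smiles TEXT);
    INSERT INTO molecule_dictionary VALUES (1, 'CHEMBL2097021');
    INSERT INTO molecule_dictionary VALUES (2, 'CHEMBL3546168');
    INSERT INTO compound_structures VALUES (1, 'O[Te](=O)(=O)O');
    INSERT INTO compound_structures VALUES (2, 'Cl[Te]1(Cl)OCCO1');
""")


def lookup_smiles(db, chembl_id):
    """Return the canonical SMILES for a ChEMBL id, or None if it is unknown."""
    row = db.execute(
        """
        SELECT cs.canonical_smiles
          FROM molecule_dictionary AS md
          JOIN compound_structures AS cs ON md.molregno = cs.molregno
         WHERE md.chembl_id = ?
        """,
        (chembl_id,),
    ).fetchone()
    return row[0] if row is not None else None


# Made-up id sets standing in for what was read from the two files.
sdf_ids = {"CHEMBL2097021", "CHEMBL3546168", "CHEMBL_TOY_1"}
fps_ids = {"CHEMBL_TOY_1"}

# Step 3: ids present in the SDF but missing from the FPS file.
for chembl_id in sorted(sdf_ids - fps_ids):
    # Step 4: look up the SMILES for each of those ids.
    print(chembl_id, lookup_smiles(db, chembl_id))
```

Returning None for an unknown id is a choice made for this sketch; the attached script assumes every id is present in the database.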
Here are the SMILES for the 54 structures that method found (Note: this isn't
51. I know the SD and FPS files are not guaranteed to be perfectly
synchronized, so perhaps that's the source of the difference?)
Only in .fps: 0 ids
Only in .sdf: 54 ids
CHEMBL1198593 COc1cc(ccc1N2=N(N=C(N2)c3ccc(cc3)[N+](=O)[O-])c4ccc(cc4)[N+](=O)[O-])c5ccc(c(OC)c5)N6=N(NC(=N6)c7ccc(cc7)[N+](=O)[O-])c8ccc(cc8)[N+](=O)[O-]
CHEMBL1201364 O[C@H]1[C@@H](O)[C@@H](O[C@@H]1COP(=O)(O)O)N2=CNc3c(S)ncnc23
CHEMBL1684167 CCCC[Te](Cl)(Cl)c1ccccc1COC
CHEMBL1684168 CCCC[Te](Cl)(Cl)c1ccccc1[C@H](C)OC
CHEMBL1684169 CCCC[Te](Cl)(Cl)c1ccccc1[C@@H](C)OC
CHEMBL1684170 CCCC[Te](Br)(Br)c1ccccc1COC
CHEMBL1684171 CCCC[Te](Br)(Br)c1ccccc1[C@H](C)OC
CHEMBL1684172 CCCC[Te](Br)(Br)c1ccccc1[C@@H](C)OC
CHEMBL178180 COc1ccc(cc1)[Te](Cl)(Cl)\C(=C\Cl)\C(C)(C)O
CHEMBL179159 COc1ccc(cc1)[Te]2(Cl)OC3(CCCCCC3)/C/2=C\Cl
CHEMBL180156 COc1ccc(cc1)[Te](Cl)(Cl)\C=C(/Cl)\c2ccccc2
CHEMBL180355 COc1ccccc1C(=O)\C=C(\c2ccccc2OC)/[Te](Cl)(Cl)Cl
CHEMBL180844 COc1ccc(cc1)[Te]2(Cl)OC3(CCCCC3)/C/2=C\Cl
CHEMBL181211 OC(C\C(=C/Cl)\[Te](Cl)(Cl)Cl)c1ccccc1
CHEMBL181880 F[As-](F)(F)(F)(F)F
CHEMBL1972162 CC(C)(C)c1cc2c3c(c1)C(O[Te]3(C)OC2(C(F)(F)F)C(F)(F)F)(C(F)(F)F)C(F)(F)F
CHEMBL1977677 CC(Br)C(=O)N=N1=C2C(=Nc3ccccc13)c4cccc5cccc2c45
CHEMBL1992123 CC1(O)C(C)(O)C2(C)O[Te]3(OC4(C)C(C)(O)C(C)(O)C4(C)O3)OC12C
CHEMBL1992520 CCN1\C(=C\C#C\C(=C/c2sc3ccccc3[n+]2CC)\C)\Sc4ccccc14.[F-][PH2+5]([F-])([F-])([F-])([F-])[F-]
CHEMBL1998318 CC12O[Te]34OC(C)(C1(C)O3)C2(C)O4
CHEMBL2097021 O[Te](=O)(=O)O
CHEMBL2146197 [Cl-].CC[N+](CC)(CC)Cc1ccccc1.ClC2=C[Te](Cl)(Cl)OC2
CHEMBL2146209 [Cl-].Cl[Te]1(Cl)OCCO1
CHEMBL2146259 N.[Cl-].[Cl-].[Cl-].C1C[O-][Te+4][O-]1
CHEMBL2146289 N.[Cl-].[Cl-].[Cl-].CCCCC1C[O-][Te+4][O-]1
CHEMBL2146290 N.[Cl-].[Cl-].[Cl-].CCCCCCC1C[O-][Te+4][O-]1
CHEMBL2299271 CN1C=NNC1(=S)c2sc3nnc(c4ccccc4)c(c5ccccc5)c3c2O
CHEMBL3182693 [NH4+].[NH4+].F[Si-2](F)(F)(F)(F)F
CHEMBL3184182 [Na+].[Na+].F[Si-2](F)(F)(F)(F)F
CHEMBL3187332 CC(=O)OCC(NC(=O)C(CC1=C2=CC=CC=C2N=C1)NC(=O)OC(C)(C)C)C3OC(C(OC(=O)C)C3OC(=O)C)N4C=C(C)C(=O)NC4=O
CHEMBL3187972 CNc1ccc(cc1)C(=O)Oc2cc(ON=[N](O)N(C)C)c(cc2C#N)[N+](=O)[O-]
CHEMBL3188868 CN(C)[N](=NOc1cc(ON=[N+]([O-])N2CCN(CC2)C(=O)c3cc(CC4=NNC(=O)c5ccccc45)ccc3F)c(cc1[N+](=O)[O-])[N+](=O)[O-])O
CHEMBL3211150 CCC1CCCCN1C(=O)N2=NC(=CN2)C(O)(c3ccccc3)c4ccccc4
CHEMBL3348969 CSCC[C@H](NC(=O)[C@H](CC1=CN=C2=CC=CC=C12)NC(=O)CCNC(=O)OC(C)(C)C)C(=O)N[C@@H](CC(=O)O)C(=O)N[C@@H](Cc3ccccc3)C(=O)N
CHEMBL3349005 C[C@@H](O)[C@@H](CO)NC(=O)[C@@H]1CSSC[C@H](NC(=O)[C@H](N)Cc2ccccc2)C(=O)N[C@@H](Cc3ccccc3)C(=O)N[C@H](CC4=CN=C5=CC=CC=C45)C(=O)N[C@@H](CCCCN)C(=O)N[C@@H](C(C)O)C(=O)N1
CHEMBL3392104 [NH4+].[Cl-].Cl[Te]1(Cl)OCCO1
CHEMBL3397072 FC1=Fc2c(C=C1)[nH]cc2C3CCN(CCCCN4C(=O)N5C=CC=CC5=C(C4=O)c6ccc(F)cc6)CC3
CHEMBL3544677 CN(Cc1cnc2nc(N)nc(N)c2n1)c3ccc(cc3)C(=O)N[C@@H](CCC(=O)N[C@@H](CCC(=O)O)C(=O)O)C(=O)O
CHEMBL3546168 Cl[Te]1(Cl)OCCO1
CHEMBL3558859 C1C[O-][Te+4][O-]1
CHEMBL3558860 CCCCC1C[O-][Te+4][O-]1
CHEMBL3558861 CCCCCCC1C[O-][Te+4][O-]1
CHEMBL3559384 CC[N+](CC)(CC)Cc1ccccc1.ClC2=C[Te](Cl)(Cl)OC2
CHEMBL3561635 O.O.O.O.O=C1O[Mg]2(OC(=O)c3ccccc3O2)Oc4ccccc14
CHEMBL3580437 O=C1O[Mg]2(OC(=O)c3ccccc3O2)Oc4ccccc14
CHEMBL3593577 CN1C(=O)NC2=CN3(=C4NC=CC4=C12)CCCCC(C3)N5C(=O)Nc6cnc7[nH]ccc7c56
CHEMBL3594279 C[C@H]1O[C@H](C[C@H](O)[C@@H]1O)O[C@H]2[C@@H](O)C[C@H](O[C@H]3[C@@H](O)C[C@H](O[C@@H](C)C[C@H]4CC[C@@H]5[C@H](C[C@@H](O)[C@]6(C)[C@H](CC[C@]56O)C7=CC(=O)O/C/7=C\c8ccc(cc8)N(C)C)[C@@H]4C)O[C@@H]3C)O[C@@H]2C
CHEMBL361437 COc1ccc(cc1)[Te]2(Cl)OC3(CCCCC3)/C/2=C\Br
CHEMBL3832892 O.CCN1C(=O)c2cccc3c(ccc(C1=O)c23)N4C=C5CN67CCCN8CCN9%10CCCN(CC6)[Cu]789(N%11=NN(C=C%11C%10)c%12ccc%13C(=O)N(CC)C(=O)c%14cccc%12c%13%14)N5=N4.[O-]Cl(=O)(=O)=O.[O-]Cl(=O)(=O)=O
CHEMBL3832893 O.CCN1C(=O)c2cccc3c(ccc(C1=O)c23)N4C=C5CN67CCCN8CCN9%10CCCN(CC6)[Zn]789(N%11=NN(C=C%11C%10)c%12ccc%13C(=O)N(CC)C(=O)c%14cccc%12c%13%14)N5=N4.[O-]Cl(=O)(=O)=O.[O-]Cl(=O)(=O)=O
CHEMBL3832897 CCN1C(=O)c2cccc3c(ccc(C1=O)c23)N4C=C5CN67CCCN8CCN9%10CCCN(CC6)[Zn]789(N%11=NN(C=C%11C%10)c%12ccc%13C(=O)N(CC)C(=O)c%14cccc%12c%13%14)N5=N4
CHEMBL3833021 CCN1C(=O)c2cccc3c(ccc(C1=O)c23)N4C=C5CN67CCCN8CCN9%10CCCN(CC6)[Cu]789(N%11=NN(C=C%11C%10)c%12ccc%13C(=O)N(CC)C(=O)c%14cccc%12c%13%14)N5=N4
CHEMBL450200 CC1=NC2=N(=NC=N2)C(=C1)OCCCN(CC(c3ccccc3)c4ccccc4)Cc5cccc(c5Cl)C(F)(F)F
CHEMBL471869 [Na+].[Na+].[O-][Te](=O)(=O)[O-]
Code is attached.
However, my copy of RDKit parses some of these SMILES, and the corresponding
records in the .sdf.gz file, without a problem. For example, consider:
CHEMBL2097021 O[Te](=O)(=O)O
The corresponding record in the .sdf.gz file is:
CHEMBL2097021
  SciTegic12231509382D

  5  4  0  0  0  0            999 V2000
    5.1833   -7.8333    0.0000 Te  0  0
    5.1833   -7.0083    0.0000 O   0  0
    6.0125   -7.8333    0.0000 O   0  0
    5.1833   -8.6583    0.0000 O   0  0
    4.3625   -7.8333    0.0000 O   0  0
  2  1  1  0
  3  1  1  0
  4  1  2  0
  5  1  2  0
M  END
> <chembl_id>
CHEMBL2097021

$$$$
What's happened is that ChEMBL is using RDKit 2016.03.4 to generate the .fps.gz
file, and the supported valence states for metals have changed in RDKit since
then to support this chemistry. See https://github.com/rdkit/rdkit/issues/1403
for some discussion.
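One way to check which toolkit version produced an FPS file is to read its header. The FPS format allows "#key=value" metadata lines before the first fingerprint record, including a "software" key. The sketch below assumes the file carries that optional line; the sample lines are made up for illustration, and read_fps_header is a hypothetical helper, not part of chemfp or RDKit.

```python
def read_fps_header(lines):
    """Collect the '#key=value' metadata lines from the top of an FPS file."""
    metadata = {}
    for line in lines:
        if not line.startswith("#"):
            break  # first fingerprint record, so the header is over
        key, sep, value = line[1:].rstrip("\r\n").partition("=")
        if sep:  # the '#FPS1' signature line has no '=' and is skipped
            metadata[key] = value
    return metadata


# Made-up sample: a signature line, two metadata lines, one fingerprint record.
sample = [
    "#FPS1\n",
    "#num_bits=2048\n",
    "#software=RDKit/2016.03.4\n",
    "00ff\tCHEMBL_TOY_1\n",
]

header = read_fps_header(sample)
print(header.get("software"))
```

For the real file, under Python 3, the same function should work on a text-mode handle such as gzip.open(fps_filename, "rt").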
Cheers,
Andrew
da...@dalkescientific.com
from __future__ import print_function

import sys
import gzip
import subprocess
import os
import sqlite3

sdf_filename = "/Users/dalke/cvses/chemfp_benchmark/source_datasets/chembl_23.sdf.gz"
fps_filename = "/Users/dalke/cvses/chemfp_benchmark/source_datasets/chembl_23.fps.gz"
sql_filename = "/Users/dalke/databases/chembl_23/chembl_23_sqlite/chembl_23.db"

# Escape sequence that blanks out the previous status message, if any.
_erase = ""

def status(*terms):
    # Show a progress message on stderr, overwriting the previous one.
    msg = " ".join(map(str, terms))
    global _erase
    if _erase:
        sys.stderr.write(_erase)
    sys.stderr.write(msg)
    _erase = "\r" + (" "*len(msg)) + "\r"
    sys.stderr.flush()

def get_chembl_ids(sdf_filename):
    # Use zgrep to pull out each "> <chembl_id>" line plus the line after it.
    seen_ids = set()
    p = subprocess.Popen(["zgrep", "--after", "1", "> <chembl_id>", sdf_filename],
                         stdout=subprocess.PIPE)
    for line in p.stdout:
        # Skip the tag line itself and grep's "--" group separator.
        if line[:1] in (b"-", b">"):
            continue
        id = line.strip().decode("utf8")
        seen_ids.add(id)
        n = len(seen_ids)
        if n % 10000 == 0:
            status("Read", n, "SDF record ids")
    p.wait()
    status("")
    return seen_ids

def get_chembl_fp_ids(fps_filename):
    # Each FPS record is "<hex fingerprint>\t<id>"; header lines start with "#".
    seen_ids = set()
    with gzip.open(fps_filename) as reader:
        for line in reader:
            if line[:1] == b"#":
                continue
            fields = line[:-1].split(b"\t")
            id = fields[1].decode("utf8")
            seen_ids.add(id)
            n = len(seen_ids)
            if n % 50000 == 0:
                status("Read", n, "FPS record ids")
    status("")
    return seen_ids

def report(label, ids, db):
    # Print the number of ids, then each id with its canonical SMILES.
    print(label, len(ids), "ids")
    for id in sorted(ids):
        c = db.execute("""
SELECT canonical_smiles FROM molecule_dictionary, compound_structures
 WHERE chembl_id = ?
   AND molecule_dictionary.molregno = compound_structures.molregno""",
                       (id,))
        canonical_smiles, = c.fetchone()
        print(" ", id, canonical_smiles)

def main():
    if not os.path.exists(sql_filename):
        sys.exit("File %r does not exist" % (sql_filename,))
    db = sqlite3.connect(sql_filename)
    fps_ids = get_chembl_fp_ids(fps_filename)
    sdf_ids = get_chembl_ids(sdf_filename)
    report("Only in .fps:", fps_ids - sdf_ids, db)
    report("Only in .sdf:", sdf_ids - fps_ids, db)

if __name__ == "__main__":
    main()
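
If zgrep is not available, the SDF ids can also be read in pure Python. This is a minimal sketch rather than a drop-in replacement: iter_sdf_tag_values is a hypothetical helper, it assumes the data header is written exactly as "> <chembl_id>" (the same assumption the zgrep pattern above makes), and the sample lines are abbreviated, made-up records.

```python
def iter_sdf_tag_values(lines, tag="chembl_id"):
    """Yield the line that follows each '> <tag>' data header in an SD file."""
    marker = "> <%s>" % tag
    want_value = False
    for line in lines:
        line = line.rstrip("\r\n")
        if want_value:
            # The line right after the data header holds the value.
            yield line.strip()
            want_value = False
        elif line.startswith(marker):
            want_value = True


# Abbreviated, made-up records: only the lines that matter for the id lookup.
sample_sdf = [
    "CHEMBL2097021\n",
    "M  END\n",
    "> <chembl_id>\n",
    "CHEMBL2097021\n",
    "\n",
    "$$$$\n",
    "CHEMBL3546168\n",
    "M  END\n",
    "> <chembl_id>\n",
    "CHEMBL3546168\n",
    "\n",
    "$$$$\n",
]

ids = set(iter_sdf_tag_values(sample_sdf))
print(sorted(ids))
```

Under Python 3 it can be pointed at the real file with set(iter_sdf_tag_values(gzip.open(sdf_filename, "rt"))). I have not timed it against the zgrep approach.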