Hi Patrick,

Thank you for the code and the links; both are very helpful and exactly what I 
needed.

Many thanks
Anthony

Kind regards
Dr Anthony Nash PhD MRSC

Senior Research Scientist
Nuffield Department of Clinical Neurosciences
RMCR Kellogg College
University of Oxford
http://www.kellogg.ox.ac.uk/

________________________________
From: Patrick Walters <wpwalt...@gmail.com>
Sent: 12 September 2021 15:27
To: Anthony Nash <anthony.n...@ndcn.ox.ac.uk>
Cc: rdkit-discuss@lists.sourceforge.net <rdkit-discuss@lists.sourceforge.net>
Subject: Re: [Rdkit-discuss] SMILES from sdf file

Hi Anthony,

This is pretty easy and you don't need to use PandasTools (although PandasTools 
are very cool).

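#!/usr/bin/env python

import sys
from rdkit import Chem

# Read molecules from the SD file given as the first command-line argument.
suppl = Chem.SDMolSupplier(sys.argv[1])
for mol in suppl:
    # The supplier yields None for records it cannot parse; skip those.
    if mol is not None:
        # Print the SMILES followed by the molecule name (the SD title line).
        print(Chem.MolToSmiles(mol), mol.GetProp("_Name"))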

By default, Chem.MolToSmiles produces canonical isomeric SMILES.
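
To illustrate what "canonical" means here, the following is a minimal sketch (the helper name canonical_smiles is made up for this example and is not part of RDKit). It parses several equivalent ways of writing ethanol and checks that Chem.MolToSmiles returns the same string for all of them.

```python
from rdkit import Chem


def canonical_smiles(smiles):
    """Return RDKit's canonical SMILES for an input SMILES, or None if parsing fails."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return Chem.MolToSmiles(mol)


# Three different but equivalent ways of writing ethanol.
variants = ["OCC", "C(O)C", "CCO"]

# Collect the distinct canonical forms; equivalent inputs should collapse to one.
canonical = {canonical_smiles(s) for s in variants}
print(canonical)
```

The same molecule written in different atom orders should yield a single canonical string, which is what makes the output usable as a lookup key.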

Here's the query I use to get drugs from ChEMBL.


select distinct canonical_smiles, chembl_id from compound_structures cs
join formulations f on cs.molregno = f.molregno
join products p on p.product_id = f.product_id
join compound_properties cp on cp.molregno = cs.molregno
join molecule_dictionary md on cp.molregno = md.molregno
where p.oral = 1
and cp.mw_freebase < 1000
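
One way to run this query from Python is sketched below, assuming you have a local SQLite copy of ChEMBL. The function names are illustrative only, and the query text is copied unchanged from above.

```python
import sqlite3

# The query from the message above, reproduced unchanged.
ORAL_DRUGS_QUERY = """
select distinct canonical_smiles, chembl_id from compound_structures cs
join formulations f on cs.molregno = f.molregno
join products p on p.product_id = f.product_id
join compound_properties cp on cp.molregno = cs.molregno
join molecule_dictionary md on cp.molregno = md.molregno
where p.oral = 1
and cp.mw_freebase < 1000
"""


def fetch_oral_drugs(conn):
    """Run the query on an open connection; return a list of (smiles, chembl_id) rows."""
    return conn.execute(ORAL_DRUGS_QUERY).fetchall()


def write_smi(rows, path):
    """Write one 'SMILES ChEMBL_ID' pair per line, space separated."""
    with open(path, "w") as handle:
        for smiles, chembl_id in rows:
            handle.write(f"{smiles} {chembl_id}\n")
```

For example, rows = fetch_oral_drugs(sqlite3.connect("chembl.db")) followed by write_smi(rows, "drugs.smi"), where both file names are placeholders for your own paths.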

If you just want the data, I have it here.

https://github.com/PatWalters/datafiles/blob/main/chembl_drugs.smi

Pat


On Sun, Sep 12, 2021 at 9:20 AM Anthony Nash 
<anthony.n...@ndcn.ox.ac.uk> wrote:
Dear all,

This sounded routine enough that I thought I'd seek guidance to save myself 
hours of hacking and potential misunderstanding.

My objective is to generate a canonical SMILES for each compound in an sdf 
file. The sdf file was downloaded from ChEMBL and contains more than 10,000 drugs. 
I've had a brief look at the RDKit API and I noticed 
rdkit.Chem.PandasTools.LoadSDF.

Unfortunately, there was no function argument documentation, so I'm unsure 
whether this function yields canonical SMILES data. However, the RDKit website 
includes the following example which suggests "something" concerning SMILES is 
being processed:


>>> sdfFile = os.path.join(RDConfig.RDDataDir,'NCI/first_200.props.sdf')
>>> frame = PandasTools.LoadSDF(sdfFile,smilesName='SMILES',molColName='Molecule',
...            includeFingerprints=True, removeHs=False, strictParsing=True)

Any guidance is hugely appreciated.

On the other hand, if anyone can suggest a one-stop source of SMILES files, 
e.g., for experimental drugs, FDA-approved drugs, or compounds "representative" 
of chemical space, that would also be appreciated.


Thanks
Anthony
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss