Hi Patrick,

Thank you for the code and the links; both are very helpful and exactly what I needed.
Many thanks
Anthony

Kind regards
Dr Anthony Nash PhD MRSC
Senior Research Scientist
Nuffield Department of Clinical Neurosciences
RMCR Kellogg College
University of Oxford
http://www.kellogg.ox.ac.uk/
________________________________
From: Patrick Walters <wpwalt...@gmail.com>
Sent: 12 September 2021 15:27
To: Anthony Nash <anthony.n...@ndcn.ox.ac.uk>
Cc: rdkit-discuss@lists.sourceforge.net <rdkit-discuss@lists.sourceforge.net>
Subject: Re: [Rdkit-discuss] SMILES from sdf file

Hi Anthony,

This is pretty easy and you don't need to use PandasTools (although PandasTools are very cool).

    #!/usr/bin/env python
    import sys
    from rdkit import Chem

    suppl = Chem.SDMolSupplier(sys.argv[1])
    for mol in suppl:
        if mol:
            print(Chem.MolToSmiles(mol), mol.GetProp("_Name"))

By default, Chem.MolToSmiles produces canonical isomeric SMILES.

Here's the query I use to get drugs from ChEMBL.

    select distinct canonical_smiles, chembl_id
    from compound_structures cs
    join formulations f on cs.molregno = f.molregno
    join products p on p.product_id = f.product_id
    join compound_properties cp on cp.molregno = cs.molregno
    join molecule_dictionary md on cp.molregno = md.molregno
    where p.oral = 1 and cp.mw_freebase < 1000

If you just want the data, I have it here.
https://github.com/PatWalters/datafiles/blob/main/chembl_drugs.smi

Pat

On Sun, Sep 12, 2021 at 9:20 AM Anthony Nash <anthony.n...@ndcn.ox.ac.uk> wrote:

Dear all,

This sounded routine enough that I thought I'd seek guidance to save myself hours of hacking and potential misunderstanding.

My objective is to generate a canonical SMILES for each compound in an sdf file. The sdf file was downloaded from ChEMBL and contains some +10,000 drugs.

I've had a brief look at the RDKit API and I noticed rdkit.Chem.PandasTools.LoadSDF. Unfortunately, there was no function argument documentation, so I'm unsure whether this function yields canonical SMILES data.
However, the RDKit website includes the following example which suggests "something" concerning SMILES is being processed:

    >>> sdfFile = os.path.join(RDConfig.RDDataDir,'NCI/first_200.props.sdf')
    >>> frame = PandasTools.LoadSDF(sdfFile,smilesName='SMILES',molColName='Molecule',
    ...            includeFingerprints=True, removeHs=False, strictParsing=True)

Any guidance is hugely appreciated. On the other hand, if anyone can suggest a one-shop list of SMILES in a file for e.g., experimental drugs, FDA approved drugs, "representative" of chemical space, etc., that would also be appreciated.

Thanks
Anthony
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
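
For anyone reading this thread later, the loop in Pat's reply can be wrapped in a small function that also reports records RDKit fails to parse. This is only a sketch: the function name sdf_to_smiles and the "record_N" fallback name are made up for illustration and are not part of RDKit. The RDKit calls are the ones from Pat's script, plus HasProp to guard against a missing title.

```python
import sys

from rdkit import Chem


def sdf_to_smiles(sdf_path):
    """Yield (smiles, name) for each record in an SD file that RDKit can parse."""
    supplier = Chem.SDMolSupplier(sdf_path)
    for index, mol in enumerate(supplier):
        if mol is None:
            # SDMolSupplier yields None for records it cannot parse; report and skip.
            print(f"record {index}: could not be parsed, skipping", file=sys.stderr)
            continue
        # _Name holds the title line of the record; the fallback label is hypothetical.
        name = mol.GetProp("_Name") if mol.HasProp("_Name") else f"record_{index}"
        # With default arguments MolToSmiles returns canonical isomeric SMILES.
        yield Chem.MolToSmiles(mol), name


if __name__ == "__main__" and len(sys.argv) > 1:
    for smiles, name in sdf_to_smiles(sys.argv[1]):
        print(smiles, name)
```

Run as a script with the SD file as the only argument, it prints one "SMILES name" pair per line, which is the same layout as a .smi file.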
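
The ChEMBL query in Pat's reply could be run from Python roughly as follows, assuming a local SQLite copy of the ChEMBL database. The file path in the usage comment is a placeholder. The function names, the max_mw parameter, and the alias prefixes on the two selected columns are additions for illustration; the joins and filters are the ones from the query in the thread.

```python
import sqlite3

# Same joins and filters as the query in the thread; the molecular weight
# cut-off is passed as a parameter instead of being hard-coded.
ORAL_DRUG_QUERY = """
SELECT DISTINCT cs.canonical_smiles, md.chembl_id
FROM compound_structures cs
JOIN formulations f ON cs.molregno = f.molregno
JOIN products p ON p.product_id = f.product_id
JOIN compound_properties cp ON cp.molregno = cs.molregno
JOIN molecule_dictionary md ON cp.molregno = md.molregno
WHERE p.oral = 1
  AND cp.mw_freebase < ?
"""


def fetch_oral_drugs(conn, max_mw=1000):
    """Return a list of (canonical_smiles, chembl_id) tuples."""
    return conn.execute(ORAL_DRUG_QUERY, (max_mw,)).fetchall()


def write_smi(rows, path):
    """Write one 'SMILES identifier' pair per line."""
    with open(path, "w") as handle:
        for smiles, chembl_id in rows:
            handle.write(f"{smiles} {chembl_id}\n")


# Usage (the database path is a placeholder, not a real file name):
#   conn = sqlite3.connect("/path/to/chembl.db")
#   write_smi(fetch_oral_drugs(conn), "chembl_drugs.smi")
```

The SMILES that come back are whatever ChEMBL stores in canonical_smiles, so if RDKit-canonical SMILES are needed, each one would still have to go through Chem.MolFromSmiles and Chem.MolToSmiles.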