I'll note that the official definitions for all the chemical entities in the PDB can be found in the wwPDB's Chemical Component Dictionary: https://www.wwpdb.org/data/ccd
That's in mmCIF format, but there are various SMILES and InChI definitions for the residues included in the file. (Your mileage may vary for the quality of those representations, though, especially for the rarer ones, but it should be no worse than the SDFs.)

You should be able to use an mmCIF parser to extract them, e.g.

    from mmcif.core.mmciflib import ParseCifSimple  # py-mmcif from the RCSB: `pip install mmcif`

    ccd = ParseCifSimple("components.cif", True, 0, 255, "?", "logfile.txt")  # logfile.txt is an arbitrary name
    ALA = ccd.GetBlock("ALA")
    desc = ALA.GetTable("pdbx_chem_comp_descriptor")
    print(desc.GetColumnNames())
    for ii in range(desc.GetNumRows()):
        print(desc.GetRow(ii))

Output:

    ['comp_id', 'type', 'program', 'program_version', 'descriptor']
    ['ALA', 'SMILES', 'ACDLabs', '10.04', 'O=C(O)C(N)C']
    ['ALA', 'SMILES_CANONICAL', 'CACTVS', '3.341', 'C[C@H](N)C(O)=O']
    ['ALA', 'SMILES', 'CACTVS', '3.341', 'C[CH](N)C(O)=O']
    ['ALA', 'SMILES_CANONICAL', 'OpenEye OEToolkits', '1.5.0', 'C[C@@H](C(=O)O)N']
    ['ALA', 'SMILES', 'OpenEye OEToolkits', '1.5.0', 'CC(C(=O)O)N']
    ['ALA', 'InChI', 'InChI', '1.03', 'InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1']
    ['ALA', 'InChIKey', 'InChI', '1.03', 'QNAYBMKLOCPYGJ-REOHCLBHSA-N']

The components file is rather large, so parsing time might be a little long at times.

On Fri, Oct 27, 2023 at 10:55 AM He, Amy <he.1...@buckeyemail.osu.edu> wrote:

> Dear RDKit experts,
>
> I need your advice on finding a source Smiles library for reference, to
> build the template molecule from Smiles for AssignBondOrdersFromTemplate
> <https://www.rdkit.org/docs/source/rdkit.Chem.AllChem.html>.
>
> I am using AssignBondOrdersFromTemplate to perceive bonds in a
> residue-wise manner from an input PDB, using a reference Smiles library
> like this:
>
> ref_smi = {
>     "ALA": "NC(C)C(=O)",
>     "GLY": "NCC(=O)",
>     "ILE": "NC(C(C)CC)C(=O)",
> }
>
> I wonder if there has been an open reference library for common amino
> acids and ligands that are present in PDB files. A previous post on
> rdkit-discuss
> (https://rdkit-discuss.narkive.com/JM2IGLQz/pdb-reader-and-bond-perception)
> points me to this website:
>
> ftp://ftp.ebi.ac.uk/pub/databases/msd/pdbechem/files/pdb.tar.gz
>
> and useful links from
>
> http://www.ebi.ac.uk/pdbe-srv/pdbechem/
>
> But I am no longer able to access the contents.
>
> I guess we could always generate Smiles from the standardized SDF files.
> Still, I am wondering if there is an existing Smiles library (like a
> reference datafile), where we can retrieve the Smiles string using the
> residue names of common amino acids and maybe also ligands.
>
> Any comments or suggestions would be greatly appreciated. Thank you for
> your time and kind support in advance!
>
> Bests,
>
> --
> Amy He
> Chemistry Graduate Teaching Assistant
> Hadad Lab
> Ohio State University
> he.1...@osu.edu
>
> _______________________________________________
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
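
To tie the descriptor rows printed earlier in this message back to the ref_smi dictionary in the quoted question, here is a rough sketch of picking one SMILES per residue. The rows are the ALA rows copied from the output above; `PREFERRED`, `pick_smiles` and `build_ref_smi` are hypothetical names of my own, and the preference order is only an assumption (stereo-aware canonical SMILES first), not anything the wwPDB recommends.

```python
# Descriptor rows for ALA, in the column order printed earlier:
# ['comp_id', 'type', 'program', 'program_version', 'descriptor']
ALA_ROWS = [
    ["ALA", "SMILES", "ACDLabs", "10.04", "O=C(O)C(N)C"],
    ["ALA", "SMILES_CANONICAL", "CACTVS", "3.341", "C[C@H](N)C(O)=O"],
    ["ALA", "SMILES", "CACTVS", "3.341", "C[CH](N)C(O)=O"],
    ["ALA", "SMILES_CANONICAL", "OpenEye OEToolkits", "1.5.0", "C[C@@H](C(=O)O)N"],
    ["ALA", "SMILES", "OpenEye OEToolkits", "1.5.0", "CC(C(=O)O)N"],
    ["ALA", "InChI", "InChI", "1.03",
     "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"],
    ["ALA", "InChIKey", "InChI", "1.03", "QNAYBMKLOCPYGJ-REOHCLBHSA-N"],
]

# Assumed preference order of (type, program) pairs; adjust to taste.
PREFERRED = (
    ("SMILES_CANONICAL", "OpenEye OEToolkits"),
    ("SMILES_CANONICAL", "CACTVS"),
    ("SMILES", "OpenEye OEToolkits"),
    ("SMILES", "CACTVS"),
    ("SMILES", "ACDLabs"),
)


def pick_smiles(rows, preferred=PREFERRED):
    """Return the first descriptor matching the preference order, or None."""
    by_key = {}
    for _comp_id, dtype, program, _version, descriptor in rows:
        # Keep the first descriptor seen for each (type, program) pair.
        by_key.setdefault((dtype, program), descriptor)
    for key in preferred:
        if key in by_key:
            return by_key[key]
    return None  # no SMILES-type descriptor among the rows


def build_ref_smi(rows_by_residue):
    """Map residue name -> chosen SMILES, skipping residues without one."""
    ref_smi = {}
    for name, rows in rows_by_residue.items():
        smiles = pick_smiles(rows)
        if smiles is not None:
            ref_smi[name] = smiles
    return ref_smi


ref_smi = build_ref_smi({"ALA": ALA_ROWS})
print(ref_smi)
```

With the real file, the rows for each residue would come from the desc.GetRow(ii) loop shown earlier rather than from a hard-coded list. One caveat: these descriptors describe the free component (the ALA entries include the carboxylic acid OH), whereas the ref_smi entries in the quoted message leave it off, so templates for in-chain residues may need that atom trimmed before they match.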