Re: [Rdkit-discuss] Problems reading XYZ file
Hi Guys,

I'm sorry it took me this long to try it... But I could finally get to it, and it works well now.

Thanks for your help!
--
Gustavo Seabra.

On Tue, Apr 11, 2023 at 3:19 AM Jan Halborg Jensen wrote:
> Hi Gustavo
>
>     raw_mol = Chem.MolFromXYZFile('acetate.xyz')
>     mol = Chem.Mol(raw_mol)
>     rdDetermineBonds.DetermineBonds(mol,charge=-1)
>
> Best regards, Jan
>
> On 7 Apr 2023, at 22.57, Gustavo Seabra wrote:
>
> Hi everyone,
>
> I'm having difficulties using RDKit to read molecules from an XYZ file,
> and I would really appreciate some help.
>
> The problem is that whenever I read a molecule from an XYZ file, I get
> just a disconnected clump of atoms, not a molecule. For example, the
> following code:
>
>     import rdkit
>     from rdkit import Chem
>     from rdkit.Chem import Draw, rdmolfiles
>     mol = Chem.MolFromSmiles('COC1=C(O)C[C@@](O)(CO)CC1=O')
>     mol = Chem.AddHs(mol)
>     mol
>
>     Chem.AllChem.EmbedMolecule(mol)
>     Chem.MolToXYZFile(mol, "rdkit_mol.xyz")
>     mol2 = Chem.MolFromXYZFile('rdkit_mol.xyz')
>     mol2
>
> Is there a bug in the XYZ code, or am I missing something?
>
> Thanks!
> --
> Gustavo Seabra.
[Rdkit-discuss] Problems reading XYZ file
Hi everyone,

I'm having difficulties using RDKit to read molecules from an XYZ file, and I would really appreciate some help.

The problem is that whenever I read a molecule from an XYZ file, I get just a disconnected clump of atoms, not a molecule. For example, the following code:

    import rdkit
    from rdkit import Chem
    from rdkit.Chem import Draw, rdmolfiles
    mol = Chem.MolFromSmiles('COC1=C(O)C[C@@](O)(CO)CC1=O')
    mol = Chem.AddHs(mol)
    mol

[image: image.png]

    Chem.AllChem.EmbedMolecule(mol)
    Chem.MolToXYZFile(mol, "rdkit_mol.xyz")
    mol2 = Chem.MolFromXYZFile('rdkit_mol.xyz')
    mol2

[image: image.png]

Is there a bug in the XYZ code, or am I missing something?

Thanks!
--
Gustavo Seabra.
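Some background on why the round trip through XYZ loses the bonds: an XYZ file holds only an atom count, a comment line, and one element symbol plus Cartesian coordinates per atom, so connectivity has to be perceived again after reading. The sketch below is plain Python, not RDKit. The helper names `parse_xyz` and `guess_bonds` are hypothetical, the covalent radii are approximate, and the 1.3 tolerance factor is an arbitrary assumption. It only illustrates the idea of distance-based bond perception and is not the algorithm used by rdDetermineBonds.

```python
import math

# Approximate covalent radii in angstroms (illustrative values only).
COVALENT_RADII = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66}

# A tiny XYZ file: atom count, comment line, then "symbol x y z" per atom.
# Note that there is no bond information anywhere in the format.
WATER_XYZ = """3
water, coordinates in angstroms
O   0.000   0.000   0.117
H   0.000   0.757  -0.467
H   0.000  -0.757  -0.467
"""


def parse_xyz(text):
    """Return a list of (symbol, (x, y, z)) tuples from XYZ-formatted text."""
    lines = text.strip().splitlines()
    n_atoms = int(lines[0])
    atoms = []
    for line in lines[2:2 + n_atoms]:
        symbol, x, y, z = line.split()[:4]
        atoms.append((symbol, (float(x), float(y), float(z))))
    return atoms


def guess_bonds(atoms, tolerance=1.3):
    """Naive perception: bond two atoms if they are closer than
    tolerance * (sum of their covalent radii)."""
    bonds = []
    for i in range(len(atoms)):
        for j in range(i + 1, len(atoms)):
            (sym_i, pos_i), (sym_j, pos_j) = atoms[i], atoms[j]
            cutoff = tolerance * (COVALENT_RADII[sym_i] + COVALENT_RADII[sym_j])
            if math.dist(pos_i, pos_j) <= cutoff:
                bonds.append((i, j))
    return bonds


atoms = parse_xyz(WATER_XYZ)   # three atoms, zero bonds: the "clump of atoms"
bonds = guess_bonds(atoms)     # connectivity recovered from distances only
```

A distance criterion like this recovers which atoms are connected, but not bond orders or formal charges, which is presumably why the DetermineBonds call in the reply also takes the total charge.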
Re: [Rdkit-discuss] Generating 3D molecules for docking
Hi Francesca,

As far as I know (someone please correct me if I'm wrong), RDKit can read but cannot write files in Mol2 format. But if you have the molecules in SDF format, you can convert them to Mol2 using OpenBabel. The command would be something like:

    $ obabel -isdf sdf_file.sdf -omol2 -Omol2_file.mol2 -m

The -m option tells obabel to split the multi-molecule file into individual molecules.
--
Gustavo Seabra.

On Tue, Jul 27, 2021 at 1:37 PM Francesca Magarotto - francesca.magarot...@studio.unibo.it wrote:
> Hi,
> after a cluster analysis using a dataset of compounds from ZINC15 (in
> smiles format) I have picked a subset for virtual screening.
> However, I have a problem.
> The program Dock6 reads only TRIPOS mol2 format: is it possible to convert
> the molecules I chose for virtual screening with RDKit?
> In ZINC15 the molecules are also provided in mol2 format, but in this case
> I download all of them and not only the ones I selected after cluster
> analysis.
> I don't know what to do.
> Thanks,
> regards.
Re: [Rdkit-discuss] Maximum Common Substructure using SMARTS
On Fri, Jul 23, 2021 at 4:53 AM Paolo Tosco wrote:
> # here there seems to be a bug with the 2D depiction, but that's another story
>
> template
>
> [image: image.png]

Just a quick thing: I don't know if this is supposed to be a bug or a feature, but I noticed that this seems to be caused by properties of the Mol created from SMARTS *not* being set when the mol is created, but only when they are requested for the first time. Right after creating the mol object with Chem.MolFromSmarts, IsInRing doesn't seem to be set correctly (or at all), and the comparison operation distorts the molecule. But if you force the computation of the properties, e.g. by printing them:

    for idx, atom in enumerate(template.GetAtoms()):
        print(f"{idx:>4d} {atom.GetAtomicNum():5d} {str(atom.IsInRing()):>7} {str(atom.GetIsAromatic()):>5}")

then all seems to work as expected \o/.

Is it by design that the properties are calculated only when needed?

Gustavo.
Re: [Rdkit-discuss] Maximum Common Substructure using SMARTS
Thanks a lot!
--
Gustavo Seabra.

On Fri, Jul 23, 2021 at 12:18 PM Paolo Tosco wrote:
> Hi Gustavo,
>
> Chem.Atom.HasQuery() and Chem.Bond.HasQuery() return True when the
> underlying atom (or bond) is an instance of Chem.QueryAtom (or
> Chem.QueryBond).
> Query atoms and bonds can either be defined through SMARTS expressions...
>
>     from rdkit import Chem
>     from rdkit.Chem import rdqueries
>
>     a = Chem.Atom(6)
>     a.HasQuery()
>     False
>
>     mol = Chem.MolFromSmarts("[+1;D3]")
>     qa_from_smarts = mol.GetAtomWithIdx(0)
>     qa_from_smarts.HasQuery()
>     True
>
>     qa_from_smarts.DescribeQuery()
>     'AtomAnd\n AtomFormalCharge 1 = val\n AtomExplicitDegree 3 = val\n'
>
> ...or be directly instantiated from Python and combined at your leisure;
> through this approach you can actually define very specific queries that
> may not be possible to describe with SMARTS.
> Below I show how to construct the same query atom as from the above SMARTS
> expression:
>
>     qa = rdqueries.FormalChargeEqualsQueryAtom(1)
>     qa
>
>     qa.HasQuery()
>     True
>
>     qa.DescribeQuery()
>     'AtomFormalCharge 1 = val\n'
>
>     qa2 = rdqueries.ExplicitDegreeEqualsQueryAtom(3)
>     qa2.DescribeQuery()
>     'AtomExplicitDegree 3 = val\n'
>
>     qa.ExpandQuery(qa2)
>     qa.DescribeQuery()
>     'AtomAnd\n AtomFormalCharge 1 = val\n AtomExplicitDegree 3 = val\n'
>
> Cheers,
> p.
>
> On Fri, Jul 23, 2021 at 5:47 PM Gustavo Seabra wrote:
>
>> This works perfectly!
>>
>> I could understand most of what you did there ;-), but what does the
>> ".HasQuery()" mean? The RDKit API is not very clear about it: "Returns
>> whether or not the atom has an associated query". Is this described
>> anywhere else?
>>
>> Thank you so much!
>> --
>> Gustavo Seabra.
Re: [Rdkit-discuss] Maximum Common Substructure using SMARTS
This works perfectly!

I could understand most of what you did there ;-), but what does the ".HasQuery()" mean? The RDKit API is not very clear about it: "Returns whether or not the atom has an associated query". Is this described anywhere else?

Thank you so much!
--
Gustavo Seabra.

On Fri, Jul 23, 2021 at 4:53 AM Paolo Tosco wrote:
> Hi Gustavo,
>
> you should be able to address this with a custom AtomCompare (and
> BondCompare, if you want to use bond queries too) class, that now is also
> supported from Python.
> You can take a look at Code/GraphMol/FMCS/Wrap/testFMCS.py for
> inspiration how to use it; here's something that seems to work for your
> example:
>
>     from rdkit import Chem
>     from rdkit.Chem import rdFMCS
>
>     template = Chem.MolFromSmarts('[a]1(-[S](-*)(=[O])=[O]):[a]:[a]:[a]:[a]:[a]:1')
>     # This should give a sulfone connected to an aromatic ring and
>     # some other (any) element. Notice that the ring may have
>     # any atoms (N,C,O), but for me it is important to have the SO2 group.
>
>     template
>     [image: image.png]
>
>     mol1 = Chem.MolFromSmiles('CS(=O)(=O)c1ccc(C2=C(c3c3)CCN2)cc1')
>     # This molecule has the pattern.
>
>     mol1
>     [image: image.png]
>
>     compare = [template, mol1]
>     res = rdFMCS.FindMCS(compare,
>                          atomCompare=rdFMCS.AtomCompare.CompareElements,
>                          bondCompare=rdFMCS.BondCompare.CompareAny,
>                          ringMatchesRingOnly=False,
>                          completeRingsOnly=False)
>     res.smartsString
>     # gives: '[#16](=[#8])=[#8]'
>
>     # Let's address the problem with a custom AtomCompare class:
>
>     class CompareQueryAtoms(rdFMCS.MCSAtomCompare):
>         def __call__(self, p, mol1, atom1, mol2, atom2):
>             a1 = mol1.GetAtomWithIdx(atom1)
>             a2 = mol2.GetAtomWithIdx(atom2)
>             if ((not a1.HasQuery()) and (not a2.HasQuery()) and
>                     a1.GetAtomicNum() != a2.GetAtomicNum()):
>                 return False
>             if (p.MatchValences and a1.GetTotalValence() != a2.GetTotalValence()):
>                 return False
>             if (p.MatchChiralTag and not self.CheckAtomChirality(p, mol1, atom1, mol2, atom2)):
>                 return False
>             if (p.MatchFormalCharge and (not a1.HasQuery()) and (not a2.HasQuery())
>                     and not self.CheckAtomCharge(p, mol1, atom1, mol2, atom2)):
>                 return False
>             if p.RingMatchesRingOnly:
>                 return self.CheckAtomRingMatch(p, mol1, atom1, mol2, atom2)
>             if ((a1.HasQuery() or a2.HasQuery()) and (not a1.Match(a2))):
>                 return False
>             return True
>
>     params = rdFMCS.MCSParameters()
>     params.AtomCompareParameters.RingMatchesRingOnly = False
>     params.BondCompareParameters.RingMatchesRingOnly = False
>     params.AtomCompareParameters.CompleteRingsOnly = False
>     params.BondCompareParameters.CompleteRingsOnly = False
>     params.BondTyper = rdFMCS.BondCompare.CompareAny
>     params.AtomTyper = CompareQueryAtoms()
>
>     compare = [template, mol1]
>     res = rdFMCS.FindMCS(compare, params)
>     res.smartsString
>     '[#16](-[#0,#6])(=[#8])(=[#8])-[#0,#6]1:[#0,#6]:[#0,#6]:[#0,#6]:[#0,#6]:[#0,#6]:1'
>
>     # the queryMol returned by MCS will match the template, but the original template query
>     # has many more details, so we extract the MCS part of the original template and use that
>     # as query instead
>
>     def trim_template(template, query):
>         template_mcs_core = Chem.ReplaceSidechains(template, query)
>         for a in template_mcs_core.GetAtoms():
>             if (not a.GetAtomicNum()) and a.GetIsotope():
>                 a.SetAtomicNum(1)
>                 a.SetIsotope(0)
>         return Chem.RemoveAllHs(template_mcs_core)
>
>     query_mol = trim_template(template, res.queryMol)
>     template.GetSubstructMatch(query_mol)
>     (0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
>
>     # here there seems to be a bug with the 2D depiction, but that's another story
>     template
>     [image: image.png]
>
>     mol1.GetSubstructMatches(query_mol)
>     ((4, 1, 0, 2, 3, 5, 6, 7, 19, 20),)
>
>     mol1
>     [image: image.png]
>
>     mol2 = Chem.MolFromSmiles('Cc1ccc(C2=CCNC2c2ccc(C(C)(F)F)nc2)nn1')
>     compare = [template, mol2]
>
>     mol2
>     [image: image.png]
>
>     res = rdFMCS.FindMCS(compare, params)
>     res.smartsString
>     '[#0,#6]1:[#0,#6]:[#0,#6]:[#0,#6]:[#0,#7]:[#0,#7]:1'
>
>     query_mol = trim_template(template, res.queryMol)
>
>     query_mol
>     [image: image.png]
>
>     mol2.GetSubstructMatches(query_mol)
>     ((1, 2, 3, 4, 20, 21), (10, 11, 12, 13, 18, 19))
>
>     mol2
>     [image: image.png]
>
> I hope the above helps, cheers
> p.
Re: [Rdkit-discuss] Maximum Common Substructure using SMARTS
Hi,

Thanks a lot for the reply!

However, in this case, it looks like I would have to somehow label the isotope in every query molecule, right? For example:

```
template = Chem.MolFromSmarts('[c]1(-[2S](=[3O])(=[3O])(-C)):[c]:[c]:[c]:[c]:[c]:1')
mol1 = Chem.MolFromSmiles('CS(=O)(=O)c1ccc(C2=C(c3c3)CCN2)cc1')
compare = [template,mol1]
res = rdFMCS.FindMCS(compare,
                     atomCompare=rdFMCS.AtomCompare.CompareIsotopes,
                     bondCompare=rdFMCS.BondCompare.CompareAny,
                     ringMatchesRingOnly=False,
                     completeRingsOnly=False)
res.smartsString
```

returns: '[0*]1:[0*]:[0*]:[0*]:[0*]:[0*]:1', that is, it only picks the ring but not the sulfone.

I actually want the sulfone to be found, if it is there. My problem is that I also want flexibility to change the ring atoms and still find the ring as a match, while considering a match on the sulfone only if it really is there. (e.g., CF3 should *not* match.)

Does it make sense?

Thanks a lot!
--
Gustavo Seabra.

On Thu, Jul 22, 2021 at 4:52 PM Andrew Dalke wrote:
> Hi Gustavo,
>
> > template = Chem.MolFromSmarts('[a]1(-[S](-*)(=[O])=[O]):[a]:[a]:[a]:[a]:[a]:1')
>
> Unless things have changed since I last looked at the algorithm, you can't
> meaningfully pass a SMARTS-based query molecule into the MCS program,
> outside of a few simple cases.
>
> It generates a SMARTS pattern based on the properties of the molecule. You
> asked it to CompareElements, but those [a] terms all have an atomic number
> of 0.
>
> >>> template = Chem.MolFromSmarts('[a#1]1(-[S](-*)(=[O])=[O]):[a#1]:[a#1]:[a#1]:[a#1]:[a#1]:1')
> >>> [a.GetAtomicNum() for a in template.GetAtoms()]
> [0, 16, 0, 8, 8, 0, 0, 0, 0, 0]
>
> That's why your CompareAny search returns the #0 terms, like:
>
> '[#16,#6](-[#0,#6])(=,-[#8,#9])(=,-[#8,#9])-[#0,#6]1:[#0,#6]:[#0,#6]:[#0,#6]:[#0,#6]:[#0,#7]:1'
>
> > I would appreciate some pointers on how it would be possible to find the
> > maximum common substructure of 2 molecules, where in the template structure
> > some atoms may be *any*, but some other atoms must be fixed.
>
> Perhaps with isotope labelling?
>
> That is, label the "any" atoms as isotope 1, and label your
> -[S](=[O])(=[O])- as -[2S](=[3O])(=[3O])-
>
> Then use rdFMCS.AtomCompare.CompareIsotopes .
>
> If there's anything you don't want to match at all, give each atom a
> unique isotope value.
>
> Best regards,
>
> Andrew
> da...@dalkescientific.com
[Rdkit-discuss] Maximum Common Substructure using SMARTS
Hi all,

I would appreciate some pointers on how it would be possible to find the maximum common substructure of 2 molecules, where in the template structure some atoms may be *any*, but some other atoms must be fixed.

Currently, I'm trying to use the rdFMCS module. For example:

    from rdkit import Chem
    from rdkit.Chem import rdFMCS

    template = Chem.MolFromSmarts('[a]1(-[S](-*)(=[O])=[O]):[a]:[a]:[a]:[a]:[a]:1')
    # This should give a sulfone connected to an aromatic ring and
    # some other (any) element. Notice that the ring may have
    # any atoms (N,C,O), but for me it is important to have the SO2 group.

    mol1 = Chem.MolFromSmiles('CS(=O)(=O)c1ccc(C2=C(c3c3)CCN2)cc1')
    # This molecule has the pattern.

    # Now, if I try to find a substructure match, I use:
    compare = [template, mol1]
    res = rdFMCS.FindMCS(compare,
                         atomCompare=rdFMCS.AtomCompare.CompareElements,
                         bondCompare=rdFMCS.BondCompare.CompareAny,
                         ringMatchesRingOnly=False,
                         completeRingsOnly=False)
    res.smartsString
    # gives: '[#16](=[#8])=[#8]'
    # Notice that the only match is the SO2, it does not match the ring.

However, if I try that with another structure that has a CF3 in place of the SO2, I get:

    mol2 = Chem.MolFromSmiles('Cc1ccc(C2=CCNC2c2ccc(C(C)(F)F)nc2)nn1')
    compare = [template,mol2]
    res = rdFMCS.FindMCS(compare,
                         atomCompare=rdFMCS.AtomCompare.CompareElements,
                         bondCompare=rdFMCS.BondCompare.CompareAny,
                         ringMatchesRingOnly=False,
                         completeRingsOnly=False)
    res.smartsString
    # Returns: '' (empty string)

    # if I change to AtomCompare.CompareAny, now a CF3 will also match
    # in the SO2-X:
    mol2 = Chem.MolFromSmiles('Cc1ccc(C2=CCNC2c2ccc(C(C)(F)F)nc2)nn1')
    compare = [template,mol2]
    res = rdFMCS.FindMCS(compare,
                         atomCompare=rdFMCS.AtomCompare.CompareAny,
                         bondCompare=rdFMCS.BondCompare.CompareAny,
                         ringMatchesRingOnly=False,
                         completeRingsOnly=False)
    res.smartsString
    # Returns: '[#16,#6](-[#0,#6])(=,-[#8,#9])(=,-[#8,#9])-[#0,#6]1:[#0,#6]:[#0,#6]:[#0,#6]:[#0,#6]:[#0,#7]:1'

But now the CF3 is counted in place of the SO2. The result I'd like to get here would be just the ring, as in the case:

    new_template = Chem.MolFromSmarts('CS(=O)(=O)c1cnccc1')
    mol2 = Chem.MolFromSmiles('Cc1ccc(C2=CCNC2c2ccc(C(C)(F)F)nc2)nn1')
    compare = [new_template,mol2]
    res = rdFMCS.FindMCS(compare,
                         atomCompare=rdFMCS.AtomCompare.CompareElements,
                         bondCompare=rdFMCS.BondCompare.CompareAny,
                         ringMatchesRingOnly=False,
                         completeRingsOnly=False)
    res.smartsString
    # Returns: '[#6]1:[#6]:[#7]:[#6]:[#6]:[#6]:1' (just the ring)

Notice that if I use CompareElements, there seems to be no way to match the ring with either N or C.

Does anyone have a suggestion on how I can specify flexibility (similar to AtomCompare.CompareAny) only for a portion of the molecule and still enforce specific atoms in another portion?

Thank you so much!
--
Gustavo Seabra.
Re: [Rdkit-discuss] Autodock Vina
Hi Velik,

I do this on a regular basis for our generators here. Basically, what you will need to do is:

1. Generate 3D structures for the molecules (RDKit can do that)
2. Save to SDF files (again, RDKit)
3. Convert to PDBQT (I use OpenBabel: "$ obabel -isdf structures.sdf -opdbqt -Oname-.pdbqt -m")

Then you'll have the files you need. Of course, you will still need to build the pdbqt file for the target and the vina_config file, but you only need to do that once.

All the best,
--
Gustavo Seabra.

On Tue, Jun 22, 2021 at 4:08 AM Velik Velikov wrote:
> Dear all,
>
> I am constructing new molecules (de novo design) that are drug-like with
> RDKit. I have my molecules in SMILES now and I need to check them with
> AutoDock Vina. I have never used it and I have been trying since last week
> but I kind of don’t know where to go from here.
>
> What is my config file, ligand or receptor? Do I need MGL Tools, PyMOL or
> something else?
>
> Also, I couldn’t run it on my mac - Big Sur, I tried with a VirtualBox but
> it didn’t work out either. I am thinking about installing Autodock Vina on
> my old windows laptop now. Appreciate any help with this tool. Thanks in
> advance.
>
> Best,
>
> Velik Velikov
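Step 3 of the recipe above can be scripted from Python. The sketch below only assembles the obabel command line quoted in the message, using the standard library. `build_obabel_command` and `convert` are hypothetical helper names, the file names are placeholders, and actually running the conversion assumes OpenBabel is installed and on the PATH.

```python
import shlex
import subprocess


def build_obabel_command(sdf_path, out_pattern, out_format="pdbqt"):
    """Assemble the obabel call from the message as an argument list.

    -m asks obabel to write one output file per input molecule.
    """
    return [
        "obabel",
        "-isdf", sdf_path,       # input format and input file
        f"-o{out_format}",       # output format
        f"-O{out_pattern}",      # output file name pattern
        "-m",                    # split into individual molecules
    ]


def convert(sdf_path, out_pattern):
    """Run the conversion; raises CalledProcessError if obabel fails."""
    subprocess.run(build_obabel_command(sdf_path, out_pattern), check=True)


command = build_obabel_command("structures.sdf", "name-.pdbqt")
command_line = shlex.join(command)  # shell-style string, handy for logging
```

Passing the arguments as a list rather than one shell string avoids quoting problems when file names contain spaces.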
Re: [Rdkit-discuss] 2021.03.1 RDKit Release
Thanks a lot to Greg and all contributors for the continuing development of this project!
--
Gustavo Seabra.

On Fri, Mar 26, 2021 at 11:16 AM Greg Landrum wrote:
> Dear all,
>
> I'm pleased to announce that the 2021.03 version of the RDKit is released.
> We actually managed to get the .03 release done during March. Shocking! ;-)
> The release notes are below.[1]
>
> The release files are on the github release page:
> https://github.com/rdkit/rdkit/releases/tag/Release_2021_03_1
> The DOI for this release is:
> https://doi.org/10.5281/zenodo.4639022
>
> I do not plan to do conda builds for the Python wrappers in the rdkit
> channel for this release. The builds done as part of the conda-forge
> project are automated and cover more Python versions and operating systems
> than I could ever hope to do manually.
> Please install the rdkit using conda-forge:
>     conda install -c conda-forge rdkit
> I believe that the conda-forge builds of the new version should appear
> over the next couple of days.
>
> I hope to finish the conda builds of the PostgreSQL cartridge for linux
> and the mac and have them available in the rdkit channel by later today
> or tomorrow.
>
> The online version of the documentation at rdkit.org
> (http://rdkit.org/docs/index.html) has been updated.
>
> Thanks to everyone who submitted code, bug reports, and suggestions for
> this release!
>
> Please let me know if you find any problems with the release or have
> suggestions for the next one, which is scheduled for September/October 2021.
>
> Best Regards,
> -greg
> [1] We probably should figure out some way to make the release notes a bit
> less verbose. ;-)
>
> # Release_2021.03.1
> (Changes relative to Release_2020.09.1)
>
> ## Backwards incompatible changes
> - The distance-geometry based conformer generation now by default generates
>   trans(oid) conformations for amides, esters, and related structures. This can
>   be toggled off with the `forceTransAmides` flag in EmbedParameters. Note that
>   this change does not impact conformers created using one of the ET versions.
>   (#3794)
> - The conformer generator now uses symmetry by default when doing RMS pruning.
>   This can be disabled using the `useSymmetryForPruning` flag in
>   EmbedParameters. (#3813)
> - Double bonds with unspecified stereochemistry in the products of chemical
>   reactions now have their stereo set to STEREONONE instead of STEREOANY
>   (#3078)
> - The MolToSVG() function has been moved from rdkit.Chem to rdkit.Chem.Draw
>   (#3696)
> - There have been numerous changes to the RGroup Decomposition code which change
>   the results. (#3767)
> - In RGroup Decomposition, when onlyMatchAtRGroups is set to false, each molecule
>   is now decomposed based on the first matching scaffold which adds/uses the
>   least number of non-user-provided R labels, rather than simply the first
>   matching scaffold.
>   Among other things, this allows the code to provide the same results for both
>   onlyMatchAtRGroups=true and onlyMatchAtRGroups=false when suitable scaffolds
>   are provided without requiring the user to get overly concerned about the
>   input ordering of the scaffolds. (#3969)
> - There have been numerous changes to `GenerateDepictionMatching2DStructure()` (#3811)
> - Setting the kekuleSmiles argument (doKekule in C++) to MolToSmiles will now
>   cause the molecule to be kekulized before SMILES generation. Note that this
>   can lead to an exception being thrown. Previously this argument would only
>   write kekulized SMILES if the molecule had already been kekulized (#2788)
> - Using the kekulize argument in the MHFP code will now cause the molecule to be
>   kekulized before the fingerprint is generated. Note that because kekulization
>   is not canonical, using this argument currently causes the results to depend
>   on the input atom numbering. Note that this can lead to an exception being
>   thrown. (#3942)
> - Gradients for angle and torsional restraints in both UFF and MMFF were computed
>   incorrectly, which could give rise to potential instability during minimization.
>   As part of fixing this problem, force constants have been switched to using
>   kcal/degree^2 units instead of kcal/rad^2 units, consistently with the fact that
>   angle and dihedral restraints are specified in degrees. (#3975)
>
> ## Highlights
> - MolDraw2D now does a much better job of handling query features like common
>   query bond types, atom lists, variable attachment points, and link nodes. It
>   also supports adding annotations at the molecule level, displaying brackets
>   for Sgro
[Rdkit-discuss] Get conformers as independent mols?
Hi all,

Could anyone please help me with ideas on how to visualize molecule conformers inside a Jupyter notebook?

I generate the conformers, for example, using:

    AllChem.EmbedMultipleConfs(mol, numConfs=5)

and would like to see them in 3D inside the notebook. I tried using NGLView (https://github.com/nglviewer/nglview), but it only shows what I believe is the first conformer in the molecule. How can I change the conformer shown? Or is there a way to convert the conformers to Mol objects?

Any idea would be greatly appreciated.

Thank you!
--
Gustavo Seabra.
Re: [Rdkit-discuss] activate my-rdkit-env from python script
Well, I stand corrected. From Norwid's answer, it seems it may be possible to change the environment during execution.

Still, remember that this is the opposite of the idea of having environments! The whole idea of conda environments is to have a contained space with all you need. If you need to change environments at runtime, it just means that your environment is missing something...

--
Gustavo Seabra

From: Jeff Saxon
Sent: Wednesday, December 2, 2020 9:29:37 AM
To: Gustavo Seabra ; rdkit-discuss@lists.sourceforge.net
Subject: Re: [Rdkit-discuss] activate my-rdkit-env from python script

Right, many thanks! Yes, each time, before I run any script using the python from the conda that I used to install RDKit, I have to source the proper environment directly in bash, after which everything works correctly.

By the way, why did

    #subprocess.run('conda activate my-rdkit-env', shell=True)

not work? I thought it would be the same as the aforementioned step, but it asks me:

    To initialize your shell, run
        $ conda init
    Currently supported shells are:
      - bash
      - fish
      - tcsh
      - xonsh
      - zsh
      - powershell

On Wed, Dec 2, 2020 at 14:25, Gustavo Seabra wrote:
>
> I don't believe that it is possible. You have to run your script from within
> the environment where you installed rdkit.
>
> What I actually do is to have a work environment, and then install all the
> packages I need in this same env.
>
> --
> Gustavo Seabra
>
> From: Jeff Saxon
> Sent: Wednesday, December 2, 2020 6:48:47 AM
> To: rdkit-discuss@lists.sourceforge.net
> Subject: [Rdkit-discuss] activate my-rdkit-env from python script
>
> Dear All,
>
> Since I installed RDKIT using conda, I have to use the following
> command from my bash terminal to activate the RDKIT environment:
>     conda activate my-rdkit-env
> How can I do the same but inside my python script?
> I have already tried to call subprocess, but it did not work
>     # source environment from python script;
>     subprocess.run('conda init bash', shell=True)
>     subprocess.run('conda activate my-rdkit-env', shell=True)
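On why the subprocess.run('conda activate ...') call quoted above has no lasting effect: each subprocess.run starts its own child process, and whatever that child does to its environment disappears when it exits; it never reaches the Python process that launched it. The sketch below demonstrates this with the standard library only. It then shows one possible alternative, launching the script through `conda run -n <env>`, which is not mentioned in the thread and is offered here as an assumption about what might fit; `conda_run_command` is a hypothetical helper, and `my_script.py` and `ligands.sdf` are placeholder names.

```python
import os
import subprocess
import sys

# 1. A child process cannot change the environment of its parent.
#    The child sets a variable in its own environment and exits.
child_code = "import os; os.environ['RDKIT_THREAD_DEMO_VAR'] = 'set-in-child'"
subprocess.run([sys.executable, "-c", child_code], check=True)

# Back in the parent, the variable was never set.
seen_in_parent = os.environ.get("RDKIT_THREAD_DEMO_VAR")


# 2. Hypothetical helper: build a command that runs a script inside a
#    named conda environment without activating it in the current shell.
def conda_run_command(env_name, script, *args):
    return ["conda", "run", "-n", env_name, "python", script, *args]


command = conda_run_command("my-rdkit-env", "my_script.py", "ligands.sdf")
# subprocess.run(command, check=True)  # needs conda on PATH and the env to exist
```

This keeps the advice from the thread intact: the script still runs entirely inside one environment, it is just started from outside it.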
Re: [Rdkit-discuss] Applying Lipinsky filter on ligand data set
Great, I'm glad it works for you now.

As for the files that don't work, you could try loading them individually to look into them, or save the molecules again. If you could share the molecules here, maybe someone could find out what the problem is. (I'd recommend starting a new thread for it.)

All the best,
Gustavo.

--
Gustavo Seabra

From: Jeff Saxon
Sent: Wednesday, December 2, 2020 9:37:01 AM
To: Gustavo Seabra ; rdkit-discuss@lists.sourceforge.net
Subject: Re: [Rdkit-discuss] Applying Lipinsky filter on ligand data set

Thank you again, Gustavo!

Here is how I adapted your script for multiple SDF files. Note that I added directly to the script a new dataframe called 'all', to which I append each of the dataframes produced by your function in a FOR loop. I also added a TRY statement within the FOR loop to skip the two SDF files that caused a problem. However, I have no idea why they don't work (these are 2 files out of 1000, and they look fine in PyMOL!)

    import subprocess, os, glob, shutil, sys
    import pandas as pd
    from rdkit import Chem, DataStructs
    from rdkit.Chem import Draw, PandasTools, Descriptors, rdMolDescriptors, AllChem
    from IPython.display import HTML

    # the main function
    def load_sdf_file(file, key):
        """
        Reads molecules from an SDF file keeping only molecules
        with valid SMILES, and assign a source field
        """
        df = PandasTools.LoadSDF(file)
        df['LIGAND'] = key
        #df['SMILES'] = df['ROMol'].apply(Chem.MolToSmiles)
        df['LogP'] = df['ROMol'].apply(Chem.Descriptors.MolLogP)
        df['MolWt'] = df['ROMol'].apply(Chem.Descriptors.MolWt)
        df['HBA'] = df['ROMol'].apply(Chem.rdMolDescriptors.CalcNumLipinskiHBA)
        df['HBD'] = df['ROMol'].apply(Chem.rdMolDescriptors.CalcNumLipinskiHBD)
        df = df[['LIGAND','LogP','MolWt','HBA','HBD']]
        return df

    pwd = os.getcwd()
    filles='sdf'
    results='results'
    #set directory to analyse
    data = os.path.join(pwd,filles)
    #set directory with outputs
    results = os.path.join(pwd,results)
    os.chdir(data)

    all = pd.DataFrame()
    for sdf in dirlist:
        try:
            sdf_name=sdf.rsplit( ".", 1 )[ 0 ]
            key = f'{sdf_name}'
            df = load_sdf_file(sdf,key)
            all = all.append(df,ignore_index = True)
            print(f'{sdf_name}.sdf has been processed')
        except:
            print(f'{sdf_name}.sdf has not been processed')
            # make a log of broken sdf filles
            with open(results+"/log.txt", "a") as log:
                log.write("%s has not been processed\n" %(key))

On Wed, Dec 2, 2020 at 13:55, Gustavo Seabra wrote:
>
> Yes, the way it is written it will only keep the last sdf file read. I can
> think of 2 options:
>
> 1. You can concatenate all sdfs into one, multi-molecule file:
>     $ cat *.sdf > multi.sdf
>
> And read this one.
>
> 2. Alternatively, instead of overwriting the final pandas dataframe every
> time, you can create one initial df then only concatenate it with the results
> of the function (see
> https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html)
>
>     data = pd.DataFrame(columns=['Source','LogP','MolWt','LipinskyHBA','LipinskyHBD'])
>
> Then, for each file:
>     data = data.append(load_sdf_file(sdf,key))
>
> If possible, I believe option (1) should be faster.
>
> As for the error you are seeing, sometimes RDKit cannot read a molecule, so
> it returns no 'ROMol' object. It usually happens when the molecule is
> ill-defined. If you really need to read the molecules one-by-one, then you
> will need to treat this situation maybe with an 'if' statement in the
> function. If you read a multi-molecule sdf, it just ignores the molecules it
> can't read and keeps going.
>
> Ah, I don't think there is a function to use pdb files with Pandas. SDF is a
> better format for small molecules, anyway.
>
> All the best,
>
> --
> Gustavo Seabra
>
> From: Jeff Saxon
> Sent: Wednesday, December 2, 2020 4:53:05 AM
> To: Gustavo Seabra ;
> rdkit-discuss@lists.sourceforge.net
> Subject: Re: [Rdkit-discuss] Applying Lipinsky filter on ligand data set
>
> Hey Gustavo,
>
> Thank you very much for your script!
> I need to specify that I am working with many SDF files, each of
> which consists of one 3D structure of the ligand (I don't see any
> difference here from pdb, so if I could apply it to PDB directly it
> would be even better!!)
> Anyway I've just tried to adapt your script for my case
>
>     # I simplify the function to take only 4 properties required for
>     # lipinsky calculations,
>     # I also substitute Source on the name of the particular SDF file (See below)
>     def load_sdf_file(file, key):
>         """
>         Reads molecules from an SDF file keeping only molecules
>         with valid SMILES, and assign a source field
>         """
>         df = PandasTools.LoadSDF(file)
>         df['Source'] = key
>         df['LogP'] = df['ROM
Re: [Rdkit-discuss] activate my-rdkit-env from python script
I don't believe that is possible. You have to run your script from within the environment where you installed RDKit. What I actually do is keep one work environment and install all the packages I need in that same env.

--
Gustavo Seabra

From: Jeff Saxon
Sent: Wednesday, December 2, 2020 6:48:47 AM
To: rdkit-discuss@lists.sourceforge.net
Subject: [Rdkit-discuss] activate my-rdkit-env from python script

Dear All,

Since I installed RDKit using conda, I have to use the following command from my bash terminal to activate the RDKit environment:

conda activate my-rdkit-env

How can I do the same from inside my Python script? I have already tried calling subprocess, but it did not work:

# source environment from python script
subprocess.run('conda init bash', shell=True)
subprocess.run('conda activate my-rdkit-env', shell=True)

___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
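A note on why the subprocess attempt in this thread has no effect: each subprocess.run(..., shell=True) call starts its own shell, which exits as soon as the command finishes, so an activation done there never reaches the Python process that is already running. One possible workaround, sketched here only as an illustration, is to start the script through `conda run`, which executes a command inside a named environment. The helper names and the script name my_script.py are made up for this example, and `conda run` is only available in reasonably recent conda releases.

```python
import shutil
import subprocess


def conda_run_command(env_name, script, *args):
    """Build the argument list for `conda run -n <env> python <script> ...`."""
    return ["conda", "run", "-n", env_name, "python", script, *args]


def run_in_env(env_name, script, *args):
    """Run `script` with the interpreter of the conda env `env_name`.

    Raises RuntimeError if no `conda` executable is found on PATH.
    """
    if shutil.which("conda") is None:
        raise RuntimeError("conda was not found on PATH")
    # check=True raises CalledProcessError if the script exits with an error
    return subprocess.run(conda_run_command(env_name, script, *args), check=True)


# Hypothetical usage (not executed here):
# run_in_env("my-rdkit-env", "my_script.py")
```

This keeps the advice given above intact: the RDKit code still runs inside the environment where RDKit is installed; only the launcher lives outside it.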
Re: [Rdkit-discuss] Applying Lipinsky filter on ligand data set
Yes, the way it is written it will only keep the last sdf file read. I can think of 2 options:

1. You can concatenate all sdfs into one, multi-molecule file:

$ cat *.sdf > multi.sdf

And read this one.

2. Alternatively, instead of overwriting the final pandas dataframe every time, you can create one initial df then only concatenate it with the results of the function (see https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html)

data = pd.DataFrame(columns=['Source','LogP','MolWt','LipinskyHBA','LipinskyHBD'])

Then, for each file:

data = data.append(load_sdf_file(sdf,key))

If possible, I believe option (1) should be faster.

As for the error you are seeing, sometimes RDKit cannot read a molecule, so it returns no 'ROMol' object. It usually happens when the molecule is ill-defined. If you really need to read the molecules one-by-one, then you will need to treat this situation, maybe with an 'if' statement in the function. If you read a multi-molecule sdf, it just ignores the molecules it can't read and keeps going.

Ah, I don't think there is a function to use pdb files with Pandas. SDF is a better format for small molecules, anyway.

All the best,
--
Gustavo Seabra

From: Jeff Saxon
Sent: Wednesday, December 2, 2020 4:53:05 AM
To: Gustavo Seabra; rdkit-discuss@lists.sourceforge.net
Subject: Re: [Rdkit-discuss] Applying Lipinsky filter on ligand data set

Hey Gustavo,

Thank you very much for your script! I need to specify that I am working with many SDF files, each of which consists of one 3D structure of the ligand (I don't see any difference from PDB here, so if I could apply it to PDB files directly that would be even better!).
Anyway, I've just tried to adapt your script to my case:

import glob, os
from rdkit import Chem
from rdkit.Chem import PandasTools, Descriptors, rdMolDescriptors

# I simplified the function to take only the 4 properties required for Lipinski calculations.
# I also replaced Source with the name of the particular SDF file (see below).
def load_sdf_file(file, key):
    """
    Reads molecules from an SDF file keeping only molecules
    with valid SMILES, and assign a source field
    """
    df = PandasTools.LoadSDF(file)
    df['Source'] = key
    df['LogP'] = df['ROMol'].apply(Chem.Descriptors.MolLogP)
    df['MolWt'] = df['ROMol'].apply(Chem.Descriptors.MolWt)
    df['LipinskyHBA'] = df['ROMol'].apply(Chem.rdMolDescriptors.CalcNumLipinskiHBA)
    df['LipinskyHBD'] = df['ROMol'].apply(Chem.rdMolDescriptors.CalcNumLipinskiHBD)
    df = df[['Source','LogP','MolWt','LipinskyHBA','LipinskyHBD']]
    return df

pwd = os.getcwd()
filles = 'sdf'
results = 'results'
# set directory to analyse
data = os.path.join(pwd, filles)
# set directory with outputs
results = os.path.join(pwd, results)
# go to the folder with all SDF files
os.chdir(data)

# dirlist is not defined in the posted snippet; assumed here to be the SDF files in `data`
dirlist = sorted(glob.glob('*.sdf'))

# loop over each SDF and pass it to the function
for sdf in dirlist:
    sdf_name = sdf.rsplit(".", 1)[0]
    key = f'{sdf_name}'
    df = load_sdf_file(sdf, key)
    print(f'{sdf_name}.sdf has been processed')

The problem is that df always ends up holding only the last file processed, whereas I need to append the results of every processed SDF file.
I also got an error on one of the SDF files, which interrupted the script:

Traceback (most recent call last):
  File "./lipinski2.py", line 67, in <module>
    df = load_sdf_file(sdf,key)
  File "./lipinski2.py", line 26, in load_sdf_file
    df['LogP'] = df['ROMol'].apply(Chem.Descriptors.MolLogP)
  File "/Users/gleb/opt/miniconda3/envs/my-rdkit-env/lib/python3.7/site-packages/pandas/core/frame.py", line 2906, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/Users/gleb/opt/miniconda3/envs/my-rdkit-env/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2897, in get_loc
    raise KeyError(key) from err
KeyError: 'ROMol'

Probably some additional IF statement is required to skip the file in the case of a "broken" SDF...

On Tue, Dec 1, 2020 at 19:07, Gustavo Seabra wrote:
>
> Hi Jeff,
>
> There's a lot of people here with way more experience than me, so this may not
> be the optimal solution... But here is what I would do in this case:
>
> from rdkit import Chem, DataStructs
> from rdkit.Chem import Draw, PandasTools, Descriptors, rdMolDescriptors
> from IPython.display import HTML
>
> def load_sdf_file(file,source,id_column):
>     """
>     Reads molecules from an SDF file keeping only molecules
>     with valid SMILES, and assign a source field
>     """
>     df = PandasTools.LoadSDF(file)
>     df['Source'] = source
>     df['ID'] = df[id_column]
>     df['SMILES'] = df['ROMol'].apply(Chem.MolToSmiles)
>     df['LogP'] = df['ROMol'].apply(Chem.Descriptors.MolLogP)
>     df['MolWt'] = df['ROMol'].apply(Chem.Descriptors.MolWt)
>     df['LipinskyHBA'] = df['ROMol'].apply(Chem.rdMolDescriptors.CalcNumLipinskiHBA)
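The two problems raised in this thread (only the last file surviving the loop, and one unreadable file stopping the whole run) can be handled together by collecting the per-file results in a list and recording failures instead of letting them propagate. The sketch below shows only that control flow, with the loader passed in as a callable; `collect_tables` and the fake loader in the usage note are hypothetical names, not part of RDKit or pandas.

```python
import os


def collect_tables(paths, load_one):
    """Load every file in `paths` and return (tables, failures).

    `load_one(path, key)` is any callable returning one table per file,
    for example a wrapper around PandasTools.LoadSDF (an assumption here).
    """
    tables, failures = [], []
    for path in paths:
        # use the file name without directory and extension as the key
        key = os.path.splitext(os.path.basename(path))[0]
        try:
            tables.append(load_one(path, key))
        except (KeyError, ValueError, OSError) as err:
            # KeyError: 'ROMol' is the failure shown in the traceback above
            failures.append((key, repr(err)))
    return tables, failures
```

With pandas, the collected tables could then be joined once at the end, for instance with pd.concat(tables, ignore_index=True); DataFrame.append, as used in the 2020 messages, was removed in pandas 2.0. Catching only the exception types you expect also avoids hiding unrelated bugs behind a bare except.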
Re: [Rdkit-discuss] Partial substructure match?
Thank you so much! What I ended up doing follows the same basic idea, although not even close to the level of detail you put in your program. I'm only comparing the structures in pairs, and doing the following. (Sorry for the mess; it's part of a larger system and I just copied the relevant parts.)

from rdkit import Chem
from rdkit.Chem import rdFMCS
from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold_matching(query_smi, scaff_smi):
    """
    Checks if the scaffold from scaff_smi is contained in the query_smi.
    Uses a stringent scaffold test.
    """
    sca = Chem.MolFromSmiles(scaff_smi)
    que = Chem.MolFromSmiles(query_smi)
    match = 0
    if que is not None:
        # fraction of scaffold atoms found in the maximum common substructure
        maxMatch = sca.GetNumAtoms()
        match = rdFMCS.FindMCS([sca,que],
                               atomCompare=rdFMCS.AtomCompare.CompareAny,
                               bondCompare=rdFMCS.BondCompare.CompareOrder,
                               ringMatchesRingOnly=True,
                               completeRingsOnly=True,
                               ).numAtoms / maxMatch
    return match

if __name__ == "__main__":
    template_smiles=
    query_smiles=
    template_mol = Chem.MolFromSmiles(template_smiles)
    core = MurckoScaffold.GetScaffoldForMol(template_mol)
    scaffold = Chem.MolToSmiles(core)
    match = scaffold_matching(query_smiles,scaffold)

--
Gustavo Seabra

From: Andrew Dalke
Sent: Monday, November 23, 2020 7:59 AM
To: Gustavo Seabra
Cc: rdkit-discuss@lists.sourceforge.net
Subject: Re: [Rdkit-discuss] Partial substructure match?

On Nov 19, 2020, at 17:48, Gustavo Seabra <gustavo.sea...@gmail.com> wrote:

  Is it possible to search for *partial* substructure matches using RDKit? ... For example, if the pattern is a naphthalene and the molecule to search has a benzene, that would count as a 60% match.

A number of people pointed out that RDKit's MCS feature might be appropriate. I've attached an example program based around that. For example, the default is your two structures:

% python mcs_search.py
No --query specified, using naphthalene as the default.
No --target or --targets specified, using phenol as the default.
Target_ID: phenol
nAtoms: 7
nBonds: 7
match_nAtoms: 6
match_nBonds: 6
atom_overlap: 0.600
bond_overlap: 0.545
atom_Tanimoto: 0.545
bond_Tanimoto: 0.500

I'll reverse it by specifying the SMILES on the command-line.

% python mcs_search.py --query 'c1ccccc1O' --target 'c1ccc2ccccc2c1'
Target_ID: query
nAtoms: 10
nBonds: 11
match_nAtoms: 6
match_nBonds: 6
atom_overlap: 0.857
bond_overlap: 0.857
atom_Tanimoto: 0.545
bond_Tanimoto: 0.500

The program includes options to configure the FindMCS() parameters. In addition, if chemfp 3.x is installed then some additional features are available, like the following example, which applies the MCS search to all records in ChEBI:

% python mcs_search.py --query 'COC(=O)C1C(OC(=O)c2ccccc2)CC2CCC1N2C' --targets ~/databases/ChEBI_lite.sdf.gz --id-tag 'ChEBI ID'
Target_ID   nAtoms  nBonds  match_nAtoms  match_nBonds  atom_overlap  bond_overlap  atom_Tanimoto  bond_Tanimoto
CHEBI:776   21      24      9             8             0.409         0.333         0.265          0.200
CHEBI:1148  7       6       6             5             0.273         0.208         0.261          0.200
CHEBI:1734  19      21      16            15            0.727         0.625         0.640          0.500
CHEBI:1895  9       9       9             8             0.409         0.333         0.409          0.320
...

On Nov 20, 2020, at 15:56, Gustavo Seabra <gustavo.sea...@gmail.com> wrote:

  Is it possible to get a partial match with substructure search?

No.

Andrew
da...@dalkescientific.com

___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
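The overlap and Tanimoto figures in the output above are consistent with the usual set-style definitions: overlap is the matched count divided by the query count, and Tanimoto is the matched count divided by (query + target - matched). That reading is inferred from the numbers in this message, not from the attached mcs_search.py, which is not reproduced here; the function names below are illustrative only.

```python
def overlap(n_common, n_query):
    """Fraction of the query (atoms or bonds) covered by the common substructure."""
    return n_common / n_query


def tanimoto(n_common, n_query, n_target):
    """Common count divided by the size of the union of query and target."""
    return n_common / (n_query + n_target - n_common)


# Naphthalene query (10 atoms, 11 bonds) against phenol (7 atoms, 7 bonds),
# with a common substructure of 6 atoms and 6 bonds, as in the first run.
print(f"atom_overlap:  {overlap(6, 10):.3f}")       # 0.600
print(f"bond_overlap:  {overlap(6, 11):.3f}")       # 0.545
print(f"atom_Tanimoto: {tanimoto(6, 10, 7):.3f}")   # 0.545
print(f"bond_Tanimoto: {tanimoto(6, 11, 7):.3f}")   # 0.500
```

Swapping query and target changes the overlap (6/7 = 0.857) but leaves the Tanimoto values unchanged, which matches the second run.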
Re: [Rdkit-discuss] Partial substructure match?
Hi Adelene,

Doesn't the substructure match only work for the whole substructure, as an all-or-nothing? I suppose I could use the MCSS and count the number of matching atoms, then calculate the percentage match myself. Is it possible to get a partial match with substructure search?

Gustavo.
--
Gustavo Seabra

From: Adelene LAI
Sent: Friday, November 20, 2020 9:13:15 AM
To: Dan Nealschneider; Gustavo Seabra
Cc: RDKit Discuss
Subject: Re: [Rdkit-discuss] Partial substructure match?

Hi Dan and Gustavo,

MCSS sounds good, but it depends on the goal.

From the way Gustavo wrote, it sounds like a Query-Target substructure search: he has a list of targets and one specific query, and he wants to compare the matching rate amongst the members of the list.

If so, I would try query SMARTS.
https://www.rdkit.org/docs/GettingStartedInPython.html#substructure-searching

Regarding the % substructure match, interesting question. How would you quantify that? Not sure such a thing exists in RDKit right now.

Adelene

Doctoral Researcher
Environmental Cheminformatics
UNIVERSITÉ DU LUXEMBOURG
Campus Belval | Luxembourg Centre for Systems Biomedicine
6, avenue du Swing, L-4367 Belvaux
T +356 46 66 44 67 18
adelenelai

From: Dan Nealschneider
Sent: Thursday, November 19, 2020 6:01:37 PM
To: Gustavo Seabra
Cc: RDKit Discuss
Subject: Re: [Rdkit-discuss] Partial substructure match?

Gustavo -

That sounds like the "maximum common substructure" problem. Here's the relevant section in RDKit's "Getting started in Python":
https://www.rdkit.org/docs/GettingStartedInPython.html#maximum-common-substructure

dan nealschneider | lead developer

On Thu, Nov 19, 2020 at 8:50 AM Gustavo Seabra <gustavo.sea...@gmail.com> wrote:

Hi all,

Is it possible to search for *partial* substructure matches using RDKit?
I'm aware of "HasSubstructMatch/ GetSubstructMatch", but my impression is that it only returns full matches (100%) of the required pattern in a structure. However, what I'd like to do is a bit different: Imagine I have one specific substructure (scaffold), and I'd like to search for molecules that have the full substructure *or part of it*, and maybe get the percentage of the substructure match? (100% = the full substructure is contained in the molecule). For example, if the pattern is a naphthalene and the molecule to search has a benzene, that would count as a 60% match. Is there a way to do that in RDKit? Thanks a lot! -- Gustavo Seabra ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net<mailto:Rdkit-discuss@lists.sourceforge.net> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
[Rdkit-discuss] Partial substructure match?
Hi all, Is it possible to search for *partial* substructure matches using RDKit? I'm aware of "HasSubstructMatch/ GetSubstructMatch", but my impression is that it only returns full matches (100%) of the required pattern in a structure. However, what I'd like to do is a bit different: Imagine I have one specific substructure (scaffold), and I'd like to search for molecules that have the full substructure *or part of it*, and maybe get the percentage of the substructure match? (100% = the full substructure is contained in the molecule). For example, if the pattern is a naphthalene and the molecule to search has a benzene, that would count as a 60% match. Is there a way to do that in RDKit? Thanks a lot! -- Gustavo Seabra ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI Key
Sure, here it is:

1. The question:

> I noticed that compounds that differ only on the cis-trans isomerization
> around an sp2 nitrogen get the same InChI Key from RDKit. For example:
>
> inchi_cis = Chem.inchi.MolToInchiKey(Chem.MolFromSmiles("C/N=C(/NC#N)NCCSCc1nc[nH]c1C"))
> inchi_cis
> 'AQIXAKUUQRKLND-UHFFFAOYSA-N'
>
> inchi_trans = Chem.inchi.MolToInchiKey(Chem.MolFromSmiles("C/N=C(\\NC#N)NCCSCc1nc[nH]c1C"))
> inchi_trans
> 'AQIXAKUUQRKLND-UHFFFAOYSA-N'
>
> inchi_cis == inchi_trans
> True
>
> I wonder if this is a limitation of the InChI Key definition, or an
> implementation issue.

2. The answer to the question, in the end, was that the InChI Keys were behaving as intended, by design, as pointed out by Igor Pletnev:

> though InChI is not perfect, in this case it behaves as intended.
> Please see below.
>
> The discussed molecules contain substituted guanidine fragment
> (RHN)C(=NMe)(NHR')
>
> It is subjected to tautomerism, and in different tautomers different C-N
> bonds have double order:
> (RHN)C(=NMe)(NHR')
> (RHN)C(NHMe)(=NR')
> (RN=)C(NHMe)(NHR')
>
> You generated Standard InChI, which is evidenced by "InChI=1S/" prefix in
> the examples.
> Standard InChI is specifically designed to produce the same identifier for
> all tautomers (by indicating that two hydrogens are shared by three
> nitrogen atoms, for any tautomer; bond orders are not indicated in InChI).
>
> As the tautomer-invariant Std InChI does not know which C-N bond is
> actually a double, there is the only option for treating stereo -- to
> completely ignore it as a drawing artifact.
>
> All in all:
> Standard InChI means that the exact tautomeric form is unknown ==> all
> tautomers are mapped to the same generic representation ==> the exact C-N
> double bond placement in this generic is unspecified ==> C-N double bond
> stereo is ignored ==> generated StdInChI and Std InChIKey are the same for
> seemingly different, by initial drawing, cis/trans forms.
>
> Once again, this behavior is by design; it is intended for maximal
> interoperability while comparing different drawings of the "same" compound.
>
> If, for any reason, you would like to consider your examples as the
> definite and resolvable structures, each having its own identifier, just
> use non-Standard InChI.
> The InChI which preserves the exact positions of tautomeric H's and double
> bond ("as drawn") is produced by just specifying option /FixedH upon
> generation.
>
> More on this may be found in InChI FAQ:
> https://www.inchi-trust.org/technical-faq-2/

3. The only question remaining was how to use this "/FixedH" option in RDKit, and that was answered by Paolo Tosco:

> you can pass InChI options to the underlying InChI API through the options parameter
> of Chem.inchi.MolToInchi() and Chem.inchi.MolToInchiKey(); e.g.:
>
> inchi.MolToInchi(mol, options="/FixedH")
>
> Source:
> https://www.rdkit.org/docs/source/rdkit.Chem.inchi.html?highlight=inchi#rdkit.Chem.inchi.MolBlockToInchi

And this is what I'm using now to remove duplicate molecules from my database. I'm using a Pandas DataFrame and, with the more recent versions of Pandas, the following works fine (progress_apply comes from tqdm, so tqdm.pandas() must have been called first; plain apply takes the same arguments):

df['InChI Key'] = df[mol_col].progress_apply(Chem.MolToInchiKey, options="/FixedH")
df.drop_duplicates(subset=['InChI Key'], keep='first', inplace=True)

All the best,
--
Gustavo Seabra.

On Fri, Oct 30, 2020 at 4:47 AM Adelene LAI wrote:

> Hi Gustavo,
>
> Looks like you found a solution for your deduplication task. Would you
> mind sharing it with us? (Seems some emails in the chain are missing.)
>
> I'm curious - returning to your original question, did we figure out why
> the same InChIKey was given for the stereoisomers?
> > > Adelene > > > Doctoral Researcher > > Environmental Cheminformatics > > UNIVERSITÉ DU LUXEMBOURG > > > LUXEMBOURG CENTRE FOR SYSTEMS BIOMEDICINE > > 6, avenue du Swing, L-4367 Belvaux > > T +356 46 66 44 67 18 > > [image: github.png] adelenelai > > > > > > -- > *From:* Gustavo Seabra > *Sent:* Thursday, October 29, 2020 10:23:20 PM > *To:* Paolo Tosco > *Cc:* Igor Pletnev; RDKit Discuss > *Subject:* Re: [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI Key > > Aha! Fantastic! > > Thanks a lot!! > Gustavo. > > -- > Gustavo Seabra > > -- > *From:* Paolo Tosco > *Sent:* Thursday, October 29, 2020 5:13:33 PM > *To:* Gustavo Seabra > *Cc:* Igor Pletnev ; RDKit Discuss < > rdkit-discuss@lists.sourceforge.net> > *Subject:* Re: [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI Key > > Hi Gusta
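For readers who are not working in pandas, the de-duplication step summarized in this message can be sketched in plain Python. This is only an illustration of the idea (keep the first record for each key): `deduplicate` and `fixedh_inchikey` are hypothetical helper names, the RDKit import is done lazily so the first helper can be used without RDKit, and whether a non-standard key separates a given pair of structures should be checked on your own data.

```python
def deduplicate(items, key_func):
    """Keep the first item seen for each key, preserving input order."""
    seen = set()
    unique = []
    for item in items:
        key = key_func(item)
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


def fixedh_inchikey(smiles):
    """Non-standard InChIKey generated with the /FixedH option from this thread."""
    from rdkit import Chem  # imported here so deduplicate() works without RDKit

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"RDKit could not parse {smiles!r}")
    return Chem.MolToInchiKey(mol, options="/FixedH")


# Hypothetical usage with the two SMILES from the original question:
# smiles = ["C/N=C(/NC#N)NCCSCc1nc[nH]c1C", "C/N=C(\\NC#N)NCCSCc1nc[nH]c1C"]
# unique = deduplicate(smiles, fixedh_inchikey)
```

Passing the key function as an argument makes it easy to compare key choices discussed in the thread (standard InChIKey, /FixedH InChIKey, canonical SMILES) on the same list of molecules.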
Re: [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI Key
Aha! Fantastic! Thanks a lot!! Gustavo. -- Gustavo Seabra From: Paolo Tosco Sent: Thursday, October 29, 2020 5:13:33 PM To: Gustavo Seabra Cc: Igor Pletnev ; RDKit Discuss Subject: Re: [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI Key Hi Gustavo, you can pass InChI options to the underlying InChI API through the options parameter of Chem.inchi.MolToInchi() and Chem.inchi.MolToInchiKey(); e.g.: inchi.MolToInchi(mol, options="/FixedH") Source: https://www.rdkit.org/docs/source/rdkit.Chem.inchi.html?highlight=inchi#rdkit.Chem.inchi.MolBlockToInchi Cheers, p. On Thu, Oct 29, 2020 at 9:42 PM Gustavo Seabra mailto:gustavo.sea...@gmail.com>> wrote: Ok, thanks! -- Gustavo Seabra. On Thu, Oct 29, 2020 at 4:33 PM Igor Pletnev mailto:igor.plet...@gmail.com>> wrote: > Is this "/FixedH" an option in RDKit? How to use that? (I don't see it in > the docs). Sorry, I am not so proficient in RDKit and can not answer exactly. Anyway, this option is available in InChI API calls, and I am pretty sure that it is also available in RDKit. I recall that couple of years ago, on some InChI event, Greg Landrum somewhat surprised me by saying that he himself often uses non-Standard InChI instead of Standard one — exactly to distinguish tautomers. So I guess Greg can answer on how it is arranged in RDKit. Regards, Igor On Thu, 29 Oct 2020 at 23:03, Gustavo Seabra mailto:gustavo.sea...@gmail.com>> wrote: That does make sense, I understand it now, thanks! Is this "/FixedH" an option in RDKit? How to use that? (I don't see it in the docs). Thanks, -- Gustavo Seabra. On Wed, Oct 28, 2020 at 6:10 PM Igor Pletnev mailto:igor.plet...@gmail.com>> wrote: Hi Gustavo, > ... I was generating the InChI Keys to get a unique hash for each compound, > thinking it would be better than SMILES (guaranteed to be unique), but is > clearly not the case. On the bright side, I won't lose time generating > InChIs... though InChI is not perfect, in this case it behaves as intended. Please see below. 
The discussed molecules contain substituted guanidine fragment (RHN)C(=NMe)(NHR') It is subjected to tautomerism, and in different tautomers different C-N bonds have double order: (RHN)C(=NMe)(NHR') (RHN)C(NHMe)(=NR') (RN=)C(NHMe)(NHR') You generated Standard InChI, which is evidenced by "InChI=1S/" prefix in the examples. Standard InChI is specifically designed to produce the same identifier for all tautomers (by indicating that two hydrogens are shared by three nitrogen atoms, for any tautomer; bond orders are not indicated in InChI). As the tautomer-invariant Std InChI does not know which C-N bond is actually a double, there is the only option for treating stereo -- to completely ignore it as a drawing artifact. All in all: Standard InChI means that the exact tautomeric form is unknown ==> all tautomers are mapped to the same generic representation ==> the exact C-N double bond placement in this generic is unspecified ==> C-N double bond stereo is ignored ==> generated StdInChI and Std InChIKey are the same for seemingly different, by initial drawing, cis/trans forms. Once again, this behavior is by design; it is intended for maximal interoperability while comparing different drawings of the "same" compound. If, for any reason, you would like to consider your examples as the definite and resolvable structures, each having its own identifier, just use non-Standard InChI. The InChI which preserves the exact positions of tautomeric H's and double bond ("as drawn") is produced by just specifying option /FixedH upon generation. More on this may be found in InChI FAQ: https://www.inchi-trust.org/technical-faq-2/ Hope this helps. Regards, Igor On Mon, Oct 26, 2020 at 6:56 PM Gustavo Seabra mailto:gustavo.sea...@gmail.com>> wrote: Thanks a lot Peter and Adelene, Yes, it looks like canonical SMILES is the way to go, and I have no problem sticking with RDKit. 
I was generating the InChI Keys to get a unique hash for each compound, thinking it would be better than SMILES (guaranteed to be unique), but is clearly not the case. On the bright side, I won't lose time generating InChIs... Can I trust that the same molecule will always get the same canonical SMILES from RDKit, independent of how it is read? (Different SDF files, geometries, atom orders, etc.?) All the best, Gustavo. -- Gustavo Seabra. On Sun, Oct 25, 2020 at 8:27 PM Peter S. Shenkin mailto:shen...@gmail.com>> wrote: Canonical SMILES is probably the way to go, but you might also be able to use the InchiKey and the Inchi auxiliary information together as a compound hash key. -P. On Sun, Oct 25, 2020 at 10:53 AM Adelene LAI mailto:adelene@uni.lu>> wrote: Hi Gustavo, (Sorry, forgot to reply all before...) Your deduplication task is quite fa
Re: [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI Key
Ok, thanks! -- Gustavo Seabra. On Thu, Oct 29, 2020 at 4:33 PM Igor Pletnev wrote: > > Is this "/FixedH" an option in RDKit? How to use that? (I don't see it > in the docs). > > Sorry, I am not so proficient in RDKit and can not answer exactly. Anyway, > this option is available in InChI API calls, and I am pretty sure that it > is also available in RDKit. > > I recall that couple of years ago, on some InChI event, Greg Landrum > somewhat surprised me by saying that he himself often uses non-Standard > InChI instead of Standard one — exactly to distinguish tautomers. > So I guess Greg can answer on how it is arranged in RDKit. > > Regards, > Igor > > > > > > On Thu, 29 Oct 2020 at 23:03, Gustavo Seabra > wrote: > >> That does make sense, I understand it now, thanks! >> >> Is this "/FixedH" an option in RDKit? How to use that? (I don't see it in >> the docs). >> >> Thanks, >> -- >> Gustavo Seabra. >> >> >> On Wed, Oct 28, 2020 at 6:10 PM Igor Pletnev >> wrote: >> >>> Hi Gustavo, >>> >>> > ... I was generating the InChI Keys to get a unique hash for each >>> compound, thinking it would be better than SMILES (guaranteed to be >>> unique), but is clearly not the case. On the bright side, I won't lose time >>> generating InChIs... >>> >>> though InChI is not perfect, in this case it behaves as intended. >>> Please see below. >>> >>> The discussed molecules contain substituted guanidine fragment >>> (RHN)C(=NMe)(NHR') >>> >>> It is subjected to tautomerism, and in different tautomers different C-N >>> bonds have double order: >>> (RHN)C(=NMe)(NHR') >>> (RHN)C(NHMe)(=NR') >>> (RN=)C(NHMe)(NHR') >>> >>> You generated Standard InChI, which is evidenced by "InChI=1S/" prefix >>> in the examples. >>> Standard InChI is specifically designed to produce the same identifier >>> for all tautomers (by indicating that two hydrogens are shared by three >>> nitrogen atoms, for any tautomer; bond orders are not indicated in InChI). 
>>> >>> As the tautomer-invariant Std InChI does not know which C-N bond is >>> actually a double, there is the only option for treating stereo -- to >>> completely ignore it as a drawing artifact. >>> >>> All in all: >>> Standard InChI means that the exact tautomeric form is unknown ==> all >>> tautomers are mapped to the same generic representation ==> the exact C-N >>> double bond placement in this generic is unspecified ==> C-N double bond >>> stereo is ignored ==> generated StdInChI and Std InChIKey are the same for >>> seemingly different, by initial drawing, cis/trans forms. >>> >>> Once again, this behavior is by design; it is intended for maximal >>> interoperability while comparing different drawings of the "same" compound. >>> >>> If, for any reason, you would like to consider your examples as the >>> definite and resolvable structures, each having its own identifier, just >>> use non-Standard InChI. >>> The InChI which preserves the exact positions of tautomeric H's and >>> double bond ("as drawn") is produced by just specifying option /FixedH upon >>> generation. >>> >>> More on this may be found in InChI FAQ: >>> https://www.inchi-trust.org/technical-faq-2/ >>> >>> Hope this helps. >>> >>> Regards, >>> Igor >>> >>> >>> >>> On Mon, Oct 26, 2020 at 6:56 PM Gustavo Seabra >>> wrote: >>> >>>> Thanks a lot Peter and Adelene, >>>> >>>> Yes, it looks like canonical SMILES is the way to go, and I have no >>>> problem sticking with RDKit. I was generating the InChI Keys to get a >>>> unique hash for each compound, thinking it would be better than SMILES >>>> (guaranteed to be unique), but is clearly not the case. On the bright side, >>>> I won't lose time generating InChIs... >>>> >>>> Can I trust that the same molecule will always get the same canonical >>>> SMILES from RDKit, independent of how it is read? (Different SDF files, >>>> geometries, atom orders, etc.?) >>>> >>>> All the best, >>>> Gustavo. >>>> >>>> >>>> -- >>>> Gustavo Seabra. 
>>>> >>>> >>>> On Sun, Oct 25, 2020 at 8:27
Re: [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI Key
That does make sense, I understand it now, thanks! Is this "/FixedH" an option in RDKit? How to use that? (I don't see it in the docs). Thanks, -- Gustavo Seabra. On Wed, Oct 28, 2020 at 6:10 PM Igor Pletnev wrote: > Hi Gustavo, > > > ... I was generating the InChI Keys to get a unique hash for each > compound, thinking it would be better than SMILES (guaranteed to be > unique), but is clearly not the case. On the bright side, I won't lose time > generating InChIs... > > though InChI is not perfect, in this case it behaves as intended. > Please see below. > > The discussed molecules contain substituted guanidine fragment > (RHN)C(=NMe)(NHR') > > It is subjected to tautomerism, and in different tautomers different C-N > bonds have double order: > (RHN)C(=NMe)(NHR') > (RHN)C(NHMe)(=NR') > (RN=)C(NHMe)(NHR') > > You generated Standard InChI, which is evidenced by "InChI=1S/" prefix in > the examples. > Standard InChI is specifically designed to produce the same identifier for > all tautomers (by indicating that two hydrogens are shared by three > nitrogen atoms, for any tautomer; bond orders are not indicated in InChI). > > As the tautomer-invariant Std InChI does not know which C-N bond is > actually a double, there is the only option for treating stereo -- to > completely ignore it as a drawing artifact. > > All in all: > Standard InChI means that the exact tautomeric form is unknown ==> all > tautomers are mapped to the same generic representation ==> the exact C-N > double bond placement in this generic is unspecified ==> C-N double bond > stereo is ignored ==> generated StdInChI and Std InChIKey are the same for > seemingly different, by initial drawing, cis/trans forms. > > Once again, this behavior is by design; it is intended for maximal > interoperability while comparing different drawings of the "same" compound. 
> > If, for any reason, you would like to consider your examples as the > definite and resolvable structures, each having its own identifier, just > use non-Standard InChI. > The InChI which preserves the exact positions of tautomeric H's and double > bond ("as drawn") is produced by just specifying option /FixedH upon > generation. > > More on this may be found in InChI FAQ: > https://www.inchi-trust.org/technical-faq-2/ > > Hope this helps. > > Regards, > Igor > > > > On Mon, Oct 26, 2020 at 6:56 PM Gustavo Seabra > wrote: > >> Thanks a lot Peter and Adelene, >> >> Yes, it looks like canonical SMILES is the way to go, and I have no >> problem sticking with RDKit. I was generating the InChI Keys to get a >> unique hash for each compound, thinking it would be better than SMILES >> (guaranteed to be unique), but is clearly not the case. On the bright side, >> I won't lose time generating InChIs... >> >> Can I trust that the same molecule will always get the same canonical >> SMILES from RDKit, independent of how it is read? (Different SDF files, >> geometries, atom orders, etc.?) >> >> All the best, >> Gustavo. >> >> >> -- >> Gustavo Seabra. >> >> >> On Sun, Oct 25, 2020 at 8:27 PM Peter S. Shenkin >> wrote: >> >>> Canonical SMILES is probably the way to go, but you might also be able >>> to use the InchiKey and the Inchi auxiliary information together as a >>> compound hash key. >>> >>> -P. >>> >>> On Sun, Oct 25, 2020 at 10:53 AM Adelene LAI wrote: >>> >>>> Hi Gustavo, >>>> >>>> >>>> (Sorry, forgot to reply all before...) >>>> >>>> >>>> Your deduplication task is quite familiar to me and something I do >>>> quite a lot of in my own work ;) >>>> >>>> >>>> Can I suggest deduplicating using Canonical SMILES? >>>> >>>> >>>> It doesn't solve your InChIKey issue, but it is a solution for now. 
>>>> >>>> >>>> I updated my gist to show that it is feasible: >>>> >>>> >>>> https://gist.github.com/adelenelai/59a8794e1f030941c19bcb50aa8adf3f >>>> >>>> >>>> <https://gist.github.com/adelenelai/59a8794e1f030941c19bcb50aa8adf3f> >>>> >>>> Adelene >>>> >>>> >>>> >>>> Doctoral Researcher >>>> >>>> Environmental Cheminformatics >>>> >>>> UNIVERSITÉ DU LUXEMBOURG >>>> >>>> >>>> LUXEMBOURG CENTRE FOR SYSTEMS BIOMEDICINE >>>>
Re: [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI Key
Thanks a lot Peter and Adelene, Yes, it looks like canonical SMILES is the way to go, and I have no problem sticking with RDKit. I was generating the InChI Keys to get a unique hash for each compound, thinking it would be better than SMILES (guaranteed to be unique), but is clearly not the case. On the bright side, I won't lose time generating InChIs... Can I trust that the same molecule will always get the same canonical SMILES from RDKit, independent of how it is read? (Different SDF files, geometries, atom orders, etc.?) All the best, Gustavo. -- Gustavo Seabra. On Sun, Oct 25, 2020 at 8:27 PM Peter S. Shenkin wrote: > Canonical SMILES is probably the way to go, but you might also be able to > use the InchiKey and the Inchi auxiliary information together as a compound > hash key. > > -P. > > On Sun, Oct 25, 2020 at 10:53 AM Adelene LAI wrote: > >> Hi Gustavo, >> >> >> (Sorry, forgot to reply all before...) >> >> >> Your deduplication task is quite familiar to me and something I do quite >> a lot of in my own work ;) >> >> >> Can I suggest deduplicating using Canonical SMILES? >> >> >> It doesn't solve your InChIKey issue, but it is a solution for now. 
>> >> >> I updated my gist to show that it is feasible: >> >> >> https://gist.github.com/adelenelai/59a8794e1f030941c19bcb50aa8adf3f >> >> >> <https://gist.github.com/adelenelai/59a8794e1f030941c19bcb50aa8adf3f> >> >> Adelene >> >> >> >> Doctoral Researcher >> >> Environmental Cheminformatics >> >> UNIVERSITÉ DU LUXEMBOURG >> >> >> LUXEMBOURG CENTRE FOR SYSTEMS BIOMEDICINE >> >> 6, avenue du Swing, L-4367 Belvaux >> >> T +356 46 66 44 67 18 >> >> [image: github.png] adelenelai >> >> >> >> >> >> -- >> *From:* Gustavo Seabra >> *Sent:* Sunday, October 25, 2020 2:27:15 PM >> *To:* Adelene LAI >> *Subject:* Re: [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI >> Key >> >> Actually, I was trying to generate all stereoisomers for molecules in a >> database, and filter duplicate molecules by using the InChI Key to detect >> duplicates. But it gives cis/trans isomers on sp2-N the same Key. >> >> Gustavo. >> >> -- >> Gustavo Seabra >> >> -- >> *From:* Adelene LAI >> *Sent:* Sunday, October 25, 2020 1:44:01 AM >> *To:* Gustavo Seabra >> *Subject:* Re: [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI >> Key >> >> >> Hi Gustavo, >> >> >> It occurred to me while swimming yesterday - was there a reason you >> pointed out the hybridisation state of N in your original subject text? >> >> >> Was it just to specify which N to focus on, or did you expect something >> special about sp2 hybridisation wrt InChIKey? >> >> >> Adelene >> >> >> Doctoral Researcher >> >> Environmental Cheminformatics >> >> UNIVERSITÉ DU LUXEMBOURG >> >> >> LUXEMBOURG CENTRE FOR SYSTEMS BIOMEDICINE >> >> 6, avenue du Swing, L-4367 Belvaux >> >> T +356 46 66 44 67 18 >> >> [image: github.png] adelenelai >> >> >> >> >> >> -- >> *From:* Gustavo Seabra >> *Sent:* Saturday, October 24, 2020 5:37:09 AM >> *To:* RDKit Discuss; Adelene LAI >> *Subject:* Re: [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI >> Key >> >> Thanks for looking into it. 
I'm happy to see it wasn't just a mistake by >> me ;-) >> >> I hope we can find what's wrong there. >> >> Best, >> Gustavo. >> >> -- >> Gustavo Seabra >> >> -- >> *From:* Adelene LAI >> *Sent:* Friday, October 23, 2020 11:28:55 PM >> *To:* Gustavo Seabra ; RDKit Discuss < >> rdkit-discuss@lists.sourceforge.net> >> *Subject:* Re: [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI >> Key >> >> >> Hi Gustavo, >> >> >> https://gist.github.com/adelenelai/59a8794e1f030941c19bcb50aa8adf3f >> >> >> In the gist above, I tried doing some further investigating. >> >> >> It seems for the example you gave, the rdkit functions indeed give the >> same inchikey and inchi, but different aux info. >> >> >> Why this different aux info doesn't translate into different inchikeys/inchis, I'm not sure.
Re: [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI Key
Thanks for looking into it. I'm happy to see it wasn't just a mistake by me ;-)

I hope we can find what's wrong there.

Best,
Gustavo.
--
Gustavo Seabra

From: Adelene LAI
Sent: Friday, October 23, 2020 11:28:55 PM
To: Gustavo Seabra; RDKit Discuss
Subject: Re: [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI Key

Hi Gustavo,

https://gist.github.com/adelenelai/59a8794e1f030941c19bcb50aa8adf3f

In the gist above, I tried doing some further investigating.

It seems for the example you gave, the rdkit functions indeed give the same inchikey and inchi, but different aux info.

Why this different aux info doesn't translate into different inchikeys/inchis, I'm not sure.

Adelene

Doctoral Researcher
Environmental Cheminformatics
UNIVERSITÉ DU LUXEMBOURG
LUXEMBOURG CENTRE FOR SYSTEMS BIOMEDICINE
6, avenue du Swing, L-4367 Belvaux
T +356 46 66 44 67 18

From: Gustavo Seabra
Sent: Friday, October 23, 2020 6:43:07 PM
To: RDKit Discuss
Subject: [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI Key

Hi all,

I ran into an issue here, and I'd appreciate your input. I noticed that compounds that differ only in the cis-trans isomerization around an sp2 nitrogen get the same InChI Key from RDKit. For example:

> inchi_cis = Chem.inchi.MolToInchiKey(Chem.MolFromSmiles("C/N=C(/NC#N)NCCSCc1nc[nH]c1C"))
> inchi_cis
'AQIXAKUUQRKLND-UHFFFAOYSA-N'
> inchi_trans = Chem.inchi.MolToInchiKey(Chem.MolFromSmiles("C/N=C(\\NC#N)NCCSCc1nc[nH]c1C"))
> inchi_trans
'AQIXAKUUQRKLND-UHFFFAOYSA-N'
> inchi_cis == inchi_trans
True

I wonder if this is a limitation of the InChI Key definition, or an implementation issue.

Thanks a lot,
--
Gustavo Seabra.
[Rdkit-discuss] Nitrogen sp2 isomers get the same InChI Key
Hi all,

I ran into an issue here, and I'd appreciate your input. I noticed that compounds that differ only in the cis-trans isomerization around an sp2 nitrogen get the same InChI Key from RDKit. For example:

> inchi_cis = Chem.inchi.MolToInchiKey(Chem.MolFromSmiles("C/N=C(/NC#N)NCCSCc1nc[nH]c1C"))
> inchi_cis
'AQIXAKUUQRKLND-UHFFFAOYSA-N'
> inchi_trans = Chem.inchi.MolToInchiKey(Chem.MolFromSmiles("C/N=C(\\NC#N)NCCSCc1nc[nH]c1C"))
> inchi_trans
'AQIXAKUUQRKLND-UHFFFAOYSA-N'
> inchi_cis == inchi_trans
True

I wonder if this is a limitation of the InChI Key definition, or an implementation issue.

Thanks a lot,
--
Gustavo Seabra.
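One way to narrow this down is to look at the key itself. Below is a minimal sketch, assuming the standard InChIKey layout: a 14-letter first block hashing the connectivity, an 8-letter second block hashing the remaining layers (stereochemistry, isotopes and so on), then a standard/non-standard flag, a version letter, and a protonation letter. The helper name `split_inchikey` is made up for illustration; `UHFFFAOY` is the second-block value that keys carry when those remaining layers are empty.

```python
from typing import NamedTuple


class InchiKeyParts(NamedTuple):
    connectivity: str   # 14 letters: hash of the connectivity layers
    other_layers: str   # 8 letters: hash of the remaining layers (stereo, isotopes, ...)
    is_standard: bool   # flag letter 'S' = standard InChI, 'N' = non-standard
    version: str        # 'A' = InChI version 1
    protonation: str    # 'N' = neutral


# Second-block hash produced when the remaining layers are empty.
EMPTY_LAYERS_HASH = "UHFFFAOY"


def split_inchikey(key: str) -> InchiKeyParts:
    """Split a 27-character InChIKey into its blocks (hypothetical helper)."""
    first, second, third = key.split("-")
    if (len(first), len(second), len(third)) != (14, 10, 1):
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    return InchiKeyParts(first, second[:8], second[8] == "S", second[9], third)


parts = split_inchikey("AQIXAKUUQRKLND-UHFFFAOYSA-N")
print(parts)
print("remaining layers empty:", parts.other_layers == EMPTY_LAYERS_HASH)
```

For the key in the message, the second block equals the empty-layer value. That suggests the double-bond stereo never made it into the InChI string in the first place, rather than being lost when the string was hashed into a key.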
Re: [Rdkit-discuss] Converting csv/xls file containing SMILES to .sdf
You can open the CSV file directly in Schrodinger's Maestro. The free version can open CSV files.
--
Gustavo Seabra

From: ITS RDC
Sent: Thursday, May 28, 2020 9:11:42 AM
To: RDKit Discuss
Subject: [Rdkit-discuss] Converting csv/xls file containing SMILES to .sdf

Hi all,

I have a list of compounds whose topological and molecular properties I want to compute so that I can build a QSAR model. I have over a hundred compounds in an MS Excel file in csv format, since we only downloaded these compounds from existing chemical databases that do not offer the sdf format. I think it is not convenient to manually open each compound in ChemDraw to pool all compounds. I am looking into PandasTools, but the documentation only indicates that sdf can be converted to csv and not vice versa.

Has anyone worked on a similar task before? Your response is very much appreciated.

Thank you.
Joanna
[Rdkit-discuss] Multiline legend in MolsToGridImage
[image: Screenshot from 2020-04-08 16-28-37.png]

Hi,

Does anyone know how to write multiline legends when using MolsToGridImage? I've been trying the code [here](https://sourceforge.net/p/rdkit/mailman/message/35561198/), but nothing there seems to work for me: I only get a blank rectangle in place of the \n or \r characters (see picture).

Are there any ideas?

Thanks,
--
Gustavo Seabra.
Re: [Rdkit-discuss] RDKit Chem.MolFromPDBFile ignores some files...
Thanks. Yes, I too understood that it should get the connectivity from the distances. I'm using PDB because it is the output of another program. I'll see what I can change then.

Thanks,
Gustavo.
--
Gustavo Seabra

From: Alan Kerstjens Medina
Sent: Sunday, April 5, 2020 9:15:26 AM
To: Gustavo Seabra; rdkit-discuss@lists.sourceforge.net
Subject: RE: [Rdkit-discuss] RDKit Chem.MolFromPDBFile ignores some files...

Hi Gustavo,

I haven’t looked into the RDKit source code for this, but I assume this has to do with the lack of CONECT records in the PDB file you attached (i.e. you are only storing atom coordinates, not connectivity). From what I could gather from the RDKit documentation, the default behaviour of the MolFromPDBFile function is to “sense” bonds based on atom proximity (proximityBonding=True), but I guess that isn’t happening. Maybe someone else could chime in and clarify how to make that feature work as intended.

Is there any particular reason you want to use PDB files for small molecules? They tend to be a bit of a headache and not particularly efficient storage-wise. If atom coordinates are important, maybe it would be easier to use SDF or MOL2 files instead.

Best regards,
Alan

From: Gustavo Seabra
Sent: 04 April 2020 22:08
To: rdkit-discuss@lists.sourceforge.net
Subject: [Rdkit-discuss] RDKit Chem.MolFromPDBFile ignores some files...

Hi all,

I'm having another problem when reading a PDB file. Some files just return "None", with no error message at all. For example, the attached file:

>>> Chem.MolFromPDBFile("./a3.pdb")

does not return a Mol object. Does anyone know what is wrong with this file? I can open it regularly in other programs. Is there any way to "force" rdkit to recognize the file?

Thanks,
--
Gustavo Seabra
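A quick way to test the CONECT hypothesis is to count record types in the file before handing it to RDKit. Below is a minimal sketch using only the standard library. `count_pdb_records` is a hypothetical helper, and the two-atom sample text is made up for illustration (it is not the a3.pdb from this thread).

```python
from collections import Counter


def count_pdb_records(pdb_text: str) -> Counter:
    """Count PDB records by name (the first six columns of each line)."""
    counts = Counter()
    for line in pdb_text.splitlines():
        record = line[:6].strip()
        if record:
            counts[record] += 1
    return counts


# Made-up two-atom fragment: coordinates only, no CONECT records.
sample = """\
HETATM    1  C1  LIG A   1       0.000   0.000   0.000  1.00  0.00           C
HETATM    2  O1  LIG A   1       1.200   0.000   0.000  1.00  0.00           O
END
"""

counts = count_pdb_records(sample)
n_atoms = counts["ATOM"] + counts["HETATM"]
print(f"atoms: {n_atoms}, CONECT records: {counts['CONECT']}")
```

On a real file, pass in the file contents instead of `sample`. If the CONECT count is zero, any bonds have to come from proximity perception, which is the case Alan describes.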
[Rdkit-discuss] RDKit Chem.MolFromPDBFile ignores some files...
Hi all,

I'm having another problem when reading a PDB file. Some files just return "None", with no error message at all. For example, the attached file:

>>> Chem.MolFromPDBFile("./a3.pdb")

does not return a Mol object. Does anyone know what is wrong with this file? I can open it regularly in other programs. Is there any way to "force" rdkit to recognize the file?

Thanks,
--
Gustavo Seabra

Attachment: a3.pdb
[Rdkit-discuss] Help mapping atoms between two files
Hi all,

I'm trying to get the substructure matches between two PDB files that contain the same molecule but with different atom order and naming. However, GetSubstructMatches just returns nothing, i.e. no matches (files attached). For example:

>>> ref_mol = Chem.MolFromPDBFile(str("a1.pdb"))
>>> tgt_mol = Chem.MolFromPDBFile(str("a2.pdb"))
>>> ref_mol.GetNumAtoms(),tgt_mol.GetNumAtoms()
(27, 27)
>>> ref_mol.GetSubstructMatches(tgt_mol)
()
>>> ref_mol.HasSubstructMatch(tgt_mol)
False

Could anyone here suggest a different way to get the atom mapping between the two molecules?

Thanks a lot,
--
Gustavo Seabra

Attachments: a1.pdb, a2.pdb
[Rdkit-discuss] PandasTools LoadSDF: Different treatment of SMILES depending on presence of 'MOL' column?
Hi all,

I'm trying to load a DrugBank library into a Pandas DataFrame in two different ways: with and without creating a 'Mol' column during load. In principle I'm only interested in the SMILES, so creating the 'Mol' column should not be necessary. However, I noticed that the two procedures actually generate a different number of molecules, and the SMILES are not necessarily the same:

1. Creating the 'Mol' column: 2,410 molecules
2. Not creating the 'Mol' column: 2,413 molecules

I assumed the difference was due to a few molecules for which RDKit could not generate the 'Mol' entry for some reason, and which were then silently dropped. So, I tried to find the difference between the sets with:

>>> drugbank.merge(drugbank_nomol, how='outer', on='SMILES', indicator=True).loc[lambda x: x['_merge'] == 'right_only']

Which, assuming the SMILES are the same, *should* be 3, but it returns 1865 records (!), meaning the SMILES are mostly different between the sets.

Could someone help me figure out what is going on here? To avoid attaching files here, I put a test database and a Jupyter Notebook with the example here:
https://www.dropbox.com/s/v8kf7vzpmrjkidl/RDKit_test.zip?dl=0

Thanks a lot!
--
Gustavo Seabra
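The merge above compares SMILES as plain strings, so two spellings of the same molecule count as different records. Below is a minimal sketch of the same left-only / right-only / both bookkeeping with Python sets, ignoring duplicate rows. The strings are made up rather than taken from the DrugBank data, and `compare_keys` is a hypothetical helper.

```python
def compare_keys(left, right):
    """Mimic an outer merge with indicator=True on a single key column."""
    left, right = set(left), set(right)
    return {
        "left_only": sorted(left - right),
        "right_only": sorted(right - left),
        "both": sorted(left & right),
    }


# Ethanol written two ways: same molecule, different strings.
with_mol = ["CCO", "c1ccccc1"]
without_mol = ["OCC", "c1ccccc1", "CC(=O)O"]

result = compare_keys(with_mol, without_mol)
for label, keys in result.items():
    print(label, keys)
```

Here `OCC` shows up as right-only even though it is the same molecule as `CCO`. If the two load paths write SMILES differently, putting both columns through one canonicalization step before comparing should remove that kind of mismatch.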