Sure, here it is:

1. The question:

> I noticed that compounds that differ only on the cis-trans isomerization
> around an sp2 nitrogen get the same InChI Key from RDKit. For example:
>
> inchi_cis = Chem.inchi.MolToInchiKey(Chem.MolFromSmiles("C/N=C(/NC#N)NCCSCc1nc[nH]c1C"))
> inchi_cis
> 'AQIXAKUUQRKLND-UHFFFAOYSA-N'
>
> inchi_trans = Chem.inchi.MolToInchiKey(Chem.MolFromSmiles("C/N=C(\\NC#N)NCCSCc1nc[nH]c1C"))
> inchi_trans
> 'AQIXAKUUQRKLND-UHFFFAOYSA-N'
>
> inchi_cis == inchi_trans
> True
>
> I wonder if this is a limitation of the InChI Key definition, or an
> implementation issue.

The answer to the question, in the end, was that the InChI Keys were
behaving as intended, by design, as pointed out by Igor Pletnev:

> though InChI is not perfect, in this case it behaves as intended.
> Please see below.
>
> The discussed molecules contain substituted guanidine fragment
> (RHN)C(=NMe)(NHR')
>
> It is subjected to tautomerism, and in different tautomers different C-N
> bonds have double order:
> (RHN)C(=NMe)(NHR')
> (RHN)C(NHMe)(=NR')
> (RN=)C(NHMe)(NHR')
>
> You generated Standard InChI, which is evidenced by "InChI=1S/" prefix in
> the examples.
> Standard InChI is specifically designed to produce the same identifier for
> all tautomers (by indicating that two hydrogens are shared by three
> nitrogen atoms, for any tautomer; bond orders are not indicated in InChI).
>
> As the tautomer-invariant Std InChI does not know which C-N bond is
> actually a double, there is the only option for treating stereo -- to
> completely ignore it as a drawing artifact.
>
> All in all:
> Standard InChI means that the exact tautomeric form is unknown ==> all
> tautomers are mapped to the same generic representation ==> the exact C-N
> double bond placement in this generic is unspecified ==> C-N double bond
> stereo is ignored ==> generated StdInChI and Std InChIKey are the same for
> seemingly different, by initial drawing, cis/trans forms.
>
> Once again, this behavior is by design; it is intended for maximal
> interoperability while comparing different drawings of the "same" compound.
>
> If, for any reason, you would like to consider your examples as the
> definite and resolvable structures, each having its own identifier, just
> use non-Standard InChI.
> The InChI which preserves the exact positions of tautomeric H's and double
> bond ("as drawn") is produced by just specifying option /FixedH upon
> generation.
>
> More on this may be found in InChI FAQ:
> https://www.inchi-trust.org/technical-faq-2/

The only question remaining was how to use this "/FixedH" option in RDKit,
and that was answered by Paolo Tosco:

> you can pass InChI options to the underlying InChI API through the options
> parameter of Chem.inchi.MolToInchi() and Chem.inchi.MolToInchiKey(); e.g.:
>
> inchi.MolToInchi(mol, options="/FixedH")
>
> Source:
> https://www.rdkit.org/docs/source/rdkit.Chem.inchi.html?highlight=inchi#rdkit.Chem.inchi.MolBlockToInchi

And this is what I'm using now to remove duplicate molecules from my
database. I'm using a Pandas DataFrame and, with the more recent versions
of Pandas, the following works fine:

> df['InChI Key'] = df[mol_col].progress_apply(Chem.MolToInchiKey, options="/FixedH")
> df.drop_duplicates(subset=['InChI Key'], keep='first', inplace=True)

All the best,
--
Gustavo Seabra.

On Fri, Oct 30, 2020 at 4:47 AM Adelene LAI <adelene....@uni.lu> wrote:

> Hi Gustavo,
>
> Looks like you found a solution for your deduplication task. Would you
> mind sharing it with us? (Seems some emails in the chain are missing.)
>
> I'm curious - returning to your original question, did we figure out why
> the same InChIKey was given for the stereoisomers?
>
> Adelene
>
> Doctoral Researcher
> Environmental Cheminformatics
> UNIVERSITÉ DU LUXEMBOURG
> LUXEMBOURG CENTRE FOR SYSTEMS BIOMEDICINE
> 6, avenue du Swing, L-4367 Belvaux
> T +356 46 66 44 67 18
> GitHub: adelenelai
>
> ------------------------------
> *From:* Gustavo Seabra <gustavo.sea...@gmail.com>
> *Sent:* Thursday, October 29, 2020 10:23:20 PM
> *To:* Paolo Tosco
> *Cc:* Igor Pletnev; RDKit Discuss
> *Subject:* Re: [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI Key
>
> Aha! Fantastic!
>
> Thanks a lot!!
> Gustavo.
>
> --
> Gustavo Seabra
>
> ------------------------------
> *From:* Paolo Tosco <paolo.tosco.m...@gmail.com>
> *Sent:* Thursday, October 29, 2020 5:13:33 PM
> *To:* Gustavo Seabra <gustavo.sea...@gmail.com>
> *Cc:* Igor Pletnev <igor.plet...@gmail.com>; RDKit Discuss <rdkit-discuss@lists.sourceforge.net>
> *Subject:* Re: [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI Key
>
> Hi Gustavo,
>
> you can pass InChI options to the underlying InChI API through the options
> parameter of Chem.inchi.MolToInchi() and Chem.inchi.MolToInchiKey();
> e.g.:
>
> inchi.MolToInchi(mol, options="/FixedH")
>
> Source:
> https://www.rdkit.org/docs/source/rdkit.Chem.inchi.html?highlight=inchi#rdkit.Chem.inchi.MolBlockToInchi
>
> Cheers,
> p.
>
> On Thu, Oct 29, 2020 at 9:42 PM Gustavo Seabra <gustavo.sea...@gmail.com> wrote:
>
> Ok, thanks!
> --
> Gustavo Seabra.
>
> On Thu, Oct 29, 2020 at 4:33 PM Igor Pletnev <igor.plet...@gmail.com> wrote:
>
> > Is this "/FixedH" an option in RDKit? How to use that? (I don't see it
> > in the docs).
>
> Sorry, I am not so proficient in RDKit and can not answer exactly. Anyway,
> this option is available in InChI API calls, and I am pretty sure that it
> is also available in RDKit.
>
> I recall that couple of years ago, on some InChI event, Greg Landrum
> somewhat surprised me by saying that he himself often uses non-Standard
> InChI instead of Standard one -- exactly to distinguish tautomers.
> So I guess Greg can answer on how it is arranged in RDKit.
>
> Regards,
> Igor
>
> On Thu, 29 Oct 2020 at 23:03, Gustavo Seabra <gustavo.sea...@gmail.com> wrote:
>
> That does make sense, I understand it now, thanks!
>
> Is this "/FixedH" an option in RDKit? How to use that? (I don't see it in
> the docs).
>
> Thanks,
> --
> Gustavo Seabra.
>
> On Wed, Oct 28, 2020 at 6:10 PM Igor Pletnev <igor.plet...@gmail.com> wrote:
>
> Hi Gustavo,
>
> > ... I was generating the InChI Keys to get a unique hash for each
> > compound, thinking it would be better than SMILES (guaranteed to be
> > unique), but is clearly not the case. On the bright side, I won't lose
> > time generating InChIs...
>
> though InChI is not perfect, in this case it behaves as intended.
> Please see below.
>
> The discussed molecules contain substituted guanidine fragment
> (RHN)C(=NMe)(NHR')
>
> It is subjected to tautomerism, and in different tautomers different C-N
> bonds have double order:
> (RHN)C(=NMe)(NHR')
> (RHN)C(NHMe)(=NR')
> (RN=)C(NHMe)(NHR')
>
> You generated Standard InChI, which is evidenced by "InChI=1S/" prefix in
> the examples.
> Standard InChI is specifically designed to produce the same identifier for
> all tautomers (by indicating that two hydrogens are shared by three
> nitrogen atoms, for any tautomer; bond orders are not indicated in InChI).
>
> As the tautomer-invariant Std InChI does not know which C-N bond is
> actually a double, there is the only option for treating stereo -- to
> completely ignore it as a drawing artifact.
>
> All in all:
> Standard InChI means that the exact tautomeric form is unknown ==> all
> tautomers are mapped to the same generic representation ==> the exact C-N
> double bond placement in this generic is unspecified ==> C-N double bond
> stereo is ignored ==> generated StdInChI and Std InChIKey are the same for
> seemingly different, by initial drawing, cis/trans forms.
>
> Once again, this behavior is by design; it is intended for maximal
> interoperability while comparing different drawings of the "same" compound.
>
> If, for any reason, you would like to consider your examples as the
> definite and resolvable structures, each having its own identifier, just
> use non-Standard InChI.
> The InChI which preserves the exact positions of tautomeric H's and double
> bond ("as drawn") is produced by just specifying option /FixedH upon
> generation.
>
> More on this may be found in InChI FAQ:
> https://www.inchi-trust.org/technical-faq-2/
>
> Hope this helps.
>
> Regards,
> Igor
>
> On Mon, Oct 26, 2020 at 6:56 PM Gustavo Seabra <gustavo.sea...@gmail.com> wrote:
>
> Thanks a lot Peter and Adelene,
>
> Yes, it looks like canonical SMILES is the way to go, and I have no
> problem sticking with RDKit. I was generating the InChI Keys to get a
> unique hash for each compound, thinking it would be better than SMILES
> (guaranteed to be unique), but is clearly not the case. On the bright side,
> I won't lose time generating InChIs...
>
> Can I trust that the same molecule will always get the same canonical
> SMILES from RDKit, independent of how it is read? (Different SDF files,
> geometries, atom orders, etc.?)
>
> All the best,
> Gustavo.
>
> --
> Gustavo Seabra.
>
> On Sun, Oct 25, 2020 at 8:27 PM Peter S. Shenkin <shen...@gmail.com> wrote:
>
> Canonical SMILES is probably the way to go, but you might also be able to
> use the InchiKey and the Inchi auxiliary information together as a compound
> hash key.
>
> -P.
>
> On Sun, Oct 25, 2020 at 10:53 AM Adelene LAI <adelene....@uni.lu> wrote:
>
> Hi Gustavo,
>
> (Sorry, forgot to reply all before...)
>
> Your deduplication task is quite familiar to me and something I do quite a
> lot of in my own work ;)
>
> Can I suggest deduplicating using Canonical SMILES?
>
> It doesn't solve your InChIKey issue, but it is a solution for now.
>
> I updated my gist to show that it is feasible:
>
> https://gist.github.com/adelenelai/59a8794e1f030941c19bcb50aa8adf3f
>
> Adelene
>
> ------------------------------
> *From:* Gustavo Seabra <gustavo.sea...@gmail.com>
> *Sent:* Sunday, October 25, 2020 2:27:15 PM
> *To:* Adelene LAI
> *Subject:* Re: [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI Key
>
> Actually, I was trying to generate all stereoisomers for molecules in a
> database, and filter duplicate molecules by using the InChI Key to detect
> duplicates. But it gives cis/trans isomers on sp2-N the same Key.
>
> Gustavo.
>
> --
> Gustavo Seabra
>
> ------------------------------
> *From:* Adelene LAI <adelene....@uni.lu>
> *Sent:* Sunday, October 25, 2020 1:44:01 AM
> *To:* Gustavo Seabra <gustavo.sea...@gmail.com>
> *Subject:* Re: [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI Key
>
> Hi Gustavo,
>
> It occurred to me while swimming yesterday - was there a reason you
> pointed out the hybridisation state of N in your original subject text?
>
> Was it just to specify which N to focus on, or did you expect something
> special about sp2 hybridisation wrt InChIKey?
>
> Adelene
>
> ------------------------------
> *From:* Gustavo Seabra <gustavo.sea...@gmail.com>
> *Sent:* Saturday, October 24, 2020 5:37:09 AM
> *To:* RDKit Discuss; Adelene LAI
> *Subject:* Re: [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI Key
>
> Thanks for looking into it. I'm happy to see it wasn't just a mistake by
> me ;-)
>
> I hope we can find what's wrong there.
>
> Best,
> Gustavo.
>
> --
> Gustavo Seabra
>
> ------------------------------
> *From:* Adelene LAI <adelene....@uni.lu>
> *Sent:* Friday, October 23, 2020 11:28:55 PM
> *To:* Gustavo Seabra <gustavo.sea...@gmail.com>; RDKit Discuss <rdkit-discuss@lists.sourceforge.net>
> *Subject:* Re: [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI Key
>
> Hi Gustavo,
>
> https://gist.github.com/adelenelai/59a8794e1f030941c19bcb50aa8adf3f
>
> In the gist above, I tried doing some further investigating.
>
> It seems for the example you gave, the rdkit functions indeed give the
> same inchikey and inchi, but different aux info.
>
> Why this different aux info doesn't translate into different
> inchikeys/inchis, I'm not sure.
>
> Adelene
>
> ------------------------------
> *From:* Gustavo Seabra <gustavo.sea...@gmail.com>
> *Sent:* Friday, October 23, 2020 6:43:07 PM
> *To:* RDKit Discuss
> *Subject:* [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI Key
>
> Hi all,
>
> I run into an issue here, and I'd appreciate your input. I noticed that
> compounds that differ only on the cis-trans isomerization around an sp2
> nitrogen get the same InChI Key from RDKit. For example:
>
> inchi_cis = Chem.inchi.MolToInchiKey(Chem.MolFromSmiles("C/N=C(/NC#N)NCCSCc1nc[nH]c1C"))
> inchi_cis
> 'AQIXAKUUQRKLND-UHFFFAOYSA-N'
>
> inchi_trans = Chem.inchi.MolToInchiKey(Chem.MolFromSmiles("C/N=C(\\NC#N)NCCSCc1nc[nH]c1C"))
> inchi_trans
> 'AQIXAKUUQRKLND-UHFFFAOYSA-N'
>
> inchi_cis == inchi_trans
> True
>
> I wonder if this is a limitation of the InChI Key definition, or an
> implementation issue.
>
> Thanks a lot,
>
> --
> Gustavo Seabra.
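
For anyone who wants to reproduce the comparison discussed in this thread
without Pandas, here is a minimal sketch of the deduplication step. The
helper names (drop_duplicates_by_key, inchikey_from_smiles) are made up for
illustration and are not part of RDKit. The only RDKit calls assumed are
Chem.MolFromSmiles and Chem.inchi.MolToInchiKey with its options argument,
as described in the thread. RDKit is imported lazily, so the generic helper
can be tried even where RDKit is not installed.

```python
from typing import Callable, Hashable, Iterable, List, TypeVar

T = TypeVar("T")


def drop_duplicates_by_key(items: Iterable[T],
                           key_func: Callable[[T], Hashable]) -> List[T]:
    """Keep the first item seen for each key, preserving input order."""
    seen = set()
    unique = []
    for item in items:
        key = key_func(item)
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


def inchikey_from_smiles(smiles: str, options: str = "") -> str:
    """Hypothetical helper: InChIKey for a SMILES string.

    `options` is handed to the underlying InChI API, e.g. "/FixedH"
    for a non-Standard InChI that keeps tautomeric H positions as drawn.
    """
    # Lazy import so drop_duplicates_by_key works without RDKit.
    from rdkit import Chem
    from rdkit.Chem import inchi

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    return inchi.MolToInchiKey(mol, options)


if __name__ == "__main__":
    # The two SMILES from the original question.
    smiles_list = [
        "C/N=C(/NC#N)NCCSCc1nc[nH]c1C",
        "C/N=C(\\NC#N)NCCSCc1nc[nH]c1C",
    ]
    try:
        by_standard = drop_duplicates_by_key(smiles_list, inchikey_from_smiles)
        by_fixed_h = drop_duplicates_by_key(
            smiles_list, lambda s: inchikey_from_smiles(s, "/FixedH")
        )
    except ImportError:
        print("RDKit is not installed; skipping the InChI comparison.")
    else:
        print("Unique by Standard InChIKey:", len(by_standard))
        print("Unique by /FixedH InChIKey:", len(by_fixed_h))
```

Going by the thread, the Standard keys should collapse the two SMILES into
one entry, while the /FixedH keys are expected to keep them apart. Neither
count is hard-coded here, so check the printed numbers against your own
RDKit build.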
_______________________________________________ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss