Aha! Fantastic! Thanks a lot!! Gustavo.
--
Gustavo Seabra

________________________________
From: Paolo Tosco <paolo.tosco.m...@gmail.com>
Sent: Thursday, October 29, 2020 5:13:33 PM
To: Gustavo Seabra <gustavo.sea...@gmail.com>
Cc: Igor Pletnev <igor.plet...@gmail.com>; RDKit Discuss <rdkit-discuss@lists.sourceforge.net>
Subject: Re: [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI Key

Hi Gustavo,

you can pass InChI options to the underlying InChI API through the options parameter of Chem.inchi.MolToInchi() and Chem.inchi.MolToInchiKey(); e.g.:

    inchi.MolToInchi(mol, options="/FixedH")

Source: https://www.rdkit.org/docs/source/rdkit.Chem.inchi.html?highlight=inchi#rdkit.Chem.inchi.MolBlockToInchi

Cheers,
p.

On Thu, Oct 29, 2020 at 9:42 PM Gustavo Seabra <gustavo.sea...@gmail.com> wrote:

Ok, thanks!
--
Gustavo Seabra.

On Thu, Oct 29, 2020 at 4:33 PM Igor Pletnev <igor.plet...@gmail.com> wrote:

> Is this "/FixedH" an option in RDKit? How to use that? (I don't see it in
> the docs).

Sorry, I am not so proficient in RDKit and cannot answer exactly. Anyway, this option is available in InChI API calls, and I am pretty sure that it is also available in RDKit.

I recall that a couple of years ago, at an InChI event, Greg Landrum somewhat surprised me by saying that he himself often uses non-Standard InChI instead of the Standard one, precisely to distinguish tautomers. So I guess Greg can answer how it is arranged in RDKit.

Regards,
Igor

On Thu, 29 Oct 2020 at 23:03, Gustavo Seabra <gustavo.sea...@gmail.com> wrote:

That does make sense, I understand it now, thanks!

Is this "/FixedH" an option in RDKit? How to use that? (I don't see it in the docs).

Thanks,
--
Gustavo Seabra.

On Wed, Oct 28, 2020 at 6:10 PM Igor Pletnev <igor.plet...@gmail.com> wrote:

Hi Gustavo,

> ...
> I was generating the InChI Keys to get a unique hash for each compound,
> thinking it would be better than SMILES (guaranteed to be unique), but that
> is clearly not the case. On the bright side, I won't lose time generating
> InChIs...

Though InChI is not perfect, in this case it behaves as intended. Please see below.

The discussed molecules contain a substituted guanidine fragment:

    (RHN)C(=NMe)(NHR')

It is subject to tautomerism, and in different tautomers a different C-N bond is the double bond:

    (RHN)C(=NMe)(NHR')
    (RHN)C(NHMe)(=NR')
    (RN=)C(NHMe)(NHR')

You generated Standard InChI, as evidenced by the "InChI=1S/" prefix in the examples. Standard InChI is specifically designed to produce the same identifier for all tautomers (by indicating that two hydrogens are shared by three nitrogen atoms, for any tautomer; bond orders are not indicated in InChI).

As the tautomer-invariant Std InChI does not know which C-N bond is actually the double bond, the only option for treating stereo is to ignore it completely as a drawing artifact.

All in all:
Standard InChI means that the exact tautomeric form is unknown
==> all tautomers are mapped to the same generic representation
==> the exact C-N double bond placement in this generic representation is unspecified
==> C-N double bond stereo is ignored
==> the generated StdInChI and Std InChIKey are the same for cis/trans forms that look different as initially drawn.

Once again, this behavior is by design; it is intended for maximal interoperability when comparing different drawings of the "same" compound.

If, for any reason, you would like to treat your examples as definite and resolvable structures, each having its own identifier, just use non-Standard InChI. The InChI which preserves the exact positions of tautomeric H's and the double bond ("as drawn") is produced by specifying the /FixedH option upon generation.

More on this may be found in the InChI FAQ:
https://www.inchi-trust.org/technical-faq-2/

Hope this helps.
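
As a rough illustration of the Standard versus non-Standard distinction described above, the sketch below generates both flavours with RDKit. The options="/FixedH" call is the one quoted elsewhere in this thread; the helper names is_standard_inchi and standard_and_fixed_h_inchi are made up for this example, and RDKit is imported inside the function only so that the prefix check can be used without it.

```python
def is_standard_inchi(inchi_string):
    """Return True if the identifier carries the Standard InChI prefix."""
    # Standard InChI starts with "InChI=1S/"; other prefixes are non-Standard.
    return inchi_string.startswith("InChI=1S/")


def standard_and_fixed_h_inchi(smiles):
    """Return (standard, fixed_h) InChI strings for one SMILES (hypothetical helper)."""
    # Imported lazily so is_standard_inchi() stays usable without RDKit installed.
    from rdkit import Chem
    from rdkit.Chem import inchi

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")

    standard = inchi.MolToInchi(mol)                    # default options
    fixed_h = inchi.MolToInchi(mol, options="/FixedH")  # option named in this thread
    return standard, fixed_h
```

For the two SMILES in the original question, comparing the second element of each returned pair is one way to check whether /FixedH separates the cis and trans forms in a given RDKit build.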
Regards,
Igor

On Mon, Oct 26, 2020 at 6:56 PM Gustavo Seabra <gustavo.sea...@gmail.com> wrote:

Thanks a lot Peter and Adelene,

Yes, it looks like canonical SMILES is the way to go, and I have no problem sticking with RDKit. I was generating the InChI Keys to get a unique hash for each compound, thinking it would be better than SMILES (guaranteed to be unique), but that is clearly not the case. On the bright side, I won't lose time generating InChIs...

Can I trust that the same molecule will always get the same canonical SMILES from RDKit, independent of how it is read? (Different SDF files, geometries, atom orders, etc.?)

All the best,
Gustavo.
--
Gustavo Seabra.

On Sun, Oct 25, 2020 at 8:27 PM Peter S. Shenkin <shen...@gmail.com> wrote:

Canonical SMILES is probably the way to go, but you might also be able to use the InChIKey and the InChI auxiliary information together as a compound hash key.

-P.

On Sun, Oct 25, 2020 at 10:53 AM Adelene LAI <adelene....@uni.lu> wrote:

Hi Gustavo,

(Sorry, forgot to reply all before...)

Your deduplication task is quite familiar to me and something I do quite a lot of in my own work ;)

Can I suggest deduplicating using Canonical SMILES? It doesn't solve your InChIKey issue, but it is a solution for now.
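
A minimal sketch of that kind of deduplication, assuming the input is a list of SMILES strings. The function names deduplicate and canonical_smiles are hypothetical, and this is not taken from the gist; it only shows the general idea of keeping the first record per canonical SMILES.

```python
def deduplicate(records, key):
    """Keep the first record seen for each distinct key(record), in input order."""
    seen = {}
    for record in records:
        # setdefault only stores the record if this key has not been seen yet.
        seen.setdefault(key(record), record)
    return list(seen.values())


def canonical_smiles(smiles):
    """Return RDKit's canonical SMILES, with stereo marks kept (hypothetical helper)."""
    # Imported lazily so deduplicate() can be used with any key function.
    from rdkit import Chem

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    # MolToSmiles is canonical by default; isomericSmiles=True keeps "/" and "\" bond marks.
    return Chem.MolToSmiles(mol, isomericSmiles=True)
```

With RDKit installed, deduplicate(smiles_list, key=canonical_smiles) would keep one entry per canonical string; the key function is passed in so it can be swapped for another identifier later.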
I updated my gist to show that it is feasible:
https://gist.github.com/adelenelai/59a8794e1f030941c19bcb50aa8adf3f

Adelene

Doctoral Researcher
Environmental Cheminformatics
UNIVERSITÉ DU LUXEMBOURG
LUXEMBOURG CENTRE FOR SYSTEMS BIOMEDICINE
6, avenue du Swing, L-4367 Belvaux
T +356 46 66 44 67 18
GitHub: adelenelai

________________________________
From: Gustavo Seabra <gustavo.sea...@gmail.com>
Sent: Sunday, October 25, 2020 2:27:15 PM
To: Adelene LAI
Subject: Re: [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI Key

Actually, I was trying to generate all stereoisomers for molecules in a database, and filter duplicate molecules by using the InChI Key to detect duplicates. But it gives cis/trans isomers on sp2-N the same Key.

Gustavo.
--
Gustavo Seabra

________________________________
From: Adelene LAI <adelene....@uni.lu>
Sent: Sunday, October 25, 2020 1:44:01 AM
To: Gustavo Seabra <gustavo.sea...@gmail.com>
Subject: Re: [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI Key

Hi Gustavo,

It occurred to me while swimming yesterday - was there a reason you pointed out the hybridisation state of N in your original subject text? Was it just to specify which N to focus on, or did you expect something special about sp2 hybridisation wrt InChIKey?
Adelene

________________________________
From: Gustavo Seabra <gustavo.sea...@gmail.com>
Sent: Saturday, October 24, 2020 5:37:09 AM
To: RDKit Discuss; Adelene LAI
Subject: Re: [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI Key

Thanks for looking into it. I'm happy to see it wasn't just a mistake by me ;-) I hope we can find what's wrong there.

Best,
Gustavo.
--
Gustavo Seabra

________________________________
From: Adelene LAI <adelene....@uni.lu>
Sent: Friday, October 23, 2020 11:28:55 PM
To: Gustavo Seabra <gustavo.sea...@gmail.com>; RDKit Discuss <rdkit-discuss@lists.sourceforge.net>
Subject: Re: [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI Key

Hi Gustavo,

https://gist.github.com/adelenelai/59a8794e1f030941c19bcb50aa8adf3f

In the gist above, I tried doing some further investigating. It seems for the example you gave, the rdkit functions indeed give the same inchikey and inchi, but different aux info. Why this different aux info doesn't translate into different inchikeys/inchis, I'm not sure.
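
One way to look at the aux info alongside the key, and to try the "InChIKey plus auxiliary information" hash idea mentioned in this thread, is sketched below. The helper names and the choice of SHA-256 are assumptions made for illustration. AuxInfo records how the input was numbered and drawn, so it may also differ between two inputs of the same structure; that is worth verifying before relying on it as an identifier.

```python
import hashlib


def compound_hash(inchi_key, aux_info):
    """Combine an InChIKey and its AuxInfo string into one hex digest (illustrative)."""
    payload = f"{inchi_key}\n{aux_info}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def inchikey_and_auxinfo(smiles):
    """Return (InChIKey, AuxInfo) for one SMILES via RDKit (hypothetical helper)."""
    # Imported lazily so compound_hash() can be used on precomputed strings.
    from rdkit import Chem
    from rdkit.Chem import inchi

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    # MolToInchiAndAuxInfo returns the InChI string together with its AuxInfo string.
    inchi_string, aux_info = inchi.MolToInchiAndAuxInfo(mol)
    return inchi.InchiToInchiKey(inchi_string), aux_info
```

Calling compound_hash(*inchikey_and_auxinfo(smiles)) for each input gives a single string to compare, at the cost of inheriting whatever input dependence the AuxInfo carries.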
Adelene

________________________________
From: Gustavo Seabra <gustavo.sea...@gmail.com>
Sent: Friday, October 23, 2020 6:43:07 PM
To: RDKit Discuss
Subject: [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI Key

Hi all,

I ran into an issue here, and I'd appreciate your input. I noticed that compounds that differ only in the cis-trans isomerization around an sp2 nitrogen get the same InChI Key from RDKit. For example:

> from rdkit import Chem
> inchi_cis = Chem.inchi.MolToInchiKey(Chem.MolFromSmiles("C/N=C(/NC#N)NCCSCc1nc[nH]c1C"))
> inchi_cis
'AQIXAKUUQRKLND-UHFFFAOYSA-N'
> inchi_trans = Chem.inchi.MolToInchiKey(Chem.MolFromSmiles("C/N=C(\\NC#N)NCCSCc1nc[nH]c1C"))
> inchi_trans
'AQIXAKUUQRKLND-UHFFFAOYSA-N'
> inchi_cis == inchi_trans
True

I wonder if this is a limitation of the InChI Key definition, or an implementation issue.

Thanks a lot,
--
Gustavo Seabra.

_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss