That makes sense, and I understand it now, thanks! Is "/FixedH" an option in RDKit? If so, how do I use it? (I don't see it in the docs.)
Thanks,
--
Gustavo Seabra.


On Wed, Oct 28, 2020 at 6:10 PM Igor Pletnev <igor.plet...@gmail.com> wrote:

> Hi Gustavo,
>
> > ... I was generating the InChI Keys to get a unique hash for each
> compound, thinking it would be better than SMILES (guaranteed to be
> unique), but is clearly not the case. On the bright side, I won't lose time
> generating InChIs...
>
> though InChI is not perfect, in this case it behaves as intended.
> Please see below.
>
> The discussed molecules contain substituted guanidine fragment
> (RHN)C(=NMe)(NHR')
>
> It is subjected to tautomerism, and in different tautomers different C-N
> bonds have double order:
> (RHN)C(=NMe)(NHR')
> (RHN)C(NHMe)(=NR')
> (RN=)C(NHMe)(NHR')
>
> You generated Standard InChI, which is evidenced by "InChI=1S/" prefix in
> the examples.
> Standard InChI is specifically designed to produce the same identifier for
> all tautomers (by indicating that two hydrogens are shared by three
> nitrogen atoms, for any tautomer; bond orders are not indicated in InChI).
>
> As the tautomer-invariant Std InChI does not know which C-N bond is
> actually a double, there is the only option for treating stereo -- to
> completely ignore it as a drawing artifact.
>
> All in all:
> Standard InChI means that the exact tautomeric form is unknown ==> all
> tautomers are mapped to the same generic representation ==> the exact C-N
> double bond placement in this generic is unspecified ==> C-N double bond
> stereo is ignored ==> generated StdInChI and Std InChIKey are the same for
> seemingly different, by initial drawing, cis/trans forms.
>
> Once again, this behavior is by design; it is intended for maximal
> interoperability while comparing different drawings of the "same" compound.
>
> If, for any reason, you would like to consider your examples as the
> definite and resolvable structures, each having its own identifier, just
> use non-Standard InChI.
> The InChI which preserves the exact positions of tautomeric H's and double
> bond ("as drawn") is produced by just specifying option /FixedH upon
> generation.
>
> More on this may be found in InChI FAQ:
> https://www.inchi-trust.org/technical-faq-2/
>
> Hope this helps.
>
> Regards,
> Igor
>
>
> On Mon, Oct 26, 2020 at 6:56 PM Gustavo Seabra <gustavo.sea...@gmail.com>
> wrote:
>
>> Thanks a lot Peter and Adelene,
>>
>> Yes, it looks like canonical SMILES is the way to go, and I have no
>> problem sticking with RDKit. I was generating the InChI Keys to get a
>> unique hash for each compound, thinking it would be better than SMILES
>> (guaranteed to be unique), but is clearly not the case. On the bright side,
>> I won't lose time generating InChIs...
>>
>> Can I trust that the same molecule will always get the same canonical
>> SMILES from RDKit, independent of how it is read? (Different SDF files,
>> geometries, atom orders, etc.?)
>>
>> All the best,
>> Gustavo.
>>
>> --
>> Gustavo Seabra.
>>
>>
>> On Sun, Oct 25, 2020 at 8:27 PM Peter S. Shenkin <shen...@gmail.com>
>> wrote:
>>
>>> Canonical SMILES is probably the way to go, but you might also be able
>>> to use the InchiKey and the Inchi auxiliary information together as a
>>> compound hash key.
>>>
>>> -P.
>>>
>>> On Sun, Oct 25, 2020 at 10:53 AM Adelene LAI <adelene....@uni.lu> wrote:
>>>
>>>> Hi Gustavo,
>>>>
>>>> (Sorry, forgot to reply all before...)
>>>>
>>>> Your deduplication task is quite familiar to me and something I do
>>>> quite a lot of in my own work ;)
>>>>
>>>> Can I suggest deduplicating using Canonical SMILES?
>>>>
>>>> It doesn't solve your InChIKey issue, but it is a solution for now.
>>>>
>>>> I updated my gist to show that it is feasible:
>>>>
>>>> https://gist.github.com/adelenelai/59a8794e1f030941c19bcb50aa8adf3f
>>>>
>>>> Adelene
>>>>
>>>> Doctoral Researcher
>>>> Environmental Cheminformatics
>>>> UNIVERSITÉ DU LUXEMBOURG
>>>> LUXEMBOURG CENTRE FOR SYSTEMS BIOMEDICINE
>>>> 6, avenue du Swing, L-4367 Belvaux
>>>> T +356 46 66 44 67 18
>>>> GitHub: adelenelai
>>>>
>>>> ------------------------------
>>>> *From:* Gustavo Seabra <gustavo.sea...@gmail.com>
>>>> *Sent:* Sunday, October 25, 2020 2:27:15 PM
>>>> *To:* Adelene LAI
>>>> *Subject:* Re: [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI Key
>>>>
>>>> Actually, I was trying to generate all stereoisomers for molecules in
>>>> a database, and filter duplicate molecules by using the InChI Key to
>>>> detect duplicates. But it gives cis/trans isomers on sp2-N the same Key.
>>>>
>>>> Gustavo.
>>>>
>>>> --
>>>> Gustavo Seabra
>>>>
>>>> ------------------------------
>>>> *From:* Adelene LAI <adelene....@uni.lu>
>>>> *Sent:* Sunday, October 25, 2020 1:44:01 AM
>>>> *To:* Gustavo Seabra <gustavo.sea...@gmail.com>
>>>> *Subject:* Re: [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI Key
>>>>
>>>> Hi Gustavo,
>>>>
>>>> It occurred to me while swimming yesterday - was there a reason you
>>>> pointed out the hybridisation state of N in your original subject text?
>>>>
>>>> Was it just to specify which N to focus on, or did you expect something
>>>> special about sp2 hybridisation wrt InChIKey?
>>>>
>>>> Adelene
>>>>
>>>> ------------------------------
>>>> *From:* Gustavo Seabra <gustavo.sea...@gmail.com>
>>>> *Sent:* Saturday, October 24, 2020 5:37:09 AM
>>>> *To:* RDKit Discuss; Adelene LAI
>>>> *Subject:* Re: [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI Key
>>>>
>>>> Thanks for looking into it. I'm happy to see it wasn't just a mistake
>>>> by me ;-)
>>>>
>>>> I hope we can find what's wrong there.
>>>>
>>>> Best,
>>>> Gustavo.
>>>>
>>>> --
>>>> Gustavo Seabra
>>>>
>>>> ------------------------------
>>>> *From:* Adelene LAI <adelene....@uni.lu>
>>>> *Sent:* Friday, October 23, 2020 11:28:55 PM
>>>> *To:* Gustavo Seabra <gustavo.sea...@gmail.com>; RDKit Discuss <rdkit-discuss@lists.sourceforge.net>
>>>> *Subject:* Re: [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI Key
>>>>
>>>> Hi Gustavo,
>>>>
>>>> https://gist.github.com/adelenelai/59a8794e1f030941c19bcb50aa8adf3f
>>>>
>>>> In the gist above, I tried doing some further investigating.
>>>>
>>>> It seems for the example you gave, the rdkit functions indeed give the
>>>> same inchikey and inchi, but different aux info.
>>>>
>>>> Why this different aux info doesn't translate into different
>>>> inchikeys/inchis, I'm not sure.
>>>>
>>>> Adelene
>>>>
>>>> ------------------------------
>>>> *From:* Gustavo Seabra <gustavo.sea...@gmail.com>
>>>> *Sent:* Friday, October 23, 2020 6:43:07 PM
>>>> *To:* RDKit Discuss
>>>> *Subject:* [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI Key
>>>>
>>>> Hi all,
>>>>
>>>> I ran into an issue here, and I'd appreciate your input. I noticed that
>>>> compounds that differ only in the cis-trans isomerization around an sp2
>>>> nitrogen get the same InChI Key from RDKit. For example:
>>>>
>>>> > inchi_cis = Chem.inchi.MolToInchiKey(Chem.MolFromSmiles("C/N=C(/NC#N)NCCSCc1nc[nH]c1C"))
>>>> > inchi_cis
>>>> 'AQIXAKUUQRKLND-UHFFFAOYSA-N'
>>>>
>>>> > inchi_trans = Chem.inchi.MolToInchiKey(Chem.MolFromSmiles("C/N=C(\\NC#N)NCCSCc1nc[nH]c1C"))
>>>> > inchi_trans
>>>> 'AQIXAKUUQRKLND-UHFFFAOYSA-N'
>>>>
>>>> > inchi_cis == inchi_trans
>>>> True
>>>>
>>>> I wonder if this is a limitation of the InChI Key definition, or an
>>>> implementation issue.
>>>>
>>>> Thanks a lot,
>>>> --
>>>> Gustavo Seabra.
>>>> _______________________________________________
>>>> Rdkit-discuss mailing list
>>>> Rdkit-discuss@lists.sourceforge.net
>>>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
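
For what it's worth, here is a minimal sketch of how the /FixedH option Igor mentions could be passed through RDKit, next to the canonical-SMILES key suggested by Adelene and Peter. It assumes that Chem.MolToInchi takes an "options" string and that your RDKit build forwards "/FixedH" to the InChI library as-is. The helper names (dedupe_by_key, fixedh_inchikey, canonical_smiles_key) are made up for illustration and are not part of RDKit.

```python
from typing import Callable, Dict, Iterable, List


def dedupe_by_key(smiles_list: Iterable[str],
                  key_fn: Callable[[str], str]) -> List[str]:
    """Keep the first SMILES seen for each distinct key, in input order."""
    seen: Dict[str, str] = {}
    for smi in smiles_list:
        # setdefault stores smi only if this key has not been seen before
        seen.setdefault(key_fn(smi), smi)
    return list(seen.values())


def _mol_from_smiles(smiles: str):
    """Parse a SMILES string, raising instead of returning None on failure."""
    # RDKit is imported lazily so dedupe_by_key itself has no RDKit dependency.
    from rdkit import Chem
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    return mol


def fixedh_inchikey(smiles: str) -> str:
    """InChIKey of a non-standard InChI generated with /FixedH.

    Assumption: the options string is passed through to the InChI library.
    """
    from rdkit import Chem
    inchi = Chem.MolToInchi(_mol_from_smiles(smiles), options="/FixedH")
    return Chem.InchiToInchiKey(inchi)


def canonical_smiles_key(smiles: str) -> str:
    """Canonical SMILES as written by RDKit, usable as a dictionary key."""
    from rdkit import Chem
    return Chem.MolToSmiles(_mol_from_smiles(smiles))
```

Calling dedupe_by_key(smiles_list, fixedh_inchikey) would then replace the original InChIKey-based filter, and dedupe_by_key(smiles_list, canonical_smiles_key) is the SMILES-based alternative. Going by Igor's explanation, the two guanidine isomers from the first post should get different keys with /FixedH, but that is worth checking on your own data. Keep in mind that /FixedH keys also separate tautomers drawn differently, which may or may not be what you want for deduplication.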