Sure, here it is:

1. The question:

"I noticed that compounds that differ only in the cis-trans isomerization
> around an sp2 nitrogen get the same InChI Key from RDKit. For example:
> > inchi_cis = Chem.inchi.MolToInchiKey(Chem.MolFromSmiles("C/N=C(/NC#N)NCCSCc1nc[nH]c1C"))
> > inchi_cis
> 'AQIXAKUUQRKLND-UHFFFAOYSA-N'
> > inchi_trans = Chem.inchi.MolToInchiKey(Chem.MolFromSmiles("C/N=C(\\NC#N)NCCSCc1nc[nH]c1C"))
> > inchi_trans
> 'AQIXAKUUQRKLND-UHFFFAOYSA-N'
> > inchi_cis == inchi_trans
> True
> I wonder if this is a limitation of the InChI Key definition, or an
> implementation issue."


The answer to the question, in the end, was that the InChI Keys were
behaving as intended, by design, as pointed out by Igor Pletnev:

though InChI is not perfect, in this case it behaves as intended.
> Please see below.
> The discussed molecules contain substituted guanidine fragment
> (RHN)C(=NMe)(NHR')
> It is subject to tautomerism, and in different tautomers different C-N
> bonds have double order:
> (RHN)C(=NMe)(NHR')
> (RHN)C(NHMe)(=NR')
> (RN=)C(NHMe)(NHR')
> You generated Standard InChI, which is evidenced by "InChI=1S/" prefix in
> the examples.
> Standard InChI is specifically designed to produce the same identifier for
> all tautomers (by indicating that two hydrogens are shared by three
> nitrogen atoms, for any tautomer; bond orders are not indicated in InChI).
> As the tautomer-invariant Std InChI does not know which C-N bond is
> actually double, there is only one option for treating stereo -- to
> completely ignore it as a drawing artifact.
> All in all:
> Standard InChI means that the exact tautomeric form is unknown ==> all
> tautomers are mapped to the same generic representation ==>  the exact C-N
> double bond placement in this generic is unspecified ==> C-N double bond
> stereo is ignored ==> generated StdInChI and Std InChIKey are the same for
> seemingly different, by initial drawing, cis/trans forms.
> Once again, this behavior is by design; it is intended for maximal
> interoperability while comparing different drawings of the "same" compound.
> If, for any reason, you would like to consider your examples as the
> definite and resolvable structures, each having its own identifier, just
> use non-Standard InChI.
> The InChI which preserves the exact positions of tautomeric H's and double
> bond ("as drawn") is produced by just specifying option /FixedH upon
> generation.
> More on this may be found in InChI FAQ:
> https://www.inchi-trust.org/technical-faq-2/


The only question remaining was how to use this "/FixedH" option in RDKit,
and that was answered by Paolo Tosco:

you can pass InChI options to the underlying InChI API through the options
> parameter of Chem.inchi.MolToInchi() and Chem.inchi.MolToInchiKey(); e.g.:
> inchi.MolToInchi(mol, options="/FixedH")
> Source:
> https://www.rdkit.org/docs/source/rdkit.Chem.inchi.html?highlight=inchi#rdkit.Chem.inchi.MolBlockToInchi
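
A minimal sketch of how the options parameter could be applied to the two
SMILES from the question, comparing the default (Standard) output with the
"/FixedH" output. The helper name inchi_and_key is made up for illustration;
only MolFromSmiles, MolToInchi and MolToInchiKey are RDKit calls. No expected
keys are shown for the "/FixedH" case, since they are best checked locally:

```python
from rdkit import Chem

# The two SMILES from the original question (raw strings keep the backslash literal).
SMILES_CIS = r"C/N=C(/NC#N)NCCSCc1nc[nH]c1C"
SMILES_TRANS = r"C/N=C(\NC#N)NCCSCc1nc[nH]c1C"


def inchi_and_key(smiles, options=""):
    """Hypothetical helper: return (InChI, InChIKey), forwarding `options` to the InChI API."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    return (
        Chem.MolToInchi(mol, options=options),
        Chem.MolToInchiKey(mol, options=options),
    )


for label, options in (("standard", ""), ("fixed-H", "/FixedH")):
    inchi_c, key_c = inchi_and_key(SMILES_CIS, options)
    inchi_t, key_t = inchi_and_key(SMILES_TRANS, options)
    # The text before the first "/" is "InChI=1S" for Standard InChI
    # and "InChI=1" for non-Standard InChI.
    print(label, inchi_c.split("/")[0], key_c, key_t, "same key:", key_c == key_t)
```

The prefix printed on each line is a quick way to confirm which flavour of
InChI was actually generated.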


And this is what I'm using now to remove duplicate molecules from my
database. I'm using a Pandas DataFrame and, with the more recent versions
of Pandas, the following works fine:

> df['InChI Key'] = df[mol_col].progress_apply(Chem.MolToInchiKey, options="/FixedH")
> df.drop_duplicates(subset=['InChI Key'], keep='first', inplace=True)
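
For anyone who wants to try the same idea on a toy table, here is a
self-contained sketch. It assumes a column of SMILES strings rather than
molecule objects, uses plain apply instead of progress_apply (which is only
available after calling tqdm.pandas()), and the names fixedh_key, "smiles"
and deduped are made up for the example:

```python
import pandas as pd
from rdkit import Chem


def fixedh_key(smiles):
    """Hypothetical helper: non-Standard (/FixedH) InChIKey, or None if parsing fails."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return Chem.MolToInchiKey(mol, options="/FixedH")


# Toy data: two spellings of ethanol and two spellings of benzene.
df = pd.DataFrame({"smiles": ["CCO", "OCC", "c1ccccc1", "C1=CC=CC=C1"]})
df["InChI Key"] = df["smiles"].apply(fixedh_key)

# Keep the first row for each key; later rows with the same key are dropped.
deduped = df.drop_duplicates(subset=["InChI Key"], keep="first").reset_index(drop=True)
print(deduped)
```

Note that drop_duplicates treats missing keys as equal to each other, so rows
that fail to parse would collapse into a single row unless they are filtered
out first.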

All the best,
--
Gustavo Seabra.


On Fri, Oct 30, 2020 at 4:47 AM Adelene LAI <adelene....@uni.lu> wrote:

> Hi Gustavo,
>
>
> Looks like you found a solution for your deduplication task. Would you
> mind sharing it with us? (Seems some emails in the chain are missing.)
>
>
> I'm curious - returning to your original question, did we figure out why
> the same InChIKey was given for the stereoisomers?
>
>
> Adelene
>
>
> Doctoral Researcher
>
> Environmental Cheminformatics
>
> UNIVERSITÉ DU LUXEMBOURG
>
>
> LUXEMBOURG CENTRE FOR SYSTEMS BIOMEDICINE
>
> 6, avenue du Swing, L-4367 Belvaux
>
> T +356 46 66 44 67 18
>
> GitHub: adelenelai
>
>
>
>
>
> ------------------------------
> *From:* Gustavo Seabra <gustavo.sea...@gmail.com>
> *Sent:* Thursday, October 29, 2020 10:23:20 PM
> *To:* Paolo Tosco
> *Cc:* Igor Pletnev; RDKit Discuss
> *Subject:* Re: [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI Key
>
> Aha! Fantastic!
>
> Thanks a lot!!
> Gustavo.
>
> --
> Gustavo Seabra
>
> ------------------------------
> *From:* Paolo Tosco <paolo.tosco.m...@gmail.com>
> *Sent:* Thursday, October 29, 2020 5:13:33 PM
> *To:* Gustavo Seabra <gustavo.sea...@gmail.com>
> *Cc:* Igor Pletnev <igor.plet...@gmail.com>; RDKit Discuss <
> rdkit-discuss@lists.sourceforge.net>
> *Subject:* Re: [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI Key
>
> Hi Gustavo,
>
> you can pass InChI options to the underlying InChI API through the options
> parameter of Chem.inchi.MolToInchi() and  Chem.inchi.MolToInchiKey();
> e.g.:
>
> inchi.MolToInchi(mol, options="/FixedH")
>
> Source:
> https://www.rdkit.org/docs/source/rdkit.Chem.inchi.html?highlight=inchi#rdkit.Chem.inchi.MolBlockToInchi
>
> Cheers,
> p.
>
> On Thu, Oct 29, 2020 at 9:42 PM Gustavo Seabra <gustavo.sea...@gmail.com>
> wrote:
>
> Ok, thanks!
> --
> Gustavo Seabra.
>
>
> On Thu, Oct 29, 2020 at 4:33 PM Igor Pletnev <igor.plet...@gmail.com>
> wrote:
>
> >  Is this "/FixedH" an option in RDKit? How to use that? (I don't see it
> in the docs).
>
> Sorry, I am not so proficient in RDKit and can not answer exactly. Anyway,
> this option is available in InChI API calls, and I am pretty sure that it
> is also available in RDKit.
>
> I recall that a couple of years ago, at some InChI event, Greg Landrum
> somewhat surprised me by saying that he himself often uses non-Standard
> InChI instead of the Standard one, exactly to distinguish tautomers.
> So I guess Greg can answer on how it is arranged in RDKit.
>
> Regards,
> Igor
>
>
>
>
>
> On Thu, 29 Oct 2020 at 23:03, Gustavo Seabra <gustavo.sea...@gmail.com>
> wrote:
>
> That does make sense, I understand it now, thanks!
>
> Is this "/FixedH" an option in RDKit? How to use that? (I don't see it in
> the docs).
>
> Thanks,
> --
> Gustavo Seabra.
>
>
> On Wed, Oct 28, 2020 at 6:10 PM Igor Pletnev <igor.plet...@gmail.com>
> wrote:
>
> Hi Gustavo,
>
> >  ... I was generating the InChI Keys to get a unique hash for each
> compound, thinking it would be better than SMILES (guaranteed to be
> unique), but is clearly not the case. On the bright side, I won't lose time
> generating InChIs...
>
> though InChI is not perfect, in this case it behaves as intended.
> Please see below.
>
> The discussed molecules contain substituted guanidine fragment
> (RHN)C(=NMe)(NHR')
>
> It is subject to tautomerism, and in different tautomers different C-N
> bonds have double order:
> (RHN)C(=NMe)(NHR')
> (RHN)C(NHMe)(=NR')
> (RN=)C(NHMe)(NHR')
>
> You generated Standard InChI, which is evidenced by "InChI=1S/" prefix in
> the examples.
> Standard InChI is specifically designed to produce the same identifier for
> all tautomers (by indicating that two hydrogens are shared by three
> nitrogen atoms, for any tautomer; bond orders are not indicated in InChI).
>
> As the tautomer-invariant Std InChI does not know which C-N bond is
> actually double, there is only one option for treating stereo -- to
> completely ignore it as a drawing artifact.
>
> All in all:
> Standard InChI means that the exact tautomeric form is unknown ==> all
> tautomers are mapped to the same generic representation ==>  the exact C-N
> double bond placement in this generic is unspecified ==> C-N double bond
> stereo is ignored ==> generated StdInChI and Std InChIKey are the same for
> seemingly different, by initial drawing, cis/trans forms.
>
> Once again, this behavior is by design; it is intended for maximal
> interoperability while comparing different drawings of the "same" compound.
>
> If, for any reason, you would like to consider your examples as the
> definite and resolvable structures, each having its own identifier, just
> use non-Standard InChI.
> The InChI which preserves the exact positions of tautomeric H's and double
> bond ("as drawn") is produced by just specifying option /FixedH upon
> generation.
>
> More on this may be found in InChI FAQ:
> https://www.inchi-trust.org/technical-faq-2/
>
> Hope this helps.
>
> Regards,
> Igor
>
>
>
> On Mon, Oct 26, 2020 at 6:56 PM Gustavo Seabra <gustavo.sea...@gmail.com>
> wrote:
>
> Thanks a lot Peter and Adelene,
>
> Yes, it looks like canonical SMILES is the way to go, and I have no
> problem sticking with RDKit. I was generating the InChI Keys to get a
> unique hash for each compound, thinking it would be better than SMILES
> (guaranteed to be unique), but is clearly not the case. On the bright side,
> I won't lose time generating InChIs...
>
> Can I trust that the same molecule will always get the same canonical
> SMILES from RDKit, independent of how it is read? (Different SDF files,
> geometries, atom orders, etc.?)
>
> All the best,
> Gustavo.
>
>
> --
> Gustavo Seabra.
>
>
> On Sun, Oct 25, 2020 at 8:27 PM Peter S. Shenkin <shen...@gmail.com>
> wrote:
>
> Canonical SMILES is probably the way to go, but you might also be able to
> use the InchiKey and the Inchi auxiliary information together as a compound
> hash key.
>
> -P.
>
> On Sun, Oct 25, 2020 at 10:53 AM Adelene LAI <adelene....@uni.lu> wrote:
>
> Hi Gustavo,
>
>
> (Sorry, forgot to reply all before...)
>
>
> Your deduplication task is quite familiar to me and something I do quite a
> lot of in my own work ;)
>
>
> Can I suggest deduplicating using Canonical SMILES?
>
>
> It doesn't solve your InChIKey issue, but it is a solution for now.
>
>
> I updated my gist to show that it is feasible:
>
>
> https://gist.github.com/adelenelai/59a8794e1f030941c19bcb50aa8adf3f
>
>
> Adelene
>
>
>
> Doctoral Researcher
>
> Environmental Cheminformatics
>
> UNIVERSITÉ DU LUXEMBOURG
>
>
> LUXEMBOURG CENTRE FOR SYSTEMS BIOMEDICINE
>
> 6, avenue du Swing, L-4367 Belvaux
>
> T +356 46 66 44 67 18
>
> GitHub: adelenelai
>
>
>
>
>
> ------------------------------
> *From:* Gustavo Seabra <gustavo.sea...@gmail.com>
> *Sent:* Sunday, October 25, 2020 2:27:15 PM
> *To:* Adelene LAI
> *Subject:* Re: [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI Key
>
> Actually,  I was trying to generate all stereoisomers for molecules in a
> database,  and filter duplicate molecules by using the InChI Key to detect
> duplicates.  But it gives cis/trans isomers on sp2-N the same Key.
>
> Gustavo.
>
> --
> Gustavo Seabra
>
> ------------------------------
> *From:* Adelene LAI <adelene....@uni.lu>
> *Sent:* Sunday, October 25, 2020 1:44:01 AM
> *To:* Gustavo Seabra <gustavo.sea...@gmail.com>
> *Subject:* Re: [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI Key
>
>
> Hi Gustavo,
>
>
> It occurred to me while swimming yesterday - was there a reason you
> pointed out the hybridisation state of N in your original subject text?
>
>
> Was it just to specify which N to focus on, or did you expect something
> special about sp2 hybridisation wrt InChIKey?
>
>
> Adelene
>
>
> Doctoral Researcher
>
> Environmental Cheminformatics
>
> UNIVERSITÉ DU LUXEMBOURG
>
>
> LUXEMBOURG CENTRE FOR SYSTEMS BIOMEDICINE
>
> 6, avenue du Swing, L-4367 Belvaux
>
> T +356 46 66 44 67 18
>
> GitHub: adelenelai
>
>
>
>
>
> ------------------------------
> *From:* Gustavo Seabra <gustavo.sea...@gmail.com>
> *Sent:* Saturday, October 24, 2020 5:37:09 AM
> *To:* RDKit Discuss; Adelene LAI
> *Subject:* Re: [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI Key
>
> Thanks for looking into it. I'm happy to see it wasn't just a mistake by
> me ;-)
>
> I hope we can find what's wrong there.
>
> Best,
> Gustavo.
>
> --
> Gustavo Seabra
>
> ------------------------------
> *From:* Adelene LAI <adelene....@uni.lu>
> *Sent:* Friday, October 23, 2020 11:28:55 PM
> *To:* Gustavo Seabra <gustavo.sea...@gmail.com>; RDKit Discuss <
> rdkit-discuss@lists.sourceforge.net>
> *Subject:* Re: [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI Key
>
>
> Hi Gustavo,
>
>
> https://gist.github.com/adelenelai/59a8794e1f030941c19bcb50aa8adf3f
>
>
> In the gist above, I tried doing some further investigating.
>
>
> It seems for the example you gave, the rdkit functions indeed give the
> same inchikey and inchi, but different aux info.
>
>
> Why this different aux info doesn't translate into different
> inchikeys/inchis, I'm not sure.
>
>
> Adelene
>
>
>
>
>
>
> Doctoral Researcher
>
> Environmental Cheminformatics
>
> UNIVERSITÉ DU LUXEMBOURG
>
>
> LUXEMBOURG CENTRE FOR SYSTEMS BIOMEDICINE
>
> 6, avenue du Swing, L-4367 Belvaux
>
> T +356 46 66 44 67 18
>
> GitHub: adelenelai
>
>
>
>
>
> ------------------------------
> *From:* Gustavo Seabra <gustavo.sea...@gmail.com>
> *Sent:* Friday, October 23, 2020 6:43:07 PM
> *To:* RDKit Discuss
> *Subject:* [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI Key
>
> Hi all,
>
> I ran into an issue here, and I'd appreciate your input. I noticed that
> compounds that differ only in the cis-trans isomerization around an sp2
> nitrogen get the same InChI Key from RDKit. For example:
>
> > inchi_cis = Chem.inchi.MolToInchiKey(Chem.MolFromSmiles("C/N=C(/NC#N)NCCSCc1nc[nH]c1C"))
> > inchi_cis
> 'AQIXAKUUQRKLND-UHFFFAOYSA-N'
>
> > inchi_trans = Chem.inchi.MolToInchiKey(Chem.MolFromSmiles("C/N=C(\\NC#N)NCCSCc1nc[nH]c1C"))
> > inchi_trans
> 'AQIXAKUUQRKLND-UHFFFAOYSA-N'
>
> > inchi_cis == inchi_trans
> True
>
> I wonder if this is a limitation of the InChI Key definition, or an
> implementation issue.
>
> Thanks a lot,
>
> --
> Gustavo Seabra.
> _______________________________________________
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
>
>
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
