Hi Gustavo,

Looks like you found a solution for your deduplication task. Would you mind 
sharing it with us? (Seems some emails in the chain are missing.)


I'm curious - returning to your original question, did we figure out why the 
same InChIKey was given for the stereoisomers?


Adelene

Doctoral Researcher
Environmental Cheminformatics
UNIVERSITÉ DU LUXEMBOURG

LUXEMBOURG CENTRE FOR SYSTEMS BIOMEDICINE
6, avenue du Swing, L-4367 Belvaux
T +356 46 66 44 67 18
GitHub: adelenelai





________________________________
From: Gustavo Seabra <gustavo.sea...@gmail.com>
Sent: Thursday, October 29, 2020 10:23:20 PM
To: Paolo Tosco
Cc: Igor Pletnev; RDKit Discuss
Subject: Re: [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI Key

Aha! Fantastic!

Thanks a lot!!
Gustavo.

--
Gustavo Seabra

________________________________
From: Paolo Tosco <paolo.tosco.m...@gmail.com>
Sent: Thursday, October 29, 2020 5:13:33 PM
To: Gustavo Seabra <gustavo.sea...@gmail.com>
Cc: Igor Pletnev <igor.plet...@gmail.com>; RDKit Discuss 
<rdkit-discuss@lists.sourceforge.net>
Subject: Re: [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI Key

Hi Gustavo,

you can pass InChI options to the underlying InChI API through the options 
parameter of Chem.inchi.MolToInchi() and Chem.inchi.MolToInchiKey(); e.g.:

from rdkit.Chem import inchi
inchi.MolToInchi(mol, options="/FixedH")  # mol is an existing RDKit Mol object

Source: 
https://www.rdkit.org/docs/source/rdkit.Chem.inchi.html?highlight=inchi#rdkit.Chem.inchi.MolBlockToInchi

Cheers,
p.

On Thu, Oct 29, 2020 at 9:42 PM Gustavo Seabra 
<gustavo.sea...@gmail.com> wrote:
Ok, thanks!
--
Gustavo Seabra.


On Thu, Oct 29, 2020 at 4:33 PM Igor Pletnev 
<igor.plet...@gmail.com> wrote:
>  Is this "/FixedH" an option in RDKit? How to use that? (I don't see it in 
> the docs).

Sorry, I am not that proficient in RDKit and cannot answer exactly. Anyway, this 
option is available in InChI API calls, and I am pretty sure that it is also 
available in RDKit.

I recall that a couple of years ago, at some InChI event, Greg Landrum somewhat 
surprised me by saying that he himself often uses non-Standard InChI instead of 
the Standard one, exactly to distinguish tautomers.
So I guess Greg can answer how this is arranged in RDKit.

Regards,
Igor





On Thu, 29 Oct 2020 at 23:03, Gustavo Seabra 
<gustavo.sea...@gmail.com> wrote:
That does make sense, I understand it now, thanks!

Is this "/FixedH" an option in RDKit? How to use that? (I don't see it in the 
docs).

Thanks,
--
Gustavo Seabra.


On Wed, Oct 28, 2020 at 6:10 PM Igor Pletnev 
<igor.plet...@gmail.com> wrote:
Hi Gustavo,

>  ... I was generating the InChI Keys to get a unique hash for each compound, 
> thinking it would be better than SMILES (guaranteed to be unique), but that is 
> clearly not the case. On the bright side, I won't lose time generating 
> InChIs...

though InChI is not perfect, in this case it behaves as intended.
Please see below.

The molecules under discussion contain a substituted guanidine fragment 
(RHN)C(=NMe)(NHR')

It is subject to tautomerism, and in different tautomers a different C-N bond 
is the double bond:
(RHN)C(=NMe)(NHR')
(RHN)C(NHMe)(=NR')
(RN=)C(NHMe)(NHR')

You generated Standard InChI, which is evidenced by "InChI=1S/" prefix in the 
examples.
Standard InChI is specifically designed to produce the same identifier for all 
tautomers (by indicating that two hydrogens are shared by three nitrogen atoms, 
for any tautomer; bond orders are not indicated in InChI).
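
As a quick way to check which flavour of identifier one is holding, here is a minimal sketch (not from the InChI software itself; the function names are made up for illustration). It relies on the "InChI=1S/" prefix mentioned above and on the InChIKey layout as I understand it from the InChI documentation: in the second, 10-character block, the ninth character is "S" for Standard and "N" for non-Standard.

```python
def is_standard_inchi(inchi: str) -> bool:
    """True if the string carries the Standard InChI (version 1) prefix."""
    return inchi.startswith("InChI=1S/")


def is_standard_inchikey(key: str) -> bool:
    """True if the flag character of the InChIKey marks a Standard InChI.

    Assumed layout: 14-character block, 10-character block, 1 character,
    separated by hyphens; the flag is the ninth character of the second block.
    """
    blocks = key.split("-")
    if len(blocks) != 3 or [len(b) for b in blocks] != [14, 10, 1]:
        raise ValueError(f"not a 27-character InChIKey: {key!r}")
    return blocks[1][8] == "S"


# Methane as a Standard InChI, and the key reported in the original post.
print(is_standard_inchi("InChI=1S/CH4/h1H4"))               # True
print(is_standard_inchikey("AQIXAKUUQRKLND-UHFFFAOYSA-N"))  # True
```

A key generated with a non-Standard option such as /FixedH would be expected to carry "N" in that position instead.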

As the tautomer-invariant Std InChI does not know which C-N bond is actually 
double, the only option for treating stereo is to ignore it completely 
as a drawing artifact.

All in all:
Standard InChI means that the exact tautomeric form is unknown ==> all 
tautomers are mapped to the same generic representation ==> the exact C-N 
double bond placement in this generic representation is unspecified ==> C-N 
double bond stereo is ignored ==> the generated StdInChI and Std InChIKey are 
the same for cis/trans forms that look different as drawn.

Once again, this behavior is by design; it is intended for maximal 
interoperability while comparing different drawings of the "same" compound.

If, for any reason, you would like to treat your examples as definite 
and resolvable structures, each having its own identifier, just use 
non-Standard InChI.
The InChI that preserves the exact positions of tautomeric H's and the double 
bond ("as drawn") is produced by specifying the /FixedH option upon generation.

More on this may be found in the InChI FAQ:
https://www.inchi-trust.org/technical-faq-2/

Hope this helps.

Regards,
Igor



On Mon, Oct 26, 2020 at 6:56 PM Gustavo Seabra 
<gustavo.sea...@gmail.com> wrote:
Thanks a lot Peter and Adelene,

Yes, it looks like canonical SMILES is the way to go, and I have no problem 
sticking with RDKit. I was generating the InChI Keys to get a unique hash for 
each compound, thinking it would be better than SMILES (guaranteed to be 
unique), but that is clearly not the case. On the bright side, I won't lose time 
generating InChIs...

Can I trust that the same molecule will always get the same canonical SMILES 
from RDKit, independent of how it is read? (Different SDF files, geometries, 
atom orders, etc.?)

All the best,
Gustavo.


--
Gustavo Seabra.


On Sun, Oct 25, 2020 at 8:27 PM Peter S. Shenkin 
<shen...@gmail.com> wrote:
Canonical SMILES is probably the way to go, but you might also be able to use 
the InchiKey and the Inchi auxiliary information together as a compound hash 
key.
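
To picture the compound-hash idea: deduplication keeps one record per distinct key, so whatever goes into the key decides which isomers survive. The sketch below is only an illustration; `deduplicate` and the record layout are made-up names. The InChIKey is the one reported for both isomers in the original post. For simplicity the second key component is a SMILES string rather than the InChI aux info, and it is the input SMILES standing in for the canonical SMILES that RDKit would produce in real use.

```python
from typing import Callable, Hashable, Iterable, List, TypeVar

T = TypeVar("T")


def deduplicate(items: Iterable[T], key: Callable[[T], Hashable]) -> List[T]:
    """Keep the first item seen for each distinct key, preserving input order."""
    seen = set()
    unique: List[T] = []
    for item in items:
        k = key(item)
        if k not in seen:
            seen.add(k)
            unique.append(item)
    return unique


# Precomputed identifiers for the two isomers from the original post.
# "smiles" holds the input SMILES as a stand-in for a canonical SMILES.
records = [
    {"smiles": "C/N=C(/NC#N)NCCSCc1nc[nH]c1C",
     "inchikey": "AQIXAKUUQRKLND-UHFFFAOYSA-N"},
    {"smiles": "C/N=C(\\NC#N)NCCSCc1nc[nH]c1C",
     "inchikey": "AQIXAKUUQRKLND-UHFFFAOYSA-N"},
]

by_key = deduplicate(records, key=lambda r: r["inchikey"])
by_both = deduplicate(records, key=lambda r: (r["inchikey"], r["smiles"]))
print(len(by_key))   # 1: the Standard InChIKey alone merges the two isomers
print(len(by_both))  # 2: the composite key keeps them apart
```

Any hashable value works as the key, so the same helper could take a canonical SMILES alone or a non-Standard InChIKey.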

-P.

On Sun, Oct 25, 2020 at 10:53 AM Adelene LAI 
<adelene....@uni.lu> wrote:

Hi Gustavo,

(Sorry, forgot to reply all before...)

Your deduplication task is quite familiar to me and something I do quite a lot 
of in my own work ;)

Can I suggest deduplicating using Canonical SMILES?

It doesn't solve your InChIKey issue, but it is a solution for now.

I updated my gist to show that it is feasible:

https://gist.github.com/adelenelai/59a8794e1f030941c19bcb50aa8adf3f

Adelene

________________________________
From: Gustavo Seabra <gustavo.sea...@gmail.com>
Sent: Sunday, October 25, 2020 2:27:15 PM
To: Adelene LAI
Subject: Re: [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI Key

Actually, I was trying to generate all stereoisomers for molecules in a 
database, and filter duplicate molecules by using the InChI Key to detect 
duplicates. But it gives cis/trans isomers on sp2-N the same Key.

Gustavo.

--
Gustavo Seabra

________________________________
From: Adelene LAI <adelene....@uni.lu>
Sent: Sunday, October 25, 2020 1:44:01 AM
To: Gustavo Seabra <gustavo.sea...@gmail.com>
Subject: Re: [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI Key


Hi Gustavo,

It occurred to me while swimming yesterday - was there a reason you pointed out 
the hybridisation state of N in your original subject text?

Was it just to specify which N to focus on, or did you expect something special 
about sp2 hybridisation wrt InChIKey?

Adelene

________________________________
From: Gustavo Seabra <gustavo.sea...@gmail.com>
Sent: Saturday, October 24, 2020 5:37:09 AM
To: RDKit Discuss; Adelene LAI
Subject: Re: [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI Key

Thanks for looking into it. I'm happy to see it wasn't just a 
mistake by me ;-)

I hope we can find what's wrong there.

Best,
Gustavo.

--
Gustavo Seabra

________________________________
From: Adelene LAI <adelene....@uni.lu>
Sent: Friday, October 23, 2020 11:28:55 PM
To: Gustavo Seabra <gustavo.sea...@gmail.com>; RDKit Discuss 
<rdkit-discuss@lists.sourceforge.net>
Subject: Re: [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI Key


Hi Gustavo,

https://gist.github.com/adelenelai/59a8794e1f030941c19bcb50aa8adf3f

In the gist above, I tried doing some further investigating.

It seems for the example you gave, the rdkit functions indeed give the same 
inchikey and inchi, but different aux info.

Why this different aux info doesn't translate into different inchikeys/inchis, 
I'm not sure.

Adelene

________________________________
From: Gustavo Seabra <gustavo.sea...@gmail.com>
Sent: Friday, October 23, 2020 6:43:07 PM
To: RDKit Discuss
Subject: [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI Key

Hi all,

I ran into an issue here, and I'd appreciate your input. I noticed that 
compounds that differ only in the cis/trans isomerism around an sp2 
nitrogen get the same InChI Key from RDKit. For example:

> from rdkit import Chem
> inchi_cis = Chem.inchi.MolToInchiKey(Chem.MolFromSmiles("C/N=C(/NC#N)NCCSCc1nc[nH]c1C"))
> inchi_cis
'AQIXAKUUQRKLND-UHFFFAOYSA-N'

> inchi_trans = Chem.inchi.MolToInchiKey(Chem.MolFromSmiles("C/N=C(\\NC#N)NCCSCc1nc[nH]c1C"))
> inchi_trans
'AQIXAKUUQRKLND-UHFFFAOYSA-N'

> inchi_cis == inchi_trans
True

I wonder if this is a limitation of the InChI Key definition, or an 
implementation issue.

Thanks a lot,
--
Gustavo Seabra.
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss