Thanks a lot Peter and Adelene,

Yes, it looks like canonical SMILES is the way to go, and I have no problem
sticking with RDKit. I was generating the InChI Keys to get a unique hash
for each compound, thinking they would be better than SMILES (guaranteed to
be unique), but that is clearly not the case. On the bright side, I won't lose
time generating InChIs...
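
For the record, here is a minimal sketch of the deduplication keyed on
canonical SMILES instead of InChI Keys, using the two isomers from my
original message. The helper name dedup_by_canonical_smiles is only for
illustration (it is not an RDKit function), and I pass isomericSmiles=True
explicitly so the double-bond stereo is kept whatever the default is in a
given RDKit version.

```python
from rdkit import Chem

# The two sp2-N isomers from the original message, plus a repeat of the first.
smiles_in = [
    "C/N=C(/NC#N)NCCSCc1nc[nH]c1C",
    "C/N=C(\\NC#N)NCCSCc1nc[nH]c1C",
    "C/N=C(/NC#N)NCCSCc1nc[nH]c1C",  # deliberate duplicate
]


def dedup_by_canonical_smiles(smiles_list):
    """Return {canonical SMILES: Mol}, keeping the first molecule seen per key."""
    unique = {}
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:  # unparsable input: skip it
            continue
        key = Chem.MolToSmiles(mol, isomericSmiles=True)
        unique.setdefault(key, mol)
    return unique


unique = dedup_by_canonical_smiles(smiles_in)
for key in unique:
    print(key)
```

The duplicate entry collapses onto the first one, while the two isomers
keep separate keys because the stereo bond is written into the SMILES.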

Can I trust that the same molecule will always get the same canonical
SMILES from RDKit, independent of how it is read? (Different SDF files,
geometries, atom orders, etc.?)
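
A small check along these lines, as a sketch rather than a proof: the same
isomer written with three different atom orders should collapse to a single
canonical SMILES, while the other isomer should not. The chlorofluoroethene
example is my own choice for illustration. It says nothing about reading 3D
geometries from SDF files, about different tautomers or protonation states,
or about comparing strings generated with different RDKit versions.

```python
from rdkit import Chem

# Three ways of writing (E)-1-chloro-2-fluoroethene with different atom orders.
variants = ["F/C=C/Cl", "Cl/C=C/F", "C(\\F)=C/Cl"]
canonical = {
    Chem.MolToSmiles(Chem.MolFromSmiles(s), isomericSmiles=True)
    for s in variants
}

# The (Z) isomer, for comparison.
z_smiles = Chem.MolToSmiles(Chem.MolFromSmiles("F/C=C\\Cl"), isomericSmiles=True)

print(canonical)
print(z_smiles)
```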

All the best,
Gustavo.


--
Gustavo Seabra.


On Sun, Oct 25, 2020 at 8:27 PM Peter S. Shenkin <shen...@gmail.com> wrote:

> Canonical SMILES is probably the way to go, but you might also be able to
> use the InChIKey and the InChI auxiliary information together as a compound
> hash key.
>
> -P.
>
> On Sun, Oct 25, 2020 at 10:53 AM Adelene LAI <adelene....@uni.lu> wrote:
>
>> Hi Gustavo,
>>
>>
>> (Sorry, forgot to reply all before...)
>>
>>
>> Your deduplication task is quite familiar to me and something I do quite
>> a lot of in my own work ;)
>>
>>
>> Can I suggest deduplicating using Canonical SMILES?
>>
>>
>> It doesn't solve your InChIKey issue, but it is a solution for now.
>>
>>
>> I updated my gist to show that it is feasible:
>>
>>
>> https://gist.github.com/adelenelai/59a8794e1f030941c19bcb50aa8adf3f
>>
>>
>> Adelene
>>
>>
>>
>> Doctoral Researcher
>>
>> Environmental Cheminformatics
>>
>> UNIVERSITÉ DU LUXEMBOURG
>>
>>
>> LUXEMBOURG CENTRE FOR SYSTEMS BIOMEDICINE
>>
>> 6, avenue du Swing, L-4367 Belvaux
>>
>> T +356 46 66 44 67 18
>>
>> [image: github.png] adelenelai
>>
>>
>>
>>
>>
>> ------------------------------
>> *From:* Gustavo Seabra <gustavo.sea...@gmail.com>
>> *Sent:* Sunday, October 25, 2020 2:27:15 PM
>> *To:* Adelene LAI
>> *Subject:* Re: [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI
>> Key
>>
>> Actually, I was trying to generate all stereoisomers for molecules in a
>> database, and filter duplicate molecules by using the InChI Key to detect
>> duplicates. But it gives cis/trans isomers around an sp2 N the same key.
>>
>> Gustavo.
>>
>> --
>> Gustavo Seabra
>>
>> ------------------------------
>> *From:* Adelene LAI <adelene....@uni.lu>
>> *Sent:* Sunday, October 25, 2020 1:44:01 AM
>> *To:* Gustavo Seabra <gustavo.sea...@gmail.com>
>> *Subject:* Re: [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI
>> Key
>>
>>
>> Hi Gustavo,
>>
>>
>> It occurred to me while swimming yesterday - was there a reason you
>> pointed out the hybridisation state of N in your original subject text?
>>
>>
>> Was it just to specify which N to focus on, or did you expect something
>> special about sp2 hybridisation wrt InChIKey?
>>
>>
>> Adelene
>>
>>
>>
>> ------------------------------
>> *From:* Gustavo Seabra <gustavo.sea...@gmail.com>
>> *Sent:* Saturday, October 24, 2020 5:37:09 AM
>> *To:* RDKit Discuss; Adelene LAI
>> *Subject:* Re: [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI
>> Key
>>
>> Thanks for looking into it. I'm happy to see it wasn't just a mistake by
>> me ;-)
>>
>> I hope we can find what's wrong there.
>>
>> Best,
>> Gustavo.
>>
>> --
>> Gustavo Seabra
>>
>> ------------------------------
>> *From:* Adelene LAI <adelene....@uni.lu>
>> *Sent:* Friday, October 23, 2020 11:28:55 PM
>> *To:* Gustavo Seabra <gustavo.sea...@gmail.com>; RDKit Discuss <
>> rdkit-discuss@lists.sourceforge.net>
>> *Subject:* Re: [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI
>> Key
>>
>>
>> Hi Gustavo,
>>
>>
>> https://gist.github.com/adelenelai/59a8794e1f030941c19bcb50aa8adf3f
>>
>>
>> In the gist above, I did some further investigating.
>>
>>
>> It seems that, for the example you gave, the RDKit functions indeed give
>> the same InChIKey and InChI, but different aux info.
>>
>>
>> Why this different aux info doesn't translate into different
>> InChIKeys/InChIs, I'm not sure.
>>
>>
>> Adelene
>>
>>
>>
>>
>>
>>
>>
>> ------------------------------
>> *From:* Gustavo Seabra <gustavo.sea...@gmail.com>
>> *Sent:* Friday, October 23, 2020 6:43:07 PM
>> *To:* RDKit Discuss
>> *Subject:* [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI Key
>>
>> Hi all,
>>
>> I ran into an issue here, and I'd appreciate your input. I noticed that
>> compounds that differ only in the cis/trans configuration around an sp2
>> nitrogen get the same InChI Key from RDKit. For example:
>>
>> > inchi_cis = Chem.inchi.MolToInchiKey(Chem.MolFromSmiles("C/N=C(/NC#N)NCCSCc1nc[nH]c1C"))
>> > inchi_cis
>> 'AQIXAKUUQRKLND-UHFFFAOYSA-N'
>>
>> > inchi_trans = Chem.inchi.MolToInchiKey(Chem.MolFromSmiles("C/N=C(\\NC#N)NCCSCc1nc[nH]c1C"))
>> > inchi_trans
>> 'AQIXAKUUQRKLND-UHFFFAOYSA-N'
>>
>> > inchi_cis == inchi_trans
>> True
>>
>> I wonder if this is a limitation of the InChI Key definition, or an
>> implementation issue.
>>
>> Thanks a lot,
>> --
>> Gustavo Seabra.
>> _______________________________________________
>> Rdkit-discuss mailing list
>> Rdkit-discuss@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>
>