Canonical SMILES is probably the way to go, but you might also be able to use the InChIKey and the InChI auxiliary information together as a composite hash key.
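In case it helps, here is a minimal sketch of the canonical SMILES route, assuming RDKit is installed. The helper name `dedupe_by_canonical_smiles` is made up for illustration and is not part of RDKit; the two isomer SMILES are the ones from Gustavo's original message, and `cis_again` is the same cis string written with a different ring-closure digit.

```python
from rdkit import Chem


def dedupe_by_canonical_smiles(smiles_list):
    """Return one input SMILES per unique structure, keeping first-seen order.

    Hypothetical helper for illustration; not an RDKit function.
    """
    seen = {}
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue  # skip input that RDKit cannot parse
        # Canonical SMILES with stereo information kept, used as the hash key
        key = Chem.MolToSmiles(mol, isomericSmiles=True)
        seen.setdefault(key, smi)
    return list(seen.values())


# Isomers from the original message (they share one InChIKey)
cis = "C/N=C(/NC#N)NCCSCc1nc[nH]c1C"
trans = "C/N=C(\\NC#N)NCCSCc1nc[nH]c1C"
# Same cis structure, written with ring-closure digit 2 instead of 1
cis_again = "C/N=C(/NC#N)NCCSCc2nc[nH]c2C"

unique = dedupe_by_canonical_smiles([cis, trans, cis_again])
print(len(unique), unique)
```

The key keeps the double-bond stereo markers, so the two isomers should stay distinct while a rewritten copy of the same isomer collapses onto the first one seen.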
-P.

On Sun, Oct 25, 2020 at 10:53 AM Adelene LAI <adelene....@uni.lu> wrote:

> Hi Gustavo,
>
> (Sorry, forgot to reply all before...)
>
> Your deduplication task is quite familiar to me and something I do quite a
> lot of in my own work ;)
>
> Can I suggest deduplicating using Canonical SMILES?
>
> It doesn't solve your InChIKey issue, but it is a solution for now.
>
> I updated my gist to show that it is feasible:
>
> https://gist.github.com/adelenelai/59a8794e1f030941c19bcb50aa8adf3f
>
> Adelene
>
> Doctoral Researcher
> Environmental Cheminformatics
> UNIVERSITÉ DU LUXEMBOURG
> LUXEMBOURG CENTRE FOR SYSTEMS BIOMEDICINE
> 6, avenue du Swing, L-4367 Belvaux
> T +356 46 66 44 67 18
> GitHub: adelenelai
>
> ------------------------------
> *From:* Gustavo Seabra <gustavo.sea...@gmail.com>
> *Sent:* Sunday, October 25, 2020 2:27:15 PM
> *To:* Adelene LAI
> *Subject:* Re: [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI Key
>
> Actually, I was trying to generate all stereoisomers for molecules in a
> database, and filter duplicate molecules by using the InChI Key to detect
> duplicates. But it gives cis/trans isomers on sp2-N the same Key.
>
> Gustavo.
>
> --
> Gustavo Seabra
>
> ------------------------------
> *From:* Adelene LAI <adelene....@uni.lu>
> *Sent:* Sunday, October 25, 2020 1:44:01 AM
> *To:* Gustavo Seabra <gustavo.sea...@gmail.com>
> *Subject:* Re: [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI Key
>
> Hi Gustavo,
>
> It occurred to me while swimming yesterday - was there a reason you
> pointed out the hybridisation state of N in your original subject text?
>
> Was it just to specify which N to focus on, or did you expect something
> special about sp2 hybridisation wrt InChIKey?
>
> Adelene
>
> ------------------------------
> *From:* Gustavo Seabra <gustavo.sea...@gmail.com>
> *Sent:* Saturday, October 24, 2020 5:37:09 AM
> *To:* RDKit Discuss; Adelene LAI
> *Subject:* Re: [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI Key
>
> Thanks for looking into it. I'm happy to see it wasn't just a mistake by
> me ;-)
>
> I hope we can find what's wrong there.
>
> Best,
> Gustavo.
>
> --
> Gustavo Seabra
>
> ------------------------------
> *From:* Adelene LAI <adelene....@uni.lu>
> *Sent:* Friday, October 23, 2020 11:28:55 PM
> *To:* Gustavo Seabra <gustavo.sea...@gmail.com>; RDKit Discuss <rdkit-discuss@lists.sourceforge.net>
> *Subject:* Re: [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI Key
>
> Hi Gustavo,
>
> https://gist.github.com/adelenelai/59a8794e1f030941c19bcb50aa8adf3f
>
> In the gist above, I tried doing some further investigating.
>
> It seems for the example you gave, the rdkit functions indeed give the
> same inchikey and inchi, but different aux info.
>
> Why this different aux info doesn't translate into different
> inchikeys/inchis, I'm not sure.
>
> Adelene
>
> ------------------------------
> *From:* Gustavo Seabra <gustavo.sea...@gmail.com>
> *Sent:* Friday, October 23, 2020 6:43:07 PM
> *To:* RDKit Discuss
> *Subject:* [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI Key
>
> Hi all,
>
> I run into an issue here, and I'd appreciate your input. I noticed that
> compounds that differ only on the cis-trans isomerization around an sp2
> nitrogen get the same InChI Key from RDKit. For example:
>
> inchi_cis = Chem.inchi.MolToInchiKey(Chem.MolFromSmiles("C/N=C(/NC#N)NCCSCc1nc[nH]c1C"))
> inchi_cis
> 'AQIXAKUUQRKLND-UHFFFAOYSA-N'
>
> inchi_trans = Chem.inchi.MolToInchiKey(Chem.MolFromSmiles("C/N=C(\\NC#N)NCCSCc1nc[nH]c1C"))
> inchi_trans
> 'AQIXAKUUQRKLND-UHFFFAOYSA-N'
>
> inchi_cis == inchi_trans
> True
>
> I wonder if this is a limitation of the InChI Key definition, or an
> implementation issue.
>
> Thanks a lot,
> --
> Gustavo Seabra.
> _______________________________________________
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss