Hello Theo,

in my experience, something like approach 3 is quite safe. I am not sure how computationally efficient it is in RDKit, but I suppose you'd rather have an accurate slow method than a fast one that is often wrong, right?
In short: in the vast majority of cases, a literal string match of the InChIKey means two molecules are 'the same'. When the two InChIKeys do NOT match, it does not necessarily mean the molecules are 'not the same'; it depends on how the InChIKey is calculated (see further down).

For some time we used the InChIKey (not the InChI string) + chirality flag as a ~unique identifier of a molecule. If you start from a SMILES rather than a CTAB, you don't need to worry about the chirality flag.

Someone calculated how likely it is that two different molecules give the same InChIKey, and it seems to be extremely rare. There could be a problem if you started to look at huge combinatorial sets of billions of molecules, where even that very rare occurrence might materialise.

The SMILES comparison, even with canonical SMILES, may often fail because of different tautomeric forms. And indeed, the important part is how the InChIKey is calculated. The software we use (Biovia) 'knows' about the most important tautomers, like pyrazole, 2- and 4-pyridones / hydroxypyridines, etc., so for instance:

[inline image not preserved]

Obviously, if the InChIKey calculation in RDKit missed that, two different representations of 'the same' molecule would give different InChIKeys, just as they would give different canonical SMILES. I have not yet looked at RDKit's InChIKey calculator; perhaps you already know about these aspects.

But this is a very subtle point. Tautomers are not like resonance structures, which differ only in the placement of electrons; they are formally distinct molecules, which interconvert only by moving atoms around (in the above example, a hydrogen atom). So in a way, the SMILES is 'correct' in saying that these are two different molecules. It is only because we know from chemistry that the interconversion between them is fast, and that an equilibrium is reached in the media we are usually interested in, that we consider them 'the same'.

I hope this helps.
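To check how RDKit behaves on the pyridone / hydroxypyridine case, something like the following minimal sketch could be used. It is not taken from the thread: the helper names `canonical_smiles` and `inchikey` and the two example SMILES are my own choices, and it assumes an RDKit build with InChI support. I have deliberately not written down what the InChIKey comparison prints, since that is exactly the point to verify on your own installation.

```python
from rdkit import Chem


def _parse(smiles: str) -> Chem.Mol:
    """Parse a SMILES string, raising instead of returning None on failure."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    return mol


def canonical_smiles(smiles: str) -> str:
    """Canonical SMILES as written by RDKit (no tautomer normalisation)."""
    return Chem.MolToSmiles(_parse(smiles))


def inchikey(smiles: str) -> str:
    """Standard InChIKey as computed through RDKit's InChI wrapper."""
    return Chem.MolToInchiKey(_parse(smiles))


# Two tautomeric drawings of 'the same' compound (illustrative input).
hydroxypyridine = "Oc1ccccn1"
pyridone = "O=c1cccc[nH]1"

# Canonical SMILES keeps the hydrogen where it was drawn.
print("canonical SMILES equal:",
      canonical_smiles(hydroxypyridine) == canonical_smiles(pyridone))

# Whether the InChIKeys agree depends on the InChI tautomer handling.
print("InChIKeys equal:",
      inchikey(hydroxypyridine) == inchikey(pyridone))
```

If the InChIKeys turn out to differ for a tautomer pair you care about, RDKit's rdMolStandardize module has a tautomer canonicaliser that could be applied before computing the key; I have not tested how well it covers the cases above.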
Giovanni

-----Original Message-----
From: theozh <the...@gmx.net>
Sent: 30 September 2021 08:44
To: rdkit-discuss@lists.sourceforge.net
Subject: [Rdkit-discuss] What is the most efficient way to check for exact match with RDKit?

Dear all,

it looks like a simple/stupid question... but I haven't found (or have overlooked) an example in the RDKit cookbook. What is the intended (and most efficient) way in RDKit to search for identity, i.e. an exact match?

I already asked this question here:
https://stackoverflow.com/questions/60211666/rdkit-how-to-check-molecules-for-exact-match
and got some answers, but maybe the RDKit mailing-list audience has other (more efficient) solutions?

Assumption: SMILES A and SMILES B.

Approach 1: If A is a substructure of B and B is a substructure of A, then the structures are identical.
Approach 2: Create canonical SMILES of A and B and do a string comparison.
Approach 3: (not sure whether this will work) Create the InChI of A and B. Would a simple string comparison work here as well?

So, if I have a given list of structures, I could generate a canonical SMILES list (or maybe an InChI list?) once and do a simple string comparison. Would this be the most efficient way to check whether a certain structure is in the list?

Thank you for any comments, hints, suggestions.

Theo.
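For reference, the three approaches from the question, plus the "precompute once, then compare strings" idea, could be sketched in RDKit roughly as below. This is only an illustration under assumptions: the function names (`same_by_substructure`, `build_index`, and so on) and the small example list are hypothetical, an RDKit build with InChI support is assumed, and the index uses InChIKeys rather than full InChI strings.

```python
from rdkit import Chem


def parse(smiles: str) -> Chem.Mol:
    """Parse a SMILES string, raising instead of returning None on failure."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    return mol


def same_by_substructure(a: Chem.Mol, b: Chem.Mol) -> bool:
    """Approach 1: A is a substructure of B and B is a substructure of A.

    HasSubstructMatch ignores stereochemistry unless useChirality=True
    is passed, so stereoisomers match each other here.
    """
    return a.HasSubstructMatch(b) and b.HasSubstructMatch(a)


def same_by_canonical_smiles(a: Chem.Mol, b: Chem.Mol) -> bool:
    """Approach 2: string comparison of RDKit canonical SMILES."""
    return Chem.MolToSmiles(a) == Chem.MolToSmiles(b)


def same_by_inchikey(a: Chem.Mol, b: Chem.Mol) -> bool:
    """Approach 3: string comparison of InChIKeys."""
    return Chem.MolToInchiKey(a) == Chem.MolToInchiKey(b)


def build_index(smiles_list):
    """Compute the InChIKey of every structure once and keep them in a set."""
    return {Chem.MolToInchiKey(parse(s)) for s in smiles_list}


def in_index(index, smiles: str) -> bool:
    """Membership test: one key calculation plus a set lookup."""
    return Chem.MolToInchiKey(parse(smiles)) in index


# Illustrative data: ethanol written two ways, plus two other structures.
library = ["CCO", "c1ccccc1", "CC(=O)O"]
index = build_index(library)
print(in_index(index, "OCC"))
print(in_index(index, "CCC"))
```

With a precomputed set, each lookup costs one key calculation for the query and a hash lookup, independent of the size of the list, whereas approach 1 needs a pairwise comparison against every entry.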
_______________________________________________ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss