Hello Theo,

I cannot immediately think of any other obvious way to check molecular identity, but please note that my feedback was more about the chemistry aspects than about RDKit.
Assuming that the methods you listed are the main/only available ones, I suppose the only thing to do is to try them out on some test dataset and see which is fastest, and especially whether the results are the same or not. 'Efficient' is usually associated with 'fast', but personally I would always keep a close eye on accuracy, meaning I would not go for a fast method just because it's fast, before I know it gives the answer I want.

E.g. bidirectional substructure matching might or might not be faster, no idea, but how is it going to handle tautomers? And more importantly, do *you* want tautomers (which are usually assumed to interconvert easily) to be considered the same molecule or not? If the answer is no, maybe the substructure method is indeed the best for you, and InChIKeys would actually hide formal differences that you want to detect.

As for InChI strings vs InChIKeys, I can only say that the InChIKey was recommended to me as a 'practical' text identifier when I started working in chemoinformatics, and it seemed to work well, together with the chirality flag, e.g. to match molecules from catalogues with molecules that people in my company wanted to acquire. It worked better than SMILES, for the reasons I mentioned, because for us it was more damaging to have a false negative (not finding a molecule that was there, because of some formal difference that made it look non-identical) than a false positive (thinking that a molecule was there when in fact it was something else; easy to verify a posteriori, and it never happened anyway).

But no, I have no other particular reason to prefer InChIKeys to InChI strings. I did not test the latter though, so I have no idea whether they would give the same answers as the former, or how much slower or faster the string comparisons would be. In fact, as InChIKeys have a fixed length whereas InChI strings don't appear to, who knows, maybe InChI strings are even better, as you would immediately know two molecules are not the same by just looking at their lengths.
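A minimal sketch of such a comparison on a test dataset, assuming each candidate method is wrapped as a function that takes two structure strings and returns True or False. The names `compare_checks`, `exact_text` and `same_fragments` are hypothetical, and the two checks are pure string stand-ins with no chemistry in them; in real use they would presumably wrap the RDKit-based methods under discussion (InChIKey comparison, canonical SMILES comparison, bidirectional substructure match). The harness reports a timing per method and, more importantly, the pairs on which the methods disagree.

```python
import time
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]
Check = Callable[[str, str], bool]


def compare_checks(checks: Dict[str, Check], pairs: List[Pair]):
    """Run every check on every pair; return per-check timings and disagreements."""
    verdicts: Dict[str, List[bool]] = {}
    timings: Dict[str, float] = {}
    for name, check in checks.items():
        start = time.perf_counter()
        verdicts[name] = [check(a, b) for a, b in pairs]
        timings[name] = time.perf_counter() - start

    disagreements = []
    for i, pair in enumerate(pairs):
        row = {name: verdicts[name][i] for name in checks}
        if len(set(row.values())) > 1:  # the checks do not all agree on this pair
            disagreements.append((pair, row))
    return timings, disagreements


# Stand-in checks (assumptions for illustration only, not real identity tests).
def exact_text(a: str, b: str) -> bool:
    return a == b


def same_fragments(a: str, b: str) -> bool:
    # Treats '.'-separated parts as an unordered collection of fragments.
    return sorted(a.split(".")) == sorted(b.split("."))


pairs = [("CCO", "CCO"), ("CCO.Cl", "Cl.CCO"), ("CCO", "CCN")]
timings, disagreements = compare_checks(
    {"exact_text": exact_text, "same_fragments": same_fragments}, pairs
)
for pair, row in disagreements:
    print(pair, row)
```

On this toy data only the pair that differs in fragment order is reported as a disagreement. Those are the cases worth inspecting by eye before looking at the timings, since a faster method is of little use if it gives a different answer from the one you want.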
All to be tested...
https://en.wikipedia.org/wiki/International_Chemical_Identifier

Another thing I should probably mention, if I have not already: I am not sure what datasets you are going to process, but be careful about salts and solvates, too. The substance in a data record is sometimes not a single entity but a mixture of fragments, one of which is probably the 'main' molecule you are interested in, the rest being other molecules that are 'not important', say for biological activity, but that make a difference to the formal representation of the substance (and indicate formally distinct compositions). You can detect the presence of multiple fragments in a SMILES by the presence of a dot '.' between them.

Example: is pyridine.HCl the same as pyridine, for your purposes, because HCl is not important? If the answer is no, fine, you want a fully formal match; if it is yes, even the bidirectional substructure method will give you a false negative.

Sorry if this is all obvious to you; no harm though, as other people who are not chemists might be reading this at some point.

brg
Giovanni

-----Original Message-----
From: theozh <the...@gmx.net>
Sent: 05 October 2021 10:07
To: rdkit-discuss@lists.sourceforge.net
Subject: Re: [Rdkit-discuss] What is the most efficient way to check for exact match with RDKit?

Dear Giovanni,

thank you for your explanations and advice. So, I just wanted to rule out that I had missed a very basic function for checking identity. You are suggesting using InChIKeys (with the very low probability of having the same InChIKey for different molecules). Then, what would be the disadvantage of using InChI strings instead of InChIKeys? Computation time & power?
The response I got from StackOverflow was that the substructure approach was a little faster than the Canonical SMILES approach.
I would assume that a simple string comparison within a fixed set of structures is much faster than calculating the Canonical SMILES again and again for each search.
So, I will check the InChI approach and compare it with the other approaches.

Thanks, Theo.

_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss