Hello Theo,
I cannot immediately think of any other obvious way to check molecular 
identity, but please note that my feedback was more about the chemistry aspects 
than about RDKit.

Assuming that the methods you listed are the main/only available ones, I 
suppose the only thing to do is to try them out on a test dataset and see 
which is fastest and, above all, whether they give the same results.
'Efficient' is usually associated with 'fast', but personally I would always 
keep a close eye on accuracy, meaning I would not go for a fast method just 
because it's fast / before I know it gives the answer I want.
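
To make that test concrete, here is a minimal sketch of such a comparison. It
assumes an RDKit build with InChI support; the helper names (same_by_smiles,
same_by_inchikey, same_by_substructure, run_all) and the three test pairs are
made up for illustration, and a real test set should of course be much larger
and include the difficult cases (tautomers, salts, stereoisomers).

```python
import timeit
from rdkit import Chem

def same_by_smiles(a, b):
    # Canonical SMILES (RDKit default output) compared as plain strings.
    return Chem.MolToSmiles(a) == Chem.MolToSmiles(b)

def same_by_inchikey(a, b):
    # Requires RDKit to be built with InChI support.
    return Chem.MolToInchiKey(a) == Chem.MolToInchiKey(b)

def same_by_substructure(a, b):
    # "Bidirectional substructure": a is contained in b and b in a.
    return (a.HasSubstructMatch(b, useChirality=True)
            and b.HasSubstructMatch(a, useChirality=True))

METHODS = {
    "canonical SMILES": same_by_smiles,
    "InChIKey": same_by_inchikey,
    "bidirectional substructure": same_by_substructure,
}

# Illustrative pairs only: two that should match, one that should not.
TEST_PAIRS = [("OCC", "CCO"), ("c1ccccc1", "C1=CC=CC=C1"), ("CCO", "CCC")]

def run_all(pairs=TEST_PAIRS):
    mols = [(Chem.MolFromSmiles(x), Chem.MolFromSmiles(y)) for x, y in pairs]
    return {name: [fn(a, b) for a, b in mols] for name, fn in METHODS.items()}

results = run_all()
for name, answers in results.items():
    # Check agreement between methods first, speed second.
    seconds = timeit.timeit(lambda: run_all(), number=50)
    print(f"{name}: {answers} (all three methods, 50 runs: {seconds:.3f} s)")
```

The timing printed here covers all three methods together and is only meant
to show where a timer would go; to rank the methods, time each function
separately on your own data.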

E.g. bidirectional substructure might or might not be faster, no idea, but how 
is it going to handle tautomers?
And more importantly, do *you* want tautomers (that are usually assumed to be 
easily interconverting) to be considered the same molecule or not? If the 
answer is no, maybe the substructure method is indeed the best for you, and 
InChIKeys would actually hide formal differences that you want to detect.
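
A small sketch of the tautomer question, using 2-hydroxypyridine and
2-pyridone as an example pair that I picked for illustration (it is not from
your dataset). It assumes an RDKit build with InChI support and the standard
InChI options; the variable names are arbitrary.

```python
from rdkit import Chem

# Two tautomers of the same compound, written as different formal structures.
hydroxy = Chem.MolFromSmiles("Oc1ccccn1")        # 2-hydroxypyridine
pyridone = Chem.MolFromSmiles("O=c1cccc[nH]1")   # 2-pyridone

# Canonical SMILES keeps the formal difference.
smiles_equal = Chem.MolToSmiles(hydroxy) == Chem.MolToSmiles(pyridone)

# Bidirectional substructure also keeps it (the C-O bond order differs).
substructure_equal = (hydroxy.HasSubstructMatch(pyridone)
                      and pyridone.HasSubstructMatch(hydroxy))

# The standard InChI treats this hydrogen as mobile, so the keys coincide.
inchikey_equal = Chem.MolToInchiKey(hydroxy) == Chem.MolToInchiKey(pyridone)

print("canonical SMILES equal:", smiles_equal)
print("bidirectional substructure equal:", substructure_equal)
print("InChIKey equal:", inchikey_equal)
```

Note that the standard InChI does not merge every kind of tautomer, so this
one pair should not be read as a general guarantee.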

As for InChI strings vs InChIKeys, I can only say that the InChIKey was 
advised to me as a 'practical' text identifier when I started working in 
chemoinformatics, and it seemed to work well, together with the chirality flag, 
e.g. to match molecules from catalogues with molecules that people in my 
company wanted to acquire. It worked better than SMILES, for the reasons I 
mentioned: for us a false negative (missing a molecule that was there, because 
some formal difference made the two representations non-identical) was more 
damaging than a false positive (thinking that a molecule was there when in 
fact it was something else; easy to verify a posteriori, and it never happened 
anyway).
But no, I have no other particular reason to prefer InChIKeys to InChI 
strings. I did not test the latter though, so I have no idea if they would 
give the same answers as the former, or how much slower or faster string 
comparisons would be. In fact, as InChIKeys have a fixed length whereas InChI 
strings don't appear to, who knows, maybe InChI strings are even better, as 
you would immediately know two molecules are not the same by just looking at 
their lengths.
All to be tested...
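
As a starting point for that test, a short sketch that prints both
identifiers side by side. It assumes an RDKit build with InChI support; the
three molecules are arbitrary examples, and the list name is made up.

```python
from rdkit import Chem

EXAMPLES = ["CCO", "c1ccncc1", "CC(=O)Oc1ccccc1C(=O)O"]

records = []
for smi in EXAMPLES:
    mol = Chem.MolFromSmiles(smi)
    inchi = Chem.MolToInchi(mol)        # variable-length, human-decodable
    key = Chem.MolToInchiKey(mol)       # fixed-length hash of the InChI
    records.append((smi, inchi, key))
    print(f"{smi}\n  InChI    ({len(inchi):3d} chars): {inchi}"
          f"\n  InChIKey ({len(key):3d} chars): {key}")

# The key can also be derived from an InChI string that is already stored.
keys_from_inchi = [Chem.InchiToInchiKey(inchi) for _, inchi, _ in records]
```

Because the key is a hash of the string, comparing InChI strings avoids the
(very unlikely) key collision, at the cost of longer strings to store and
compare.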

https://en.wikipedia.org/wiki/International_Chemical_Identifier

Another thing I should probably mention, if I have not already: not sure what 
datasets you are going to process, but be careful about salts and solvates, too.
The substance you have in the data record is sometimes not a single entity, but 
a mixture of fragments, one of which is probably the 'main' molecule you are 
interested in, the rest being other molecules that are 'not important', say for 
biological activity etc, but make a difference to the formal representation of 
the substance (and indicate formally distinct compositions).
You can detect the presence of multiple fragments in a SMILES by the presence 
of a dot '.' between them.
Example: is pyridine.HCl the same as pyridine, for your purposes, because HCl 
is not important? If the answer is no, fine, you want a fully formal match; if 
it is yes, even the bidirectional substructure method will give you a false 
negative.
Sorry if this is all obvious to you; no harm though, as other people who are 
not chemists might be reading this at some point.
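
For the pyridine example, a minimal sketch of what happens with and without
stripping the extra fragment. The helper names are made up, "largest fragment
by heavy-atom count" is just one simple assumption for picking the 'main'
molecule, and HCl is written here as a neutral "Cl" fragment.

```python
from rdkit import Chem

def has_multiple_fragments(smiles):
    # Quick text check; a dot normally separates disconnected fragments.
    return "." in smiles

def largest_fragment(mol):
    # Assumption: the 'main' molecule is the one with most heavy atoms.
    frags = Chem.GetMolFrags(mol, asMols=True)
    return max(frags, key=lambda m: m.GetNumHeavyAtoms())

def bidirectional_match(a, b):
    return a.HasSubstructMatch(b) and b.HasSubstructMatch(a)

salt_smiles = "c1ccncc1.Cl"              # pyridine plus HCl
salt = Chem.MolFromSmiles(salt_smiles)
pyridine = Chem.MolFromSmiles("c1ccncc1")

# Fully formal comparison: the HCl fragment makes the records different.
formal_match = bidirectional_match(salt, pyridine)

# Comparison after keeping only the largest fragment.
parent_match = bidirectional_match(largest_fragment(salt), pyridine)

print("multiple fragments:", has_multiple_fragments(salt_smiles))
print("formal match:", formal_match)
print("match after stripping:", parent_match)
```

If the salt is stored in its charged form (pyridinium chloride), stripping
the counter-ion alone is not enough, because the remaining cation is still
formally different from neutral pyridine.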

brg
Giovanni

-----Original Message-----
From: theozh <the...@gmx.net>
Sent: 05 October 2021 10:07
To: rdkit-discuss@lists.sourceforge.net
Subject: Re: [Rdkit-discuss] What is the most efficient way to check for exact 
match with RDKit?

Dear Giovanni,

thank you for your explanations and advice. I just wanted to rule out that I 
had missed a very basic function for checking identity.

You are suggesting using InChIKeys (with a very low probability of two 
different molecules having the same InChIKey).
Then, what would be the disadvantage of using InChI strings instead of 
InChIKeys? Computation time & power?

The response I got from StackOverflow was that the substructure approach was a 
little faster than the canonical SMILES approach.
I would assume that a simple string comparison within a fixed set of structures 
is much faster than calculating the canonical SMILES again and again for each 
search.
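
A rough sketch of that idea: compute the identifier once per structure in the
fixed set, then each search is a plain dictionary lookup. The function names
and the small library are made up for illustration, and the InChIKey is just
one possible choice of identifier (it assumes an RDKit build with InChI
support).

```python
from collections import defaultdict
from rdkit import Chem

def build_index(smiles_list):
    # One identifier calculation per library structure, done only once.
    index = defaultdict(list)
    for position, smi in enumerate(smiles_list):
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue  # skip unparsable entries
        index[Chem.MolToInchiKey(mol)].append(position)
    return index

def find_matches(index, query_smiles):
    # Only the query needs a new identifier; the rest is a string lookup.
    mol = Chem.MolFromSmiles(query_smiles)
    if mol is None:
        return []
    return index.get(Chem.MolToInchiKey(mol), [])

LIBRARY = ["CCO", "c1ccccc1", "CC(=O)O", "OCC"]
index = build_index(LIBRARY)

hits = find_matches(index, "C(C)O")    # ethanol, written differently
misses = find_matches(index, "CCN")    # not in the library
print("ethanol found at positions:", hits)
print("ethylamine found at positions:", misses)
```

The same layout works with canonical SMILES or InChI strings as dictionary
keys, which makes it easy to swap identifiers when comparing the approaches.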

So, I will check the InChI approach and compare it with the other approaches.

Thanks,
Theo.


_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

