Re: [Rdkit-discuss] comparing two or more tables of molecules
Well, since George mentioned a talk of mine: I wish we had implemented our tool back then using an open-source toolkit like RDKit (which wasn't very well known at the time), and had also been smart enough to use SMARTS for the transformation rules (some of them are implemented as SMARTS, but large parts rely on other CACTVS scripting functionality).

I still intend to continue/advance (whatever) this work and make it openly available, but I must admit that the intention is quite vague at the moment.

Markus
Re: [Rdkit-discuss] comparing two or more tables of molecules
Thanks for the interesting links.

MolVS looks good, but it failed on ‘NC(CC(=O)O)C(=O)[O-].O.O.[Na+]’, which isn’t that extraordinary…

I couldn’t get Standardise to work at all, even on the example given; either the API is not intuitive or the docs are wrong or out of date.

I will have a look at the info in the UniChem paper, though I'm not inclined to use a web service for what I want to do.

Cheers,
Steve.

From: George Papadatos [mailto:gpapada...@gmail.com]
Sent: 01 December 2016 14:26
To: Greg Landrum
Cc: Stephen O'hagan; rdkit-discuss@lists.sourceforge.net; Francis Atkinson
Subject: Re: [Rdkit-discuss] comparing two or more tables of molecules

Hi Stephen,

Further to Greg's excellent reply, see this paper on how InChI strings and keys can be used in practice to map together tautomer (ones covered by InChI at least), isotope, stereo and parent-salt variants:
http://rd.springer.com/article/10.1186/s13321-014-0043-5
Francis (cc'ed) has a nice notebook somewhere illustrating these InChI splits to find these variants.

For educational purposes, there have been other approaches like the NCI's identifiers - discussion here:
http://acscinf.org/docs/meetings/237nm/presentations/237nm17.pdf

For pure structure standardization using RDKit see here:
https://github.com/flatkinson/standardiser
and
https://github.com/mcs07/MolVS

Cheers,
George

On 29 November 2016 at 17:02, Greg Landrum wrote:

Wow, this is a great question and quite a fun thread.

It's hard to really make much of a contribution here without writing a book/review article (something that I'm really not willing to do!), but I have a few thoughts. Most of this is repeating/rephrasing things others have already said.

I'm going to propose some things as facts. I think that these won't be controversial:

fact 1: if the structures are coming from different sources, they need to be standardized/normalized before you compare them. This is true regardless of how you want to compare them.
The details of the standardization process are not incredibly important, but it does need to take care of the things you care about when comparing molecules. For example, if you don't care about differences between salts, it should strip salts. If you don't care about differences between tautomers, it should normalize tautomers.
fact 2: The InChI algorithm includes a standardization step that normalizes some tautomers, but does not remove salts.
fact 3: The InChI representation contains a number of layers defining the structure in increasing detail (this isn't strictly true, because some of the choices about how layers are ordered are arbitrary, but it's close).
fact 4: canonicalization, the way I define it, produces a canonical atom numbering for a given structure, but it does *not* standardize.
fact 5: the RDKit has essentially no well-documented standardization code.

fact X: we don't have any standard, broadly accepted approach for standardization, canonicalization or representation that is fool-proof or that works for even all of organic chemistry, never mind organometallics. InChI, useful as it is for some things, completely fails to handle things like atropisomers (they are working on this kind of thing, but it's not out yet).

Given all of this, if I wanted to have flexible duplicate checking *right* now, I think I would use the AvalonTools struchk functionality that the RDKit provides (the new pure-RDKit version still needs a bit more testing) to handle basic standardization and salt stripping and then produce a table that includes the InChI in a couple of different forms. I'd want to be able to recognize molecules that differ only by stereochemistry, molecules that differ only by location of tautomeric Hs, and molecules that differ only by the location of isotopic labels. You can do this with various clever splits of the InChI (how to do it is left as an exercise for the reader and/or a future RDKit blog post).
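
As a rough illustration of the kind of InChI splits described above, here is a minimal sketch in plain Python. It is not code from the thread or from the RDKit: the function names are made up for this example, and it assumes standard InChI strings ("InChI=1S/...") without fixed-H or reconnected layers, where everything after the /i layer belongs to the isotopic layer. The "skeleton" key (formula plus /c layer) is only a coarse stand-in for "differs by the location of Hs" and will also merge things that are not tautomers in any strict sense. The example strings were written by hand for illustration; in practice they would be generated, for instance with RDKit's Chem.MolToInchi on the standardized molecules.

```python
# Sketch only: string-level splits of standard InChI strings.
# All function names here are hypothetical, chosen for this example.

STEREO_PREFIXES = ("b", "t", "m", "s")  # double-bond, tetrahedral, and related stereo layers


def _split(inchi):
    """Split an InChI into '/'-separated layers after checking the header."""
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string: %r" % (inchi,))
    # parts[0] is the version ("InChI=1S"), parts[1] the formula layer.
    return inchi.split("/")


def without_isotopes(inchi):
    """Drop the /i layer and everything after it (assumes standard InChI)."""
    parts = _split(inchi)
    kept = parts[:2]
    for layer in parts[2:]:
        if layer.startswith("i"):
            break
        kept.append(layer)
    return "/".join(kept)


def without_stereo(inchi):
    """Drop all stereo layers (/b, /t, /m, /s), including isotopic ones."""
    parts = _split(inchi)
    kept = parts[:2] + [p for p in parts[2:] if not p.startswith(STEREO_PREFIXES)]
    return "/".join(kept)


def skeleton(inchi):
    """Keep only the formula and the /c (connectivity) layer."""
    parts = _split(inchi)
    kept = parts[:2] + [p for p in parts[2:] if p.startswith("c")]
    return "/".join(kept)


def variant_keys(inchi):
    """Several forms of one InChI, suitable as extra columns in a table."""
    return {
        "full": inchi,
        "no_stereo": without_stereo(inchi),
        "no_isotopes": without_isotopes(inchi),
        "no_stereo_no_isotopes": without_stereo(without_isotopes(inchi)),
        "skeleton": skeleton(inchi),
    }


if __name__ == "__main__":
    # Hand-written example strings for the two alanine enantiomers.
    l_ala = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
    d_ala = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"
    print(variant_keys(l_ala)["no_stereo"] == variant_keys(d_ala)["no_stereo"])
```

Two rows whose "full" keys differ but whose "no_stereo" keys match would then be flagged as stereo variants of each other, and likewise for the other columns.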
I think there's something fun to be done here with SMILES variants, borrowing heavily from some of the things that Roger has written about:
https://nextmovesoftware.com/blog/2013/04/25/finding-all-types-of-every-mer/
Here's a more recent application of that from Noel:
https://nextmovesoftware.com/blog/2016/06/22/fishing-for-matched-series-in-a-sea-of-structure-representations/

If I didn't really care about details and just wanted something that I could explain easily to others, I'd skip all the complication and just use InChIs (or InChI keys) to recognize duplicates. There would be times when that would be the wrong answer, but it would be a broadly accepted kind of wrong.[1]

Regardless of the approach, I would not, under most any circumstances, discard the original input structures that I had. It's really good to be able to figure out what the original data looked like later.

-greg

[1] I'm crying as I write this...
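
The simple variant described above (match on InChI, never throw away the original input) could look roughly like the following sketch. The row layout with "id", "original" and "inchi" fields and the function name are assumptions made for this example, not anything prescribed in the thread, and the InChI strings are assumed to have been computed beforehand from standardized structures.

```python
from collections import defaultdict


def shared_structures(table_a, table_b, key=lambda inchi: inchi):
    """Group rows of two tables by a key derived from each row's InChI.

    Each row is a dict with (hypothetical) fields "id", "original" and
    "inchi". Rows are stored untouched, so the original input structure
    stays available next to whatever key was used for matching.
    Returns only the groups that have members in both tables.
    """
    groups = defaultdict(lambda: {"a": [], "b": []})
    for label, table in (("a", table_a), ("b", table_b)):
        for row in table:
            groups[key(row["inchi"])][label].append(row)
    return {k: g for k, g in groups.items() if g["a"] and g["b"]}


# Small hand-made example tables; "original" holds the structure as supplied.
ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
ACETIC_ACID = "InChI=1S/C2H4O2/c1-2(3)4/h1H3,(H,3,4)"

table_a = [
    {"id": "A1", "original": "CCO", "inchi": ETHANOL},
    {"id": "A2", "original": "CC(=O)O", "inchi": ACETIC_ACID},
]
table_b = [
    {"id": "B7", "original": "OCC", "inchi": ETHANOL},
]

if __name__ == "__main__":
    for inchi, group in shared_structures(table_a, table_b).items():
        ids_a = [row["id"] for row in group["a"]]
        ids_b = [row["id"] for row in group["b"]]
        print(inchi, ids_a, ids_b)
```

The `key` argument is where a looser comparison would plug in, for example a function that masks the stereo layers before matching.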
Re: [Rdkit-discuss] comparing two or more tables of molecules
Hi Stephen,

Further to Greg's excellent reply, see this paper on how InChI strings and keys can be used in practice to map together tautomer (ones covered by InChI at least), isotope, stereo and parent-salt variants:
http://rd.springer.com/article/10.1186/s13321-014-0043-5
Francis (cc'ed) has a nice notebook somewhere illustrating these InChI splits to find these variants.

For educational purposes, there have been other approaches like the NCI's identifiers - discussion here:
http://acscinf.org/docs/meetings/237nm/presentations/237nm17.pdf

For pure structure standardization using RDKit see here:
https://github.com/flatkinson/standardiser
and
https://github.com/mcs07/MolVS

Cheers,
George

On 29 November 2016 at 17:02, Greg Landrum wrote:

> Wow, this is a great question and quite a fun thread.
>
> It's hard to really make much of a contribution here without writing a book/review article (something that I'm really not willing to do!), but I have a few thoughts. Most of this is repeating/rephrasing things others have already said.
>
> I'm going to propose some things as facts. I think that these won't be controversial:
> fact 1: if the structures are coming from different sources, they need to be standardized/normalized before you compare them. This is true regardless of how you want to compare them. The details of the standardization process are not incredibly important, but it does need to take care of the things you care about when comparing molecules. For example, if you don't care about differences between salts, it should strip salts. If you don't care about differences between tautomers, it should normalize tautomers.
> fact 2: The InChI algorithm includes a standardization step that normalizes some tautomers, but does not remove salts.
> fact 3: The InChI representation contains a number of layers defining the structure in increasing detail (this isn't strictly true, because some of the choices about how layers are ordered are arbitrary, but it's close).
> fact 4: canonicalization, the way I define it, produces a canonical atom numbering for a given structure, but it does *not* standardize.
> fact 5: the RDKit has essentially no well-documented standardization code.
>
> fact X: we don't have any standard, broadly accepted approach for standardization, canonicalization or representation that is fool-proof or that works for even all of organic chemistry, never mind organometallics. InChI, useful as it is for some things, completely fails to handle things like atropisomers (they are working on this kind of thing, but it's not out yet).
>
> Given all of this, if I wanted to have flexible duplicate checking *right* now, I think I would use the AvalonTools struchk functionality that the RDKit provides (the new pure-RDKit version still needs a bit more testing) to handle basic standardization and salt stripping and then produce a table that includes the InChI in a couple of different forms. I'd want to be able to recognize molecules that differ only by stereochemistry, molecules that differ only by location of tautomeric Hs, and molecules that differ only by the location of isotopic labels. You can do this with various clever splits of the InChI (how to do it is left as an exercise for the reader and/or a future RDKit blog post).
>
> I think there's something fun to be done here with SMILES variants, borrowing heavily from some of the things that Roger has written about:
> https://nextmovesoftware.com/blog/2013/04/25/finding-all-types-of-every-mer/
> Here's a more recent application of that from Noel:
> https://nextmovesoftware.com/blog/2016/06/22/fishing-for-matched-series-in-a-sea-of-structure-representations/
>
> If I didn't really care about details and just wanted something that I could explain easily to others, I'd skip all the complication and just use InChIs (or InChI keys) to recognize duplicates. There would be times when that would be the wrong answer, but it would be a broadly accepted kind of wrong.[1]
>
> Regardless of the approach, I would not, under most any circumstances, discard the original input structures that I had. It's really good to be able to figure out what the original data looked like later.
>
> -greg
> [1] I'm crying as I write this...
>
> On Mon, Nov 28, 2016 at 5:25 PM, Stephen O'hagan wrote:
>
>> Has anyone come up with a fool-proof way of matching structurally equivalent molecules?
>>
>> Unique SMILES or InChI string comparisons don’t appear to work, presumably because there are different but equivalent structures, e.g. explicit vs non-explicit H’s, Kekule vs aromatic, isomeric vs non-isomeric forms, tautomers etc.
>>
>> I also expect that comparing InChI strings might need something more than just a simple string comparison, such as masking off stereo information when you don’t care about stereo