Sorry Steve, you ran into a bug in MolVS. It should now be fixed.

Run "pip install -U molvs" to get the update (v0.0.7).

Matt

> On 1 Dec 2016, at 15:52, Stephen O'hagan <soha...@manchester.ac.uk> wrote:
> 
> Thanks for the interesting links.
>  
> MolVS looks good, but failed on ‘NC(CC(=O)O)C(=O)[O-].O.O.[Na+]’ which isn’t 
> that extraordinary…
>  
> Couldn’t get Standardise to work at all, even on the example given; API not 
> intuitive or docs wrong or out of date.
>  
> I will have a look at the info in the UniChem paper, though not inclined to 
> use a web service for what I want to do.
>  
> Cheers,
> Steve.
>  
> From: George Papadatos [mailto:gpapada...@gmail.com] 
> Sent: 01 December 2016 14:26
> To: Greg Landrum <greg.land...@gmail.com>
> Cc: Stephen O'hagan <soha...@manchester.ac.uk>; 
> rdkit-discuss@lists.sourceforge.net; Francis Atkinson <fran...@ebi.ac.uk>
> Subject: Re: [Rdkit-discuss] comparing two or more tables of molecules
>  
> Hi Stephen,
>  
> Further to Greg's excellent reply, see this paper on how InChI strings and 
> keys can be used in practice to map together tautomer (ones covered by InChI 
> at least), isotope, stereo and parent-salt variants. 
> http://rd.springer.com/article/10.1186/s13321-014-0043-5
>  
> Francis (cc'ed) has a nice notebook somewhere illustrating the InChI 
> splits that find these variants.
>  
> For educational purposes, there have been other approaches like the NCI's 
> identifiers - discussion here: 
> http://acscinf.org/docs/meetings/237nm/presentations/237nm17.pdf
>  
> For pure structure standardization using RDKit see here: 
> https://github.com/flatkinson/standardiser
> and 
> https://github.com/mcs07/MolVS
>  
>  
> Cheers, 
>  
> George
>  
>  
>  
>  
> On 29 November 2016 at 17:02, Greg Landrum <greg.land...@gmail.com> wrote:
> Wow, this is a great question and quite a fun thread.
>  
> It's hard to really make much of a contribution here without writing a 
> book/review article (something that I'm really not willing to do!), but I 
> have a few thoughts. Most of this is repeating/rephrasing things others have 
> already said.
>  
> I'm going to propose some things as facts. I think that these won't be 
> controversial:
> fact 1: if the structures are coming from different sources, they need to be 
> standardized/normalized before you compare them. This is true regardless of 
> how you want to compare them. The details of the standardization process are 
> not incredibly important, but it does need to take care of the things you 
> care about when comparing molecules. For example, if you don't care about 
> differences between salts, it should strip salts. If you don't care about 
> differences between tautomers, it should normalize tautomers.
> fact 2: The InChI algorithm includes a standardization step that normalizes 
> some tautomers, but does not remove salts.
> fact 3: The InChI representation contains a number of layers defining the 
> structure in increasing detail (this isn't strictly true, because some of the 
> choices about how layers are ordered are arbitrary, but it's close).
> fact 4: canonicalization, the way I define it, produces a canonical atom 
> numbering for a given structure, but it does *not* standardize.
> fact 5: the RDKit has essentially no well-documented standardization code.
>  
> fact X: we don't have any standard, broadly accepted approach for 
> standardization, canonicalization or representation that is fool-proof or 
> that works for even all of organic chemistry, never mind organometallics. 
> InChI, useful as it is for some things, completely fails to handle things 
> like atropisomers (they are working on this kind of thing, but it's not out 
> yet).
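
To make the salt-stripping part of fact 1 concrete, here is a minimal sketch in plain Python. It is not how RDKit, MolVS or any other toolkit does it: it only assumes that "." separates disconnected components in the SMILES and estimates heavy-atom counts with a regular expression. The names heavy_atom_count and largest_fragment are hypothetical, and a real pipeline should use a toolkit's own fragment or salt handling.

```python
import re

# One heavy-atom candidate per match: a bracket atom such as [Na+], a
# two-letter organic-subset halogen, or a one-letter (aliphatic or
# aromatic) organic-subset atom. This is a rough tokenizer, not a parser.
_ATOM_TOKEN = re.compile(r"\[[^\]]*\]|Br|Cl|[BCNOPSFI]|[bcnops]")

# Bracket atoms that are plain hydrogen, e.g. [H], [H+], [2H].
_BRACKET_H = re.compile(r"\[\d*H[+-]?\d*\]")


def heavy_atom_count(fragment):
    """Estimate the number of non-hydrogen atoms in one SMILES fragment."""
    tokens = _ATOM_TOKEN.findall(fragment)
    return sum(1 for tok in tokens if not _BRACKET_H.fullmatch(tok))


def largest_fragment(smiles):
    """Return the dot-separated component with the most heavy atoms.

    Assumption: every "." marks a disconnected component (this breaks if
    ring-closure digits are used to bond across a dot). Ties go to the
    first component.
    """
    return max(smiles.split("."), key=heavy_atom_count)


# The sodium aspartate dihydrate SMILES quoted earlier in this thread:
parent = largest_fragment("NC(CC(=O)O)C(=O)[O-].O.O.[Na+]")
# parent == "NC(CC(=O)O)C(=O)[O-]"
```

Note that this keeps the carboxylate charge on the parent. Neutralising it is a separate standardization decision and is not attempted here.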
>  
> Given all of this, if I wanted to have flexible duplicate checking *right* 
> now, I think I would use the AvalonTools struchk functionality that the RDKit 
> provides (the new pure-RDKit version still needs a bit more testing) to 
> handle basic standardization and salt stripping and then produce a table that 
> includes the InChI in a couple of different forms. I'd want to be able to 
> recognize molecules that differ only by stereochemistry, molecules that 
> differ only by location of tautomeric Hs, and molecules that differ only by 
> the location of isotopic labels. You can do this with various clever splits 
> of the InChI (how to do it is left as an exercise for the reader and/or a 
> future RDKit blog post). 
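
One possible reading of the "clever splits" mentioned above, as a rough sketch in plain Python that works on InChI strings already generated elsewhere. It assumes standard InChI ("InChI=1S/..."), where the stereo layers start with b, t, m or s and only isotopic sublayers follow the /i layer. The function names are hypothetical, and dropping layers from the string is only an approximation of regenerating the InChI with the corresponding options switched off.

```python
STEREO_PREFIXES = ("b", "t", "m", "s")


def _split(inchi):
    """Split an InChI into its "/"-separated parts (version prefix first)."""
    if not inchi.startswith("InChI="):
        raise ValueError(f"not an InChI string: {inchi!r}")
    return inchi.split("/")


def without_stereo(inchi):
    """Drop every stereo layer (/b, /t, /m, /s), including isotopic ones."""
    parts = _split(inchi)
    return "/".join(p for p in parts if not p.startswith(STEREO_PREFIXES))


def without_isotopes(inchi):
    """Drop the isotopic layer /i and everything after it.

    Assumption: standard InChI, where nothing but isotopic sublayers
    can follow /i.
    """
    parts = _split(inchi)
    for idx, part in enumerate(parts):
        if part.startswith("i"):
            return "/".join(parts[:idx])
    return inchi


def connectivity_only(inchi):
    """Keep only version, formula and the /c layer.

    A deliberately coarse key: structures that differ only in where the
    hydrogens sit will often (not always) share it. Assumes the part
    after the version prefix is the formula.
    """
    parts = _split(inchi)
    return "/".join(parts[:2] + [p for p in parts[2:] if p.startswith("c")])


# L-alanine as an example input:
l_ala = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
flat = without_stereo(l_ala)
# flat == "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)"
```

Storing each of these as its own column next to the full InChI lets the table be grouped on whichever level of detail matters for a given comparison.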
>  
> I think there's something fun to be done here with SMILES variants, borrowing 
> heavily from some of the things that Roger has written about:
> https://nextmovesoftware.com/blog/2013/04/25/finding-all-types-of-every-mer/
> Here's a more recent application of that from Noel: 
> https://nextmovesoftware.com/blog/2016/06/22/fishing-for-matched-series-in-a-sea-of-structure-representations/
>  
> If I didn't really care about details and just wanted something that I could 
> explain easily to others, I'd skip all the complication and just use InChIs 
> (or InChI keys) to recognize duplicates. There would be times when that would 
> be the wrong answer, but it would be a broadly accepted kind of wrong.[1]
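
The simple InChI-key approach could look roughly like the sketch below, which groups records by key while keeping every original record. It is plain Python and assumes the InChIKeys have already been computed by some toolkit. The function names and the source ids ("tableA:1" and so on) are hypothetical. The skeleton_only option compares just the first 14-character block of the key, which hashes the core connectivity information, so stereo and isotope variants fall into the same group.

```python
from collections import defaultdict


def group_by_key(records, skeleton_only=False):
    """Group (source_id, inchikey) pairs by InChIKey.

    With skeleton_only=True only the first block of the key (the part
    before the first "-") is compared. The original records are kept
    unchanged inside each group, so nothing is discarded.
    """
    groups = defaultdict(list)
    for source_id, inchikey in records:
        key = inchikey.split("-")[0] if skeleton_only else inchikey
        groups[key].append((source_id, inchikey))
    return dict(groups)


def duplicates(groups):
    """Return only the groups that contain more than one record."""
    return {key: recs for key, recs in groups.items() if len(recs) > 1}


# Illustrative input: L-alanine, alanine without stereo, and ethanol.
records = [
    ("tableA:1", "QNAYBMKLOCPYGJ-REOHSZLGSA-N"),
    ("tableB:7", "QNAYBMKLOCPYGJ-UHFFFAOYSA-N"),
    ("tableB:9", "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"),
]

exact = duplicates(group_by_key(records))
loose = duplicates(group_by_key(records, skeleton_only=True))
# exact is empty; loose has one group holding the two alanine records.
```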
>  
> Regardless of the approach, I would not, under most any circumstances, 
> discard the original input structures that I had. It's really good to be able 
> to figure out what the original data looked like later.
>  
> -greg
> [1] I'm crying as I write this...
>  
>  
>  
>  
> On Mon, Nov 28, 2016 at 5:25 PM, Stephen O'hagan <soha...@manchester.ac.uk> wrote:
> Has anyone come up with a fool-proof way of matching structurally equivalent 
> molecules?
>  
> Unique SMILES or InChI string comparisons don’t appear to work, presumably 
> because there are different but equivalent structures, e.g. explicit vs 
> implicit H’s, Kekulé vs aromatic, isomeric vs non-isomeric forms, 
> tautomers etc.
>  
> I also expect that comparing InChI strings might need something more than 
> just a simple string comparison, such as masking off stereo information when 
> you don’t care about stereo isomers.
>  
> I assume there are suitable tools within RDKit that can do this?
>  
> N.B. I need to collate tables from several sources that have a mix of SMILES 
> / InChI / SDF molecular representations.
>  
> I usually use RDKit via Python and/or Knime.
>  
> Cheers,
> Steve.
>  
>  
> ------------------------------------------------------------------------------
> _______________________________________________
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
