Re: [Rdkit-discuss] comparing two or more tables of molecules

2016-12-04 Thread Matthew Swain
Sorry Steve, there was a bug in MolVS that you encountered. Should now be fixed.

"pip install -U molvs" to get the update (v0.0.7).

Matt


Re: [Rdkit-discuss] comparing two or more tables of molecules

2016-12-01 Thread Markus Sitzmann
Well, since George mentioned a talk by me: I wish we had implemented
our tool back then using an open-source toolkit like RDKit (which wasn't very
well known at the time), and had also been smart enough to use SMARTS for
the transformation rules (some of them are implemented as SMARTS, but large
parts rely on other CACTVS script functionality).

I still intend to continue and advance this work and make it openly
available, but I must admit that the intention is quite vague at the moment.

Markus
--
Check out the vibrant tech community on one of the world's most 
engaging tech sites, SlashDot.org! http://sdm.link/slashdot___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] comparing two or more tables of molecules

2016-12-01 Thread Stephen O'hagan
Thanks for the interesting links.

MolVS looks good, but failed on ‘NC(CC(=O)O)C(=O)[O-].O.O.[Na+]’ which isn’t 
that extraordinary…

Couldn’t get standardiser to work at all, even on the example given; either the
API is not intuitive or the docs are wrong or out of date.

I will have a look at the info in the UniChem paper, though not inclined to use 
a web service for what I want to do.

Cheers,
Steve.


Re: [Rdkit-discuss] comparing two or more tables of molecules

2016-12-01 Thread George Papadatos
Hi Stephen,

Further to Greg's excellent reply, see this paper on how InChI strings and
keys can be used in practice to map together tautomer (ones covered by
InChI at least), isotope, stereo and parent-salt variants.
http://rd.springer.com/article/10.1186/s13321-014-0043-5

Francis (cc'ed) has a nice notebook somewhere illustrating these nice InChI
splits to find these variants.

For educational purposes, there have been other approaches like the NCI's
identifiers - discussion here:
http://acscinf.org/docs/meetings/237nm/presentations/237nm17.pdf

For pure structure standardization using RDKit see here:
https://github.com/flatkinson/standardiser
and
https://github.com/mcs07/MolVS


Cheers,

George





Re: [Rdkit-discuss] comparing two or more tables of molecules

2016-11-29 Thread Dimitri Maziuk
On 11/29/2016 11:56 AM, Chris Swain wrote:

> However I’ve found that the success is very much dependent on the
> fact 1 described by Greg, get all the structures standardised then comparison
> using canonical SMILES or InChI seems to work fine.

+1. Essentially you need to get a standardized representation of all the
properties you consider relevant and produce a unique hash of that.
It doesn't matter whether it's a SHA-1 string, some graph-based magic, or
matrix voodoo. (String comparison is of course easier.)
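
The "standardize, then hash" idea can be sketched with the Python standard library alone. This is a minimal sketch under stated assumptions: the function name `identity_hash` and the record fields (`smiles`, `charge`) are hypothetical, and the values are assumed to have been standardized already by whatever tool you trust.

```python
import hashlib
import json


def identity_hash(properties):
    """Return a SHA-1 hex digest of a dict of already-standardized properties."""
    # Serialise deterministically (sorted keys, fixed separators) so that two
    # records with the same content always produce the same byte string.
    payload = json.dumps(properties, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


# Hypothetical records: field names and values are illustrative only.
a = {"smiles": "CCO", "charge": 0}
b = {"charge": 0, "smiles": "CCO"}  # same content, different key order
c = {"smiles": "CCO", "charge": 1}  # differs in a property we care about

print(identity_hash(a) == identity_hash(b))  # True
print(identity_hash(a) == identity_hash(c))  # False
```

Only the properties placed in the record take part in the comparison, so leaving a property out is how you say that you do not care about it.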

-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu





Re: [Rdkit-discuss] comparing two or more tables of molecules

2016-11-29 Thread Chris Swain
Hi,

I’ve done this sort of comparison many times. I actually use an SVL script
within MOE, and there is an example here:
http://www.cambridgemedchemconsulting.com/resources/hit_identification/fragment_collection_profiles.html

However, I’ve found that success depends very much on fact 1 as described by
Greg: get all the structures standardised, and then comparison using
canonical SMILES or InChI seems to work fine.

Cheers,

Chris

> On 29 Nov 2016, at 17:24, rdkit-discuss-requ...@lists.sourceforge.net wrote:
> 
> fact 1: if the structures are coming from different sources, they need to
> be standardized/normalized before you compare them. This is true regardless
> of how you want to compare them. The details of the standardization process
> are not incredibly important, but it does need to take care of the things
> you care about when comparing molecules. For example, if you don't care
> about differences between salts, it should strip salts. If you don't care
> about differences between tautomers, it should normalize tautomers.



Re: [Rdkit-discuss] comparing two or more tables of molecules

2016-11-29 Thread Greg Landrum
Wow, this is a great question and quite a fun thread.

It's hard to really make much of a contribution here without writing a
book/review article (something that I'm really not willing to do!), but I
have a few thoughts. Most of this is repeating/rephrasing things others
have already said.

I'm going to propose some things as facts. I think that these won't be
controversial:
fact 1: if the structures are coming from different sources, they need to
be standardized/normalized before you compare them. This is true regardless
of how you want to compare them. The details of the standardization process
are not incredibly important, but it does need to take care of the things
you care about when comparing molecules. For example, if you don't care
about differences between salts, it should strip salts. If you don't care
about differences between tautomers, it should normalize tautomers.
fact 2: The InChI algorithm includes a standardization step that normalizes
some tautomers, but does not remove salts.
fact 3: The InChI representation contains a number of layers defining the
structure in increasing detail (this isn't strictly true, because some of
the choices about how layers are ordered are arbitrary, but it's close).
fact 4: canonicalization, the way I define it, produces a canonical atom
numbering for a given structure, but it does *not* standardize.
fact 5: the RDKit has essentially no well-documented standardization code.

fact X: we don't have any standard, broadly accepted approach for
standardization, canonicalization or representation that is fool-proof or
that works for even all of organic chemistry, never mind organometallics.
InChI, useful as it is for some things, completely fails to handle things
like atropisomers (they are working on this kind of thing, but it's not out
yet).

Given all of this, if I wanted to have flexible duplicate checking *right*
now, I think I would use the AvalonTools struchk functionality that the
RDKit provides (the new pure-RDKit version still needs a bit more testing)
to handle basic standardization and salt stripping and then produce a table
that includes the InChI in a couple of different forms. I'd want to be able
to recognize molecules that differ only by stereochemistry, molecules that
differ only by location of tautomeric Hs, and molecules that differ only by
the location of isotopic labels. You can do this with various clever splits
of the InChI (how to do it is left as an exercise for the reader and/or a
future RDKit blog post).
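
One possible way to do such InChI splits is plain string handling. The following is a minimal sketch, not the method from any RDKit blog post: the helper names (`split_inchi`, `without_stereo`, `without_isotopes`) are made up for illustration, and it assumes standard InChI strings, in which stereo layers start with /b, /t, /m or /s and the isotopic layer starts with /i and runs to the end of the string.

```python
STEREO_PREFIXES = ("b", "t", "m", "s")


def split_inchi(inchi):
    """Split an InChI into its header (version, formula) and the other layers."""
    parts = inchi.split("/")
    return parts[:2], parts[2:]


def without_stereo(inchi):
    """Drop the double-bond and tetrahedral stereo layers (/b, /t, /m, /s)."""
    head, layers = split_inchi(inchi)
    kept = [layer for layer in layers if not layer.startswith(STEREO_PREFIXES)]
    return "/".join(head + kept)


def without_isotopes(inchi):
    """Drop the isotopic layer (/i) and everything that follows it."""
    head, layers = split_inchi(inchi)
    kept = []
    for layer in layers:
        if layer.startswith("i"):
            break
        kept.append(layer)
    return "/".join(head + kept)


# Example inputs: L- and D-alanine, and 13C-labelled methanol.
l_ala = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
d_ala = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"
labelled = "InChI=1S/CH4O/c1-2/h2H,1H3/i1+1"

print(l_ala == d_ala)                                   # False
print(without_stereo(l_ala) == without_stereo(d_ala))   # True
print(without_isotopes(labelled))                       # InChI=1S/CH4O/c1-2/h2H,1H3
```

Storing the full InChI next to each reduced form in the table makes it possible to say not only that two rows match, but at which level of detail they match.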

I think there's something fun to be done here with SMILES variants,
borrowing heavily from some of the things that Roger has written about:
https://nextmovesoftware.com/blog/2013/04/25/finding-all-types-of-every-mer/
here's a more recent application of that from Noel:
https://nextmovesoftware.com/blog/2016/06/22/fishing-for-matched-series-in-a-sea-of-structure-representations/

If I didn't really care about details and just wanted something that I
could explain easily to others, I'd skip all the complication and just use
InChIs (or InChI keys) to recognize duplicates. There would be times when
that would be the wrong answer, but it would be a broadly accepted kind of
wrong.[1]
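
A short sketch of what comparing InChI keys can look like. It relies on the documented shape of a standard InChIKey: three hyphen-separated blocks, the first of which hashes the connectivity part of the InChI while the second covers the remaining layers such as stereo and isotopes. The function names are hypothetical, and the two example keys are meant to be those of L-alanine and of alanine without stereo; treat the literal values as illustrative.

```python
def key_blocks(inchikey):
    """Split an InChIKey into (connectivity block, detail block, protonation flag)."""
    first, second, flag = inchikey.split("-")
    return first, second, flag


def same_connectivity(key_a, key_b):
    """True if two keys agree on the first (connectivity) block only."""
    return key_blocks(key_a)[0] == key_blocks(key_b)[0]


l_alanine = "QNAYBMKLOCPYGJ-REOHGSTFSA-N"
alanine_no_stereo = "QNAYBMKLOCPYGJ-UHFFFAOYSA-N"

print(l_alanine == alanine_no_stereo)                   # False
print(same_connectivity(l_alanine, alanine_no_stereo))  # True
```

Full-key equality is the strict duplicate test; first-block equality is the looser test that ignores stereo and isotope differences.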

Regardless of the approach, I would not, under most any circumstances,
discard the original input structures that I had. It's really good to be
able to figure out what the original data looked like later.
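
Keeping the original inputs while collating could look like the sketch below. It assumes that each row already carries a comparison key computed upstream (for example an InChI key or a canonical SMILES); the row layout, the function name `collate`, and the placeholder keys are all hypothetical.

```python
from collections import defaultdict


def collate(rows):
    """Group rows by comparison key while keeping every original input."""
    groups = defaultdict(list)
    for row in rows:
        # Store the source table and the untouched input next to the key.
        groups[row["key"]].append((row["source"], row["original"]))
    return dict(groups)


# Placeholder data: the keys stand in for whatever identifier you computed.
rows = [
    {"source": "table_a", "original": "OCC", "key": "KEY-ETHANOL"},
    {"source": "table_b", "original": "CCO", "key": "KEY-ETHANOL"},
    {"source": "table_b", "original": "CC(=O)O", "key": "KEY-ACETIC-ACID"},
]

groups = collate(rows)
duplicates = {key: hits for key, hits in groups.items() if len(hits) > 1}
print(duplicates)  # {'KEY-ETHANOL': [('table_a', 'OCC'), ('table_b', 'CCO')]}
```

Because the original strings travel with the key, a surprising match can always be traced back to what each source actually contained.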

-greg
[1] I'm crying as I write this...




On Mon, Nov 28, 2016 at 5:25 PM, Stephen O'hagan wrote:

> Has anyone come up with a fool-proof way of matching structurally
> equivalent molecules?
>
> Unique SMILES or InChI string comparisons don’t appear to work, presumably
> because there are different but equivalent structures, e.g. explicit vs
> non-explicit H’s, Kekule vs aromatic, isomeric vs non-isomeric forms,
> tautomers etc.
>
> I also expect that comparing InChI strings might need something more than
> just a simple string comparison, such as masking off stereo information
> when you don’t care about stereoisomers.
>
> I assume there are suitable tools within RDKit that can do this?
>
> N.B. I need to collate tables from several sources that have a mix of
> SMILES / InChI / SDF molecular representations.
>
> I usually use RDKit via Python and/or Knime.
>
> Cheers,
>
> Steve.


Re: [Rdkit-discuss] comparing two or more tables of molecules

2016-11-29 Thread Patrick Walters
The Layered InChI (LyChi), developed by Trung Nguyen at NCATS, was designed
to directly address the problem you describe.  I don't have any first-hand
experience with this method (yet), but it looks intriguing.

https://github.com/ncats/lychi


Pat



Re: [Rdkit-discuss] comparing two or more tables of molecules

2016-11-28 Thread Dimitri Maziuk
On 11/28/2016 10:25 AM, Stephen O'hagan wrote:
> Has anyone come up with fool-proof way of matching structurally equivalent 
> molecules?

This is somewhat convoluted and there is no proof that it's fool-proof.

A few years ago we had good results from running the graphpowerhash()
function described here:
http://madgik.github.io/madis/aggregate.html#module-functions.aggregate.graph
on the PDB ligand database.

The parameters were

- atom1, atom2 IDs (names) as node1, node2.

- Atom stereo (R, S, N), aromatic (y/n), and "leaving atom" (y/n) for
the atoms as node1_details, node2_details (packed into a single string
with the jpack() function: see http://madgik.github.io/madis/row.html).

Looking at it now, I don't think the nodeN_details parameter needs to
include the atom's "aromatic" flag.

- Massaged bond type and bond stereo (E, Z, N) as edge_details. Also
packed into a string as above.

PDB chem comp model has bond type as SING or DOUB with a separate yes/no
"aromatic" column. We changed it to AROM for the ones where that was a yes.

The basic model is a list of bonds with atom1, atom2, and type, and a
list of atoms with stereo, aromatic, and "leaving" flags -- the last one
is "Y" for atoms that "go away" when forming a bond.

The algorithm itself, as far as I know (I am not the author), takes the
two "matrices" representing the molecule "graphs", computes their
largest eigenvalues and eigenvectors, and compares those. We have no proof
that it's 100% correct, but all the duplicates it found in the PDB Ligand
Expo at the time were genuine.
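
The general idea of comparing graphs through a leading eigenvalue can be illustrated with a toy example. This is not the madIS graphpowerhash() implementation, whose internals are not described here; it is a sketch that assumes atoms are numbered from 0, bonds are given as (atom, atom, weight) triples, and the invariant is the largest eigenvalue of the weighted adjacency matrix plus the identity, found by power iteration. The identity shift is only there to make the iteration converge on bipartite graphs.

```python
from math import sqrt


def dominant_eigenvalue(n_atoms, bonds, iterations=500):
    """Largest eigenvalue of (A + I), with A the weighted adjacency matrix."""
    matrix = [[0.0] * n_atoms for _ in range(n_atoms)]
    for i in range(n_atoms):
        matrix[i][i] = 1.0
    for a, b, weight in bonds:
        matrix[a][b] = matrix[b][a] = float(weight)

    def multiply(vec):
        return [sum(row[j] * vec[j] for j in range(n_atoms)) for row in matrix]

    # Power iteration from an all-ones start vector.
    vec = [1.0] * n_atoms
    for _ in range(iterations):
        nxt = multiply(vec)
        norm = sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    # Rayleigh quotient of the normalised vector.
    return sum(v * w for v, w in zip(vec, multiply(vec)))


butane = [(0, 1, 1), (1, 2, 1), (2, 3, 1)]
butane_renumbered = [(2, 0, 1), (0, 3, 1), (3, 1, 1)]  # same chain, new atom IDs
isobutane = [(0, 1, 1), (0, 2, 1), (0, 3, 1)]

same = abs(dominant_eigenvalue(4, butane) - dominant_eigenvalue(4, butane_renumbered))
diff = abs(dominant_eigenvalue(4, butane) - dominant_eigenvalue(4, isobutane))
print(same < 1e-9)  # True
print(diff > 1e-3)  # True
```

The value does not depend on atom numbering, which is what makes it usable as a comparison key. A single eigenvalue is not a complete invariant, though: distinct graphs can share a spectrum, which fits the caveat that there is no proof of correctness.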

Enjoy,
-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu





Re: [Rdkit-discuss] comparing two or more tables of molecules

2016-11-28 Thread Rocco Moretti
On Mon, Nov 28, 2016 at 11:31 AM, Christos Kannas wrote:

> I think it would be better to use a similarity metric based on fingerprints.

Hi Christos,

Fingerprints will only work if the fingerprint method you use captures all
of the salient information you're interested in. For example, most
fingerprint metrics in use have spotty or non-existent encoding of
chirality, so if you want to consider two enantiomers to be different,
fingerprint similarity will not work for you. (Unless you happen to pick a
fingerprint method which happens to encode the particular chirality
information you're interested in.)

E.g.

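>>> from rdkit import Chem
>>> from rdkit.DataStructs import FingerprintSimilarity
>>> from rdkit.Chem.Fingerprints.FingerprintMols import FingerprintMol
>>> m1 = Chem.MolFromSmiles("CC1=CC[C@](Cl)(CC1)C(=C)C")
>>> m2 = Chem.MolFromSmiles("CC1=CC[C@@](Cl)(CC1)C(=C)C")
>>> FingerprintSimilarity(FingerprintMol(m1),FingerprintMol(m2))
1.0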

Even regioisomers can fool a fingerprint-based method in certain cases:

>>> m1 = Chem.MolFromSmiles("N(CCC[Br])O")
>>> m2 = Chem.MolFromSmiles("N(CCCO)[Br]")
>>> FingerprintSimilarity(FingerprintMol(m1),FingerprintMol(m2))
1.0

(That's 7 versus 8 carbons on each aliphatic chain.)

I agree with Rajarshi that a SMILES based approach will probably work, if
you make sure you properly canonicalize the SMILES.

The default RDKit SMILES output should work for most molecules. RDKit will
canonicalize the SMILES by default (though keep in mind that different programs
have different SMILES canonicalization routines, so only compare RDKit
canonical SMILES with other RDKit canonical SMILES). Also, RDKit normally
removes hydrogens from structures it reads in, so passing the molecule
through RDKit will give you a SMILES without (non-critical) hydrogens. By
default it will also write aromatic systems in aromatic form, so you don't
have to worry about Kekulization differences.

If you care about stereo-isomer differences, the one thing you probably
will want to change from the defaults is to add "isomericSmiles=True" to
the calls to MolToSmiles(), otherwise you'll lose the chirality information
when you write out your SMILES.

Tautomers and charged forms are going to be the big drawback here.
Especially with things like imidazole-like rings, RDKit is particular
about where the hydrogens sit and treats the different tautomers as
different molecules.

>>> Chem.MolToSmiles(Chem.MolFromSmiles("c1nc(Cl)cn1"))
# Doesn't work: Sanitization error
>>> Chem.MolToSmiles(Chem.MolFromSmiles("c1[nH]c(Cl)cn1"))
'Clc1cnc[nH]1'
>>> Chem.MolToSmiles(Chem.MolFromSmiles("c1nc(Cl)c[nH]1"))
'Clc1c[nH]cn1'
>>> Chem.MolToSmiles(Chem.MolFromSmiles("c1[nH]c(Cl)c[nH+]1"))
'Clc1c[nH+]c[nH]1'
>>> Chem.MolToSmiles(Chem.MolFromSmiles("c1[nH+]c(Cl)c[nH]1"))
'Clc1c[nH]c[nH+]1'

That difference stays even after attempting to remove hydrogens from the
molecule.

Regards,
-Rocco


Re: [Rdkit-discuss] comparing two or more tables of molecules

2016-11-28 Thread Rajarshi Guha
It really boils down to how you standardize molecules such that you end up
with a canonical structure.

SMILES is not the issue here - if your standardizer does a proper job with
aromaticity, tautomers etc. then you can get a canonical SMILES.

You can also use the InChI model to generate a canonical SMILES
(https://jcheminf.springeropen.com/articles/10.1186/1758-2946-4-22).

This doesn't really answer your question, as I'm not familiar with RDKit
functionality for standardization.

(As an aside, internally we use https://github.com/ncats/lychi which is
conceptually similar to InChI)

PS. I don't think this is a job for fingerprint-based similarity methods,
though.



-- 
Rajarshi Guha | http://blog.rguha.net
NIH Center for Advancing Translational Science


Re: [Rdkit-discuss] comparing two or more tables of molecules

2016-11-28 Thread Christos Kannas
Hi Steve,

I think it would be better to use a similarity metric based on fingerprints.

Regards,

Christos

Christos Kannas

Researcher
Ph.D Student


