Re: [Rdkit-discuss] SMARTS/SMARTS and SMILES/SMARTS substructure matching

2014-03-07 Thread Toby Wright
Thanks Greg,

The final strange behaviour I've noticed that could trip fellow users up is
with matching Kekule versus aromatic representations of the same molecule
in SMARTS against SMILES. Most surprisingly, C1=CC=CC=C1 is not a
substructure of itself but has c1ccccc1 as a substructure (if the left-hand
term is SMILES and the right is SMARTS in both cases).
Code to demonstrate what I mean is below:

>>> from rdkit import Chem
>>> aromatic_benzene_smiles = Chem.MolFromSmiles('c1ccccc1')
>>> aromatic_benzene_smarts = Chem.MolFromSmarts('c1ccccc1')
>>> kekule_benzene_smiles = Chem.MolFromSmiles('C1=CC=CC=C1')
>>> kekule_benzene_smarts = Chem.MolFromSmarts('C1=CC=CC=C1')
>>> aromatic_benzene_smiles.HasSubstructMatch(aromatic_benzene_smarts)
True
>>> aromatic_benzene_smiles.HasSubstructMatch(kekule_benzene_smiles)
True
>>> aromatic_benzene_smiles.HasSubstructMatch(kekule_benzene_smarts)
False
>>> kekule_benzene_smiles.HasSubstructMatch(kekule_benzene_smarts)
False
>>> kekule_benzene_smiles.HasSubstructMatch(aromatic_benzene_smiles)
True
>>> kekule_benzene_smiles.HasSubstructMatch(aromatic_benzene_smarts)
True

I think I can see why there is a difference in behaviour: a double bond is
not the same thing as an aromatic bond. In the SMILES case a conversion can
take place because the context is complete, but in the SMARTS case it is not
(or at least might not be). But I thought I'd point out the issue in any
case. The workaround is to always explicitly make atoms aromatic in SMARTS
if you wish them to match aromatic SMILES, rather than relying on the Kekule
representation to sort it out for you.
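
The workaround above can be sketched roughly as follows. This is only an
illustration, assuming a reasonably recent RDKit install; the names `targets`,
`aromatic_query`, `kekule_query` and `results` are made up for the example. It
parses benzene from both SMILES spellings and tests each against an explicitly
aromatic SMARTS and a Kekule-style SMARTS.

```python
from rdkit import Chem

# Both SMILES spellings are sanitized on parsing, so each yields an aromatic ring.
targets = {
    "aromatic": Chem.MolFromSmiles("c1ccccc1"),
    "kekule": Chem.MolFromSmiles("C1=CC=CC=C1"),
}

# In SMARTS, lowercase 'c' asks for an aromatic carbon, while uppercase 'C'
# asks for an aliphatic carbon and '=' asks for a double bond.
aromatic_query = Chem.MolFromSmarts("c1ccccc1")
kekule_query = Chem.MolFromSmarts("C1=CC=CC=C1")

# For each target record (hit with aromatic SMARTS, hit with Kekule SMARTS).
results = {
    name: (mol.HasSubstructMatch(aromatic_query),
           mol.HasSubstructMatch(kekule_query))
    for name, mol in targets.items()
}

for name, (aromatic_hit, kekule_hit) in results.items():
    print(name, aromatic_hit, kekule_hit)
```

Only the explicitly aromatic query is expected to hit, whichever way the
SMILES was written, which agrees with the session above.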

Yours,

Toby Wright

--
InhibOx Ltd


On 6 March 2014 04:55, Greg Landrum greg.land...@gmail.com wrote:



 On Wed, Mar 5, 2014 at 4:03 PM, Toby Wright toby.wri...@inhibox.com wrote:


 This is probably related to the above so I thought I'd post it on this
 thread. I am noticing inconsistent behaviour when a molecule created via
 SMARTS that contains an 'or' statement has HasSubstructMatch called on it,
 as opposed to it being the argument to HasSubstructMatch. A simple example
 follows:

 >>> from rdkit import Chem
 >>> O_or_C = Chem.MolFromSmarts('[O,C]')
 >>> O = Chem.MolFromSmiles('O')
 >>> C = Chem.MolFromSmiles('C')
 >>> O_or_C.HasSubstructMatch(O)
 True
 >>> O_or_C.HasSubstructMatch(C)
 False
 >>> O.HasSubstructMatch(O_or_C)
 True
 >>> C.HasSubstructMatch(O_or_C)
 True

 We also see:
 >>> C_or_O = Chem.MolFromSmarts('[C,O]')
 >>> C_or_O.HasSubstructMatch(O)
 False
 >>> C_or_O.HasSubstructMatch(C)
 True

 so the order of elements in a SMARTS 'or' statement changes the
 behaviour, which is unexpected.


 This is indeed related. This is a case I didn't cover above: the
 SMILES/SMARTS match. The behavior above is expected from the point of view
 of what's in the code, though I can understand how it may not make much
 sense from the perspective of someone using the code. :-) The above should
 probably return False in both cases.

 In general, one should probably expect that using the HasSubstructMatch()
 method of a molecule constructed from SMARTS is likely to produce strange
 results. Getting a general purpose query--query matcher to work is, as far
 as I can tell, a decidedly non-trivial problem.
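
In line with the advice above, one way to stay on the well-defined path is to
keep the SMARTS molecule as the argument and call HasSubstructMatch() on a
molecule built from SMILES. The sketch below assumes a recent RDKit; the
helper name `smarts_hits` and the dictionary `hits` are hypothetical and exist
only for this example.

```python
from rdkit import Chem


def smarts_hits(smiles, smarts):
    """Return True if the molecule parsed from `smiles` contains `smarts`."""
    mol = Chem.MolFromSmiles(smiles)    # target built from SMILES: ordinary atoms
    query = Chem.MolFromSmarts(smarts)  # pattern built from SMARTS: query atoms
    if mol is None or query is None:
        raise ValueError("could not parse %r / %r" % (smiles, smarts))
    # The SMILES-derived molecule is the target; the SMARTS molecule is the query.
    return mol.HasSubstructMatch(query)


# Try both orderings of the 'or' list against both single-atom molecules.
hits = {
    (smiles, smarts): smarts_hits(smiles, smarts)
    for smiles in ("O", "C")
    for smarts in ("[O,C]", "[C,O]")
}

for key, value in sorted(hits.items()):
    print(key, value)
```

Used this way round, the order of the elements inside the 'or' list should
make no difference to the result.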

 -greg


--
Subversion Kills Productivity. Get off Subversion  Make the Move to Perforce.
With Perforce, you get hassle-free workflows. Merge that actually works. 
Faster operations. Version large binaries.  Built-in WAN optimization and the
freedom to use Git, Perforce or both. Make the move to Perforce.
http://pubads.g.doubleclick.net/gampad/clk?id=122218951iu=/4140/ostg.clktrk___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] SMARTS/SMARTS and SMILES/SMARTS substructure matching

2014-03-07 Thread Greg Landrum
Hi Toby,

On Fri, Mar 7, 2014 at 11:57 AM, Toby Wright toby.wri...@inhibox.com wrote:

 Thanks Greg,

 The final strange behaviour I've noticed that could trip fellow users up
 is with matching Kekule versus aromatic representations of the same
 molecule in SMARTS against SMILES. Most surprisingly, C1=CC=CC=C1 is not a
 substructure of itself but has c1ccccc1 as a substructure (if the left-hand
 term is SMILES and the right is SMARTS in both cases).
 Code to demonstrate what I mean is below:

 >>> from rdkit import Chem
 >>> aromatic_benzene_smiles = Chem.MolFromSmiles('c1ccccc1')
 >>> aromatic_benzene_smarts = Chem.MolFromSmarts('c1ccccc1')
 >>> kekule_benzene_smiles = Chem.MolFromSmiles('C1=CC=CC=C1')
 >>> kekule_benzene_smarts = Chem.MolFromSmarts('C1=CC=CC=C1')
 >>> aromatic_benzene_smiles.HasSubstructMatch(aromatic_benzene_smarts)
 True
 >>> aromatic_benzene_smiles.HasSubstructMatch(kekule_benzene_smiles)
 True
 >>> aromatic_benzene_smiles.HasSubstructMatch(kekule_benzene_smarts)
 False
 >>> kekule_benzene_smiles.HasSubstructMatch(kekule_benzene_smarts)
 False
 >>> kekule_benzene_smiles.HasSubstructMatch(aromatic_benzene_smiles)
 True
 >>> kekule_benzene_smiles.HasSubstructMatch(aromatic_benzene_smarts)
 True

 I think I can see why there is a difference in behaviour: a double bond is
 not the same thing as an aromatic bond. In the SMILES case a conversion can
 take place because the context is complete, but in the SMARTS case it is not
 (or at least might not be). But I thought I'd point out the issue in any
 case. The workaround is to always explicitly make atoms aromatic in SMARTS
 if you wish them to match aromatic SMILES, rather than relying on the Kekule
 representation to sort it out for you.


There have been a couple of discussions of related topics on the list
already. There is also a nice discussion of this in the Daylight Theory Manual
page for SMARTS:
http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html (look for
"SMARTS versus SMILES").

-greg


Re: [Rdkit-discuss] SMARTS/SMARTS and SMILES/SMARTS substructure matching

2014-03-05 Thread Christos Kannas
Hi Greg,

Thanks a lot for the explanation.
It makes things clearer now.
Well, the reason I'm doing SMARTS-SMARTS matching is that I would like to
match functional groups against the reactants in reactions.

Regards,

Christos

Christos Kannas

Researcher
Ph.D Student

Mob (UK): +44 (0) 7447700937
Mob (Cyprus): +357 99530608



On 5 March 2014 04:44, Greg Landrum greg.land...@gmail.com wrote:

 Hi Christos,


 On Tue, Mar 4, 2014 at 3:46 PM, Christos Kannas chriskan...@gmail.com wrote:

 Hi all,

 Why does the following happen?

 In [1]: from rdkit import Chem
 In [2]: from rdkit.Chem import AllChem
 In [3]: from rdkit.Chem import Draw

 In [4]: patt = Chem.MolFromSmarts('[CH;D2;!$(C-[!#6;!#1])]=O')

 In [5]: z2 = Chem.MolFromSmarts('[*]-C-C([H])(=O)', 1)
 In [6]: print(Chem.MolToSmiles(z2))
 [*]CC=O
 In [7]: print(Chem.MolToSmarts(z2))
 *-C-[C!H0]=O
 In [9]: z2.HasSubstructMatch(patt)
 Out[9]: False

 In [10]: z3 = Chem.MolFromSmiles(Chem.MolToSmiles(z2))
 In [11]: print(Chem.MolToSmiles(z3))
 [*]CC=O
 In [12]: print(Chem.MolToSmarts(z3))
 [*]-[#6]-[#6]=[#8]
 In [13]: z3.HasSubstructMatch(patt)
 Out[13]: True

 Shouldn't z2 and z3 contain the same information?


 The way SMARTS/SMARTS matches are handled is different from the way
 SMARTS/SMILES matches work.
 The short answer is that when doing a SMARTS/SMARTS match, the RDKit
 compares the queries to each other; when doing a SMARTS/SMILES match, on
 the other hand, it checks to see if the atoms in the SMILES molecule match
 the queries in the SMARTS molecule.

 A bit longer answer:
 Molecules built using MolFromSmiles contain Atoms; molecules built using
 MolFromSmarts contain QueryAtoms. Both Atoms and QueryAtoms have a Match()
 method that takes another Atom or QueryAtom as an argument and returns
 whether or not the two match.
 The substructure matching code makes heavy use of this Match() method.
 QueryAtom.Match(Atom) checks to see if the Atom satisfies the query.
 QueryAtom.Match(QueryAtom) checks to see if the queries on the atoms are
 the same. This uses a crude approach that is easy to fool, but I assume
 that a SMARTS-SMARTS match is not something people frequently want to do.
 Query-query matching is also not a particularly easy problem to solve in a
 general way.
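
The difference between the two matching modes described above can be pictured
with a small pure-Python toy model. This is not RDKit's code: the classes
`Atom` and `QueryAtom`, the `match` method and the `description` and
`predicate` fields are hypothetical stand-ins, and comparing description
strings is just one assumed example of a "crude" query-query comparison.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Atom:
    """Toy stand-in for a concrete atom (hypothetical, not RDKit's Atom)."""
    atomic_num: int
    aromatic: bool = False


@dataclass
class QueryAtom:
    """Toy stand-in for a query atom (hypothetical, not RDKit's QueryAtom)."""
    description: str                   # text of the query, e.g. "[O,C]"
    predicate: Callable[[Atom], bool]  # test that a concrete atom must pass

    def match(self, other) -> bool:
        if isinstance(other, QueryAtom):
            # Query vs query: crude check that the two queries look the same.
            return self.description == other.description
        # Query vs atom: does the concrete atom satisfy the query?
        return self.predicate(other)


# Two logically equivalent queries: aliphatic oxygen or aliphatic carbon.
o_or_c = QueryAtom("[O,C]", lambda a: a.atomic_num in (8, 6) and not a.aromatic)
c_or_o = QueryAtom("[C,O]", lambda a: a.atomic_num in (6, 8) and not a.aromatic)
oxygen = Atom(atomic_num=8)

print(o_or_c.match(oxygen))  # query vs atom: the atom satisfies the query
print(c_or_o.match(oxygen))  # same answer for the reordered query
print(o_or_c.match(c_or_o))  # query vs query: equivalent queries look different
```

In this toy the query-atom test gives the same answer for both orderings,
while the query-query comparison is fooled by a simple reordering, which is
the kind of fragility to expect from any shortcut of that sort.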

 -greg





Re: [Rdkit-discuss] SMARTS/SMARTS and SMILES/SMARTS substructure matching

2014-03-05 Thread Toby Wright
Hi,

This is probably related to the above so I thought I'd post it on this
thread. I am noticing inconsistent behaviour when a molecule created via
SMARTS that contains an 'or' statement has HasSubstructMatch called on it,
as opposed to it being the argument to HasSubstructMatch. A simple example
follows:

>>> from rdkit import Chem
>>> O_or_C = Chem.MolFromSmarts('[O,C]')
>>> O = Chem.MolFromSmiles('O')
>>> C = Chem.MolFromSmiles('C')
>>> O_or_C.HasSubstructMatch(O)
True
>>> O_or_C.HasSubstructMatch(C)
False
>>> O.HasSubstructMatch(O_or_C)
True
>>> C.HasSubstructMatch(O_or_C)
True

We also see:
>>> C_or_O = Chem.MolFromSmarts('[C,O]')
>>> C_or_O.HasSubstructMatch(O)
False
>>> C_or_O.HasSubstructMatch(C)
True

so the order of elements in a SMARTS 'or' statement changes the behaviour,
which is unexpected.

Yours,

Toby Wright

--
InhibOx Ltd




