Re: [Rdkit-discuss] SMARTS/SMARTS and SMILES/SMARTS substructure matching
Thanks Greg,

The final strange behaviour I've noticed that could trip fellow users up is with matching Kekule versus aromatic representations of the same molecule in SMARTS against SMILES. Most surprisingly, C1=CC=CC=C1 is not a substructure of itself but does have c1ccccc1 as a substructure (when the left-hand term is SMILES and the right-hand term is SMARTS in both cases). Code to demonstrate what I mean:

    >>> from rdkit import Chem
    >>> aromatic_benzene_smiles = Chem.MolFromSmiles('c1ccccc1')
    >>> aromatic_benzene_smarts = Chem.MolFromSmarts('c1ccccc1')
    >>> kekule_benzene_smiles = Chem.MolFromSmiles('C1=CC=CC=C1')
    >>> kekule_benzene_smarts = Chem.MolFromSmarts('C1=CC=CC=C1')
    >>> aromatic_benzene_smiles.HasSubstructMatch(aromatic_benzene_smarts)
    True
    >>> aromatic_benzene_smiles.HasSubstructMatch(kekule_benzene_smiles)
    True
    >>> aromatic_benzene_smiles.HasSubstructMatch(kekule_benzene_smarts)
    False
    >>> kekule_benzene_smiles.HasSubstructMatch(kekule_benzene_smarts)
    False
    >>> kekule_benzene_smiles.HasSubstructMatch(aromatic_benzene_smiles)
    True
    >>> kekule_benzene_smiles.HasSubstructMatch(aromatic_benzene_smarts)
    True

I think I can see why there is a difference in behaviour: a double bond is not the same thing as an aromatic bond. In the SMILES case a conversion to the aromatic form can take place because the context is complete, but in the SMARTS case it is not (or at least might not be). But I thought I'd point out the issue in any case. The workaround is to always explicitly make atoms aromatic in SMARTS if you wish them to match aromatic SMILES, rather than relying on the Kekule representation to sort it out for you.

Yours,
Toby Wright
--
InhibOx Ltd

On 6 March 2014 04:55, Greg Landrum greg.land...@gmail.com wrote:
> On Wed, Mar 5, 2014 at 4:03 PM, Toby Wright toby.wri...@inhibox.com wrote:
> > This is probably related to the above so I thought I'd post it on this thread. I am noticing inconsistent behaviour when a molecule created via SMARTS that contains an 'or' statement has HasSubstructMatch called on it, as opposed to it being the argument to HasSubstructMatch. A simple example follows:
> >
> >     >>> O_or_C = Chem.MolFromSmarts('[O,C]')
> >     >>> O = Chem.MolFromSmiles('O')
> >     >>> C = Chem.MolFromSmiles('C')
> >     >>> O_or_C.HasSubstructMatch(O)
> >     True
> >     >>> O_or_C.HasSubstructMatch(C)
> >     False
> >     >>> O.HasSubstructMatch(O_or_C)
> >     True
> >     >>> C.HasSubstructMatch(O_or_C)
> >     True
> >
> > We also see:
> >
> >     >>> C_or_O = Chem.MolFromSmarts('[C,O]')
> >     >>> C_or_O.HasSubstructMatch(O)
> >     False
> >     >>> C_or_O.HasSubstructMatch(C)
> >     True
> >
> > so the order of elements in a SMARTS 'or' statement changes the behaviour, which is unexpected.
>
> This is indeed related. This is a case I didn't cover above: the SMILES/SMARTS match.
>
> The behavior above is expected from the point of view of what's in the code, though I can understand how it may not make much sense from the perspective of someone using the code. :-)
> The above should probably return False in both cases.
>
> In general, one should probably expect that using the HasSubstructMatch() method of a molecule constructed from SMARTS is likely to produce strange results. Getting a general-purpose query-query matcher to work is, as far as I can tell, a decidedly non-trivial problem.
>
> -greg

Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
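For anyone wanting to reproduce the Kekule/aromatic behaviour described in this message, here is a minimal sketch. It assumes a reasonably recent RDKit under Python 3; the variable names are illustrative and not taken from the thread. It checks that a Kekule SMILES is aromatized when parsed, and that a Kekule SMARTS therefore no longer matches it while an aromatic SMARTS does.

```python
from rdkit import Chem

# Both SMILES strings describe benzene. MolFromSmiles sanitizes by default,
# which perceives aromaticity, so the Kekule input ends up aromatic.
kekule_mol = Chem.MolFromSmiles('C1=CC=CC=C1')
aromatic_mol = Chem.MolFromSmiles('c1ccccc1')

same_canonical = Chem.MolToSmiles(kekule_mol) == Chem.MolToSmiles(aromatic_mol)
all_aromatic = all(atom.GetIsAromatic() for atom in kekule_mol.GetAtoms())

# SMARTS is a query language: 'C' means "aliphatic carbon" and '=' means
# "double bond", and no aromaticity perception is applied to the pattern.
kekule_query = Chem.MolFromSmarts('C1=CC=CC=C1')
aromatic_query = Chem.MolFromSmarts('c1ccccc1')

kekule_query_hits = kekule_mol.HasSubstructMatch(kekule_query)
aromatic_query_hits = kekule_mol.HasSubstructMatch(aromatic_query)

print('same canonical SMILES:', same_canonical)
print('all atoms aromatic:   ', all_aromatic)
print('Kekule SMARTS hits:   ', kekule_query_hits)
print('aromatic SMARTS hits: ', aromatic_query_hits)
```

The design point is the one made in the message: write aromatic atoms explicitly in SMARTS when the target is expected to be aromatic after parsing.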
Re: [Rdkit-discuss] SMARTS/SMARTS and SMILES/SMARTS substructure matching
Hi Toby,

On Fri, Mar 7, 2014 at 11:57 AM, Toby Wright toby.wri...@inhibox.com wrote:
> I think I can see why there is a difference in behaviour: a double bond is not the same thing as an aromatic bond. In the SMILES case a conversion can take place because the context is complete, but in the SMARTS case it is not (or at least might not be). But I thought I'd point out the issue in any case. The workaround is to always explicitly make atoms aromatic in SMARTS if you wish them to match aromatic SMILES, rather than relying on the Kekule representation to sort it out for you.

There have been a couple of discussions of related topics on the list already. There is also a nice discussion of this in the Daylight Theory Manual page for SMARTS:

http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html

Look for the section "SMARTS versus SMILES".

-greg
Re: [Rdkit-discuss] SMARTS/SMARTS and SMILES/SMARTS substructure matching
Hi Greg,

Thanks a lot for the explanation. It makes things clearer now.

The reason I'm doing a SMARTS-SMARTS match is that I would like to match functional groups against the reactants in reactions.

Regards,
Christos

Christos Kannas
Researcher
Ph.D Student

On 5 March 2014 04:44, Greg Landrum greg.land...@gmail.com wrote:
> Hi Christos,
>
> On Tue, Mar 4, 2014 at 3:46 PM, Christos Kannas chriskan...@gmail.com wrote:
> > Hi all,
> >
> > Why does the following happen?
> >
> >     In [1]: from rdkit import Chem
> >     In [2]: from rdkit.Chem import AllChem
> >     In [3]: from rdkit.Chem import Draw
> >     In [4]: patt = Chem.MolFromSmarts('[CH;D2;!$(C-[!#6;!#1])]=O')
> >     In [5]: z2 = Chem.MolFromSmarts('[*]-C-C([H])(=O)', 1)
> >     In [6]: print(Chem.MolToSmiles(z2))
> >     [*]CC=O
> >     In [7]: print(Chem.MolToSmarts(z2))
> >     *-C-[C!H0]=O
> >     In [9]: z2.HasSubstructMatch(patt)
> >     Out[9]: False
> >     In [10]: z3 = Chem.MolFromSmiles(Chem.MolToSmiles(z2))
> >     In [11]: print(Chem.MolToSmiles(z3))
> >     [*]CC=O
> >     In [12]: print(Chem.MolToSmarts(z3))
> >     [*]-[#6]-[#6]=[#8]
> >     In [13]: z3.HasSubstructMatch(patt)
> >     Out[13]: True
> >
> > Shouldn't z2 and z3 carry the same information?
>
> SMARTS/SMARTS matches are handled differently from SMARTS/SMILES matches.
>
> The short answer is that when doing a SMARTS/SMARTS match, the RDKit compares the queries to each other; when doing a SMARTS/SMILES match, on the other hand, it checks to see if the atoms in the SMILES molecule match the queries in the SMARTS molecule.
>
> A bit longer answer:
> Molecules built using MolFromSmiles contain Atoms; molecules built using MolFromSmarts contain QueryAtoms. Both Atoms and QueryAtoms have a Match() method that takes another Atom or QueryAtom as an argument and returns whether or not the two match. The substructure matching code makes heavy use of this Match() method.
>
> QueryAtom.Match(Atom) checks to see if the Atom satisfies the query.
> QueryAtom.Match(QueryAtom) checks to see if the queries on the atoms are the same. This uses a crude approach that is easy to fool, but I assume that a SMARTS-SMARTS match is not something people want to do frequently. Query-query matching is also not a particularly easy problem to solve in a general way.
>
> -greg
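The Atom/QueryAtom distinction explained in the quoted reply can be seen directly from Python. The sketch below is only an illustration under stated assumptions: it reuses the aldehyde pattern from the thread, but the three test molecules (propanal, acetic acid, methyl formate) and all variable names are hypothetical additions. It checks `HasQuery()` on atoms from each parser and then runs the well-defined direction of the match, a SMARTS query against concrete molecules built from SMILES.

```python
from rdkit import Chem

# Pattern from the thread: a CH carbon with two connections, not single-bonded
# to a heteroatom, double-bonded to oxygen (roughly "aldehyde").
aldehyde = Chem.MolFromSmarts('[CH;D2;!$(C-[!#6;!#1])]=O')
propanal = Chem.MolFromSmiles('CCC=O')

# Atoms created by MolFromSmarts carry queries; atoms from MolFromSmiles do not.
smarts_atoms_are_queries = all(a.HasQuery() for a in aldehyde.GetAtoms())
smiles_atoms_are_queries = any(a.HasQuery() for a in propanal.GetAtoms())

# Hypothetical test set: match the query against concrete molecules rather
# than against another query molecule.
candidates = {
    'propanal': 'CCC=O',
    'acetic acid': 'CC(=O)O',
    'methyl formate': 'COC=O',
}
hits = {
    name: Chem.MolFromSmiles(smiles).HasSubstructMatch(aldehyde)
    for name, smiles in candidates.items()
}

print('SMARTS atoms have queries:', smarts_atoms_are_queries)
print('SMILES atoms have queries:', smiles_atoms_are_queries)
for name, matched in hits.items():
    print(name, matched)
```

For the functional-group use case mentioned above, one possible route is therefore to test the functional-group SMARTS against concrete reactant molecules instead of against the query molecules of a reaction template; whether that fits a given workflow is an assumption, not something the thread confirms.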
Re: [Rdkit-discuss] SMARTS/SMARTS and SMILES/SMARTS substructure matching
Hi,

This is probably related to the above so I thought I'd post it on this thread. I am noticing inconsistent behaviour when a molecule created via SMARTS that contains an 'or' statement has HasSubstructMatch called on it, as opposed to it being the argument to HasSubstructMatch. A simple example follows:

    >>> from rdkit import Chem
    >>> O_or_C = Chem.MolFromSmarts('[O,C]')
    >>> O = Chem.MolFromSmiles('O')
    >>> C = Chem.MolFromSmiles('C')
    >>> O_or_C.HasSubstructMatch(O)
    True
    >>> O_or_C.HasSubstructMatch(C)
    False
    >>> O.HasSubstructMatch(O_or_C)
    True
    >>> C.HasSubstructMatch(O_or_C)
    True

We also see:

    >>> C_or_O = Chem.MolFromSmarts('[C,O]')
    >>> C_or_O.HasSubstructMatch(O)
    False
    >>> C_or_O.HasSubstructMatch(C)
    True

so the order of elements in a SMARTS 'or' statement changes the behaviour, which is unexpected.

Yours,
Toby Wright
--
InhibOx Ltd
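A short sketch of the direction of the match that is well defined for 'or' queries: the molecule built from SMILES is the target and the SMARTS molecule is the argument. The helper `matches_query` and the extra nitrogen test case are hypothetical additions for illustration. The reverse call (invoking HasSubstructMatch on the SMARTS molecule) is deliberately not exercised, since the thread reports its results as unreliable.

```python
from rdkit import Chem


def matches_query(smiles, smarts):
    """Return True if the molecule parsed from `smiles` satisfies `smarts`."""
    mol = Chem.MolFromSmiles(smiles)
    query = Chem.MolFromSmarts(smarts)
    # Target molecule on the left, query molecule as the argument.
    return mol.HasSubstructMatch(query)


targets = ('O', 'C', 'N')

# Results for both orderings of the 'or' list.
o_or_c = {smi: matches_query(smi, '[O,C]') for smi in targets}
c_or_o = {smi: matches_query(smi, '[C,O]') for smi in targets}

print('[O,C]:', o_or_c)
print('[C,O]:', c_or_o)
print('order independent:', o_or_c == c_or_o)
```

Used this way, the order of the alternatives inside the brackets makes no difference to the result.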