Hi Greg,

Personally, I like your suggestion of behaviour similar to SMARTS. That way 
one can decide what level of granularity one wants. Obviously it also means 
that we have to think a bit more about our queries and database preparation, 
but I am sure this will only improve our data pool.

My 2 pence
Nik


-----Original Message-----
From: Greg Landrum [mailto:[email protected]] 
Sent: Thursday, March 01, 2012 8:38 AM
To: RDKit Discuss
Subject: [Rdkit-discuss] improving substructure search behavior with real molecules

Dear all,

Jean-Paul has recently posted a couple of bugs related to the way the
RDKit handles substructure matches between molecules that are not
built from SMARTS.
A short, general summary of the problem is shown here:

In [2]: Chem.MolFromSmiles('CCC').HasSubstructMatch(Chem.MolFromSmiles('CC[14C]'))
Out[2]: True

In [3]: Chem.MolFromSmiles('CCO').HasSubstructMatch(Chem.MolFromSmiles('CC[O-]'))
Out[3]: True

The reason this happens is that the atom-atom matching code at the
moment only considers atomic number, so any O matches any other O.
Here's a table showing some examples of the current behavior (easier
to see in a fixed-width font):
| Molecule | Query   | Match |
| CCO      | CCO     | Yes   |
| CC[O-]   | CCO     | Yes   |
| CCO      | CC[O-]  | Yes   |
| CC[O-]   | CC[O-]  | Yes   |
| CC[O-]   | CC[OH]  | Yes   |
| CCOC     | CC[OH]  | Yes   |
| CCOC     | CCO     | Yes   |
| CCC      | CCC     | Yes   |
| CC[14C]  | CCC     | Yes   |
| CCC      | CC[14C] | Yes   |
| CC[14C]  | CC[14C] | Yes   |
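
As a rough illustration of why every row above is a match, here is a toy
model in plain Python. It is only a sketch and not the RDKit code: the Atom
record and the function atoms_match_current are names made up for this
example, and only charge and isotope are modelled (hydrogen counts are left
out).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Atom:
    """Toy atom record (hypothetical, not an RDKit class)."""
    atomic_num: int
    charge: int = 0
    isotope: int = 0  # 0 stands for "no isotope given"


def atoms_match_current(mol_atom: Atom, query_atom: Atom) -> bool:
    # Only the element is compared; charge and isotope are ignored.
    return mol_atom.atomic_num == query_atom.atomic_num


O = Atom(8)                 # the O in CCO
O_minus = Atom(8, charge=-1)  # the O in CC[O-]
C = Atom(6)                 # a C in CCC
C14 = Atom(6, isotope=14)   # the labelled C in CC[14C]

print(atoms_match_current(O, O_minus))  # True
print(atoms_match_current(C, C14))      # True
```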

It is quite simple to change this behavior so that it's somewhat more
intuitive, but doing so requires making some decisions about what the
semantics of these searches should be.

The easiest thing would be to go from the current overly general
definition to something that is very specific where all atomic
properties in the molecule and query must match. This gives the
following table:
| Molecule | Query   | Match |
| CCO      | CCO     | Yes   |
| CC[O-]   | CCO     | No    |
| CCO      | CC[O-]  | No    |
| CC[O-]   | CC[O-]  | Yes   |
| CC[O-]   | CC[OH]  | No    |
| CCOC     | CC[OH]  | No    |
| CCOC     | CCO     | Yes   |
| CCC      | CCC     | Yes   |
| CC[14C]  | CCC     | No    |
| CCC      | CC[14C] | No    |
| CC[14C]  | CC[14C] | Yes   |
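
In the same toy style, the very specific definition could be sketched like
this. Again this is an assumption-laden illustration rather than RDKit code:
Atom and atoms_match_strict are hypothetical names, and only charge and
isotope stand in for "all atomic properties".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Atom:
    """Toy atom record (hypothetical, not an RDKit class)."""
    atomic_num: int
    charge: int = 0
    isotope: int = 0  # 0 stands for "no isotope given"


def atoms_match_strict(mol_atom: Atom, query_atom: Atom) -> bool:
    # Every modelled property must be identical on both sides.
    return (
        mol_atom.atomic_num == query_atom.atomic_num
        and mol_atom.charge == query_atom.charge
        and mol_atom.isotope == query_atom.isotope
    )


O = Atom(8)
O_minus = Atom(8, charge=-1)
C = Atom(6)
C14 = Atom(6, isotope=14)

print(atoms_match_strict(O_minus, O))        # False: charges differ
print(atoms_match_strict(O_minus, O_minus))  # True
print(atoms_match_strict(C14, C))            # False: isotopes differ
```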

I think this would also give unexpected results, particularly
things like this:
| CC[O-]   | CCO     | No    |

My proposal for a fix is to adopt semantics similar to SMARTS: if you
don't specify something in the query, then it's not used as part of
the matching criteria. This gives the following table:
| Molecule | Query   | Match |
| CCO      | CCO     | Yes   |
| CC[O-]   | CCO     | Yes   |
| CCO      | CC[O-]  | No    |
| CC[O-]   | CC[O-]  | Yes   |
| CC[O-]   | CC[OH]  | No    |
| CCOC     | CC[OH]  | No    |
| CCOC     | CCO     | Yes   |
| CCC      | CCC     | Yes   |
| CC[14C]  | CCC     | Yes   |
| CCC      | CC[14C] | No    |
| CC[14C]  | CC[14C] | Yes   |
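
A minimal sketch of the proposed semantics at the level of a single atom
pair, in plain Python. This is not the RDKit implementation: Atom and
atoms_match_smarts_like are hypothetical names, and the sketch assumes that a
charge or isotope of 0 on the query atom means "not specified", so it cannot
tell an explicitly neutral query atom from one with no charge given.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Atom:
    """Toy atom record (hypothetical, not an RDKit class)."""
    atomic_num: int
    charge: int = 0
    isotope: int = 0  # 0 stands for "no isotope given"


def atoms_match_smarts_like(mol_atom: Atom, query_atom: Atom) -> bool:
    # The element always has to agree.
    if mol_atom.atomic_num != query_atom.atomic_num:
        return False
    # Charge is only checked when the query sets one (assumption: 0 = unset).
    if query_atom.charge != 0 and mol_atom.charge != query_atom.charge:
        return False
    # Isotope is only checked when the query sets one (assumption: 0 = unset).
    if query_atom.isotope != 0 and mol_atom.isotope != query_atom.isotope:
        return False
    return True


O = Atom(8)
O_minus = Atom(8, charge=-1)
C = Atom(6)
C14 = Atom(6, isotope=14)

print(atoms_match_smarts_like(O_minus, O))  # True: query says nothing about charge
print(atoms_match_smarts_like(O, O_minus))  # False: query asks for -1
print(atoms_match_smarts_like(C14, C))      # True: query says nothing about isotope
print(atoms_match_smarts_like(C, C14))      # False: query asks for isotope 14
```

The asymmetry between molecule and query is the point: swapping the two
arguments changes the answer, which is what the table above shows for the
CC[O-] and CC[14C] rows.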

This is easy to implement and should not have too much impact on
substructure search speeds.

Comments? Suggestions?

-greg

------------------------------------------------------------------------------
Virtualization & Cloud Management Using Capacity Planning
Cloud computing makes use of virtualization - but cloud computing 
also focuses on allowing computing to be delivered as a service.
http://www.accelacomm.com/jaw/sfnl/114/51521223/
_______________________________________________
Rdkit-discuss mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
