Dear all,

Jean-Paul has recently posted a couple of bugs related to the way the
RDKit handles substructure matches when the query molecule is not
built from SMARTS (for example, when it is parsed from SMILES).
A short, general summary of the problem is shown here:

In [2]: Chem.MolFromSmiles('CCC').HasSubstructMatch(Chem.MolFromSmiles('CC[14C]'))
Out[2]: True

In [3]: Chem.MolFromSmiles('CCO').HasSubstructMatch(Chem.MolFromSmiles('CC[O-]'))
Out[3]: True

This happens because the atom-atom matching code currently considers
only atomic number, so any O matches any other O regardless of charge,
isotope, or hydrogen count.
Here's a table showing some examples of the current behavior (easier
to see in a fixed-width font):
| Molecule | Query   | Match |
| CCO      | CCO     | Yes   |
| CC[O-]   | CCO     | Yes   |
| CCO      | CC[O-]  | Yes   |
| CC[O-]   | CC[O-]  | Yes   |
| CC[O-]   | CC[OH]  | Yes   |
| CCOC     | CC[OH]  | Yes   |
| CCOC     | CCO     | Yes   |
| CCC      | CCC     | Yes   |
| CC[14C]  | CCC     | Yes   |
| CCC      | CC[14C] | Yes   |
| CC[14C]  | CC[14C] | Yes   |
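
As a rough illustration of why every row above says Yes, here is a toy
model of an element-only atom comparison. This is a sketch, not the
actual RDKit matching code: the ToyAtom type and matches_by_element
function are hypothetical names made up for the example.

```python
from typing import NamedTuple


class ToyAtom(NamedTuple):
    """Hypothetical stand-in for an atom; not an RDKit class."""
    atomic_num: int
    charge: int = 0
    isotope: int = 0  # 0 stands for "no isotope given"


def matches_by_element(mol_atom: ToyAtom, query_atom: ToyAtom) -> bool:
    """Compare atomic number only; charge and isotope are ignored."""
    return mol_atom.atomic_num == query_atom.atomic_num


neutral_o = ToyAtom(8)              # O in CCO
anionic_o = ToyAtom(8, charge=-1)   # [O-] in CC[O-]
carbon = ToyAtom(6)                 # C in CCC
carbon_14 = ToyAtom(6, isotope=14)  # [14C] in CC[14C]

# Both directions match, because only the element is compared.
print(matches_by_element(neutral_o, anionic_o))
print(matches_by_element(anionic_o, neutral_o))
print(matches_by_element(carbon, carbon_14))
print(matches_by_element(carbon_14, carbon))
```

With a comparison like this, the charge on the oxygen and the isotope
label on the carbon never enter the decision.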

It is quite simple to change this behavior so that it's somewhat more
intuitive, but doing so requires making some decisions about what the
semantics of these searches should be.

The easiest thing would be to go from the current overly general
definition to something very specific, where all atomic properties
in the molecule and query must match. This gives the following table:
| Molecule | Query   | Match |
| CCO      | CCO     | Yes   |
| CC[O-]   | CCO     | No    |
| CCO      | CC[O-]  | No    |
| CC[O-]   | CC[O-]  | Yes   |
| CC[O-]   | CC[OH]  | No    |
| CCOC     | CC[OH]  | No    |
| CCOC     | CCO     | Yes   |
| CCC      | CCC     | Yes   |
| CC[14C]  | CCC     | No    |
| CCC      | CC[14C] | No    |
| CC[14C]  | CC[14C] | Yes   |

I think this would also give unexpected results, particularly rows
like this one, where a neutral query no longer matches its charged
counterpart:
| CC[O-]   | CCO     | No    |

My proposal for a fix is to adopt semantics similar to SMARTS: if you
don't specify something in the query, then it's not used as part of
the matching criteria. This gives the following table:
| Molecule | Query   | Match |
| CCO      | CCO     | Yes   |
| CC[O-]   | CCO     | Yes   |
| CCO      | CC[O-]  | No    |
| CC[O-]   | CC[O-]  | Yes   |
| CC[O-]   | CC[OH]  | No    |
| CCOC     | CC[OH]  | No    |
| CCOC     | CCO     | Yes   |
| CCC      | CCC     | Yes   |
| CC[14C]  | CCC     | Yes   |
| CCC      | CC[14C] | No    |
| CC[14C]  | CC[14C] | Yes   |
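
To make the proposed rule concrete, here is a minimal sketch of an
atom-level comparison in which only the properties specified on the
query atom are tested. It is a toy model under stated assumptions, not
the RDKit implementation: the TargetAtom and QueryAtom types and the
atom_matches function are hypothetical, "unspecified" is modelled as
None, and only the last atom of each molecule and query from the table
is compared (no graph matching is done).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TargetAtom:
    """Atom in the molecule being searched (hypothetical type)."""
    atomic_num: int
    charge: int = 0
    isotope: int = 0  # 0 stands for "no isotope given"
    num_hs: int = 0


@dataclass(frozen=True)
class QueryAtom:
    """Atom in the query; None means 'not specified, do not test'."""
    atomic_num: int
    charge: Optional[int] = None
    isotope: Optional[int] = None
    num_hs: Optional[int] = None


def atom_matches(target: TargetAtom, query: QueryAtom) -> bool:
    """True if the target satisfies every property the query specifies."""
    if target.atomic_num != query.atomic_num:
        return False
    for name in ("charge", "isotope", "num_hs"):
        wanted = getattr(query, name)
        if wanted is not None and getattr(target, name) != wanted:
            return False
    return True


# Molecule atoms (assumed property values for the sketch).
ethanol_o = TargetAtom(8, num_hs=1)    # O in CCO
ethoxide_o = TargetAtom(8, charge=-1)  # [O-] in CC[O-]
ether_o = TargetAtom(8)                # O in CCOC
plain_c = TargetAtom(6, num_hs=3)      # terminal C in CCC
c14 = TargetAtom(6, isotope=14)        # [14C] in CC[14C]

# Query atoms: only what the query spells out is set.
q_o = QueryAtom(8)                     # O
q_o_minus = QueryAtom(8, charge=-1)    # [O-]
q_oh = QueryAtom(8, num_hs=1)          # [OH]
q_c = QueryAtom(6)                     # C
q_c14 = QueryAtom(6, isotope=14)       # [14C]

print(atom_matches(ethoxide_o, q_o))       # molecule CC[O-], query CCO
print(atom_matches(ethanol_o, q_o_minus))  # molecule CCO, query CC[O-]
print(atom_matches(c14, q_c))              # molecule CC[14C], query CCC
print(atom_matches(plain_c, q_c14))        # molecule CCC, query CC[14C]
```

The asymmetry in the table falls out of the fact that the test is
driven by the query: an unadorned query atom constrains only the
element, while a query atom carrying a charge, isotope, or hydrogen
count requires the molecule atom to carry the same value.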

This is easy to implement and should not have too much impact on
substructure search speeds.

Comments? Suggestions?

-greg
