Hi Greg,

Personally I like your suggestion of behaviour similar to SMARTS. That way one can decide what level of granularity one wants. Obviously it also means that we have to think a bit more about our queries and database preparation; I am sure, though, that this will only improve our data pool.
My 2 pence
Nik

-----Original Message-----
From: Greg Landrum [mailto:[email protected]]
Sent: Thursday, March 01, 2012 8:38 AM
To: RDKit Discuss
Subject: [Rdkit-discuss] improving substructure search behavior with real molecules

Dear all,

Jean-Paul has recently posted a couple of bugs related to the way the RDKit handles substructure matches between molecules that are not built from SMARTS. A short, general summary of the problem is shown here:

In [2]: Chem.MolFromSmiles('CCC').HasSubstructMatch(Chem.MolFromSmiles('CC[14C]'))
Out[2]: True

In [3]: Chem.MolFromSmiles('CCO').HasSubstructMatch(Chem.MolFromSmiles('CC[O-]'))
Out[3]: True

The reason this happens is that the atom-atom matching code at the moment only considers atomic number, so any O matches any other O.

Here's a table showing some examples of the current behavior (easier to see in a fixed-width font):

| Molecule | Query    | Match |
| CCO      | CCO      | Yes   |
| CC[O-]   | CCO      | Yes   |
| CCO      | CC[O-]   | Yes   |
| CC[O-]   | CC[O-]   | Yes   |
| CC[O-]   | CC[OH]   | Yes   |
| CCOC     | CC[OH]   | Yes   |
| CCOC     | CCO      | Yes   |
| CCC      | CCC      | Yes   |
| CC[14C]  | CCC      | Yes   |
| CCC      | CC[14C]  | Yes   |
| CC[14C]  | CC[14C]  | Yes   |

It is quite simple to change this behavior so that it's somewhat more intuitive, but doing so requires making some decisions about what the semantics of these searches should be.

The easiest thing would be to go from the current overly general definition to something that is very specific where all atomic properties in the molecule and query must match. This gives the following table:

| Molecule | Query    | Match |
| CCO      | CCO      | Yes   |
| CC[O-]   | CCO      | No    |
| CCO      | CC[O-]   | No    |
| CC[O-]   | CC[O-]   | Yes   |
| CC[O-]   | CC[OH]   | No    |
| CCOC     | CC[OH]   | No    |
| CCOC     | CCO      | Yes   |
| CCC      | CCC      | Yes   |
| CC[14C]  | CCC      | No    |
| CCC      | CC[14C]  | No    |
| CC[14C]  | CC[14C]  | Yes   |

I think this would also provide unexpected results.
Particularly things like this:

| CC[O-]   | CCO      | No    |

My proposal for a fix is to adopt semantics similar to SMARTS: if you don't specify something in the query, then it's not used as part of the matching criteria. This gives the following table:

| Molecule | Query    | Match |
| CCO      | CCO      | Yes   |
| CC[O-]   | CCO      | Yes   |
| CCO      | CC[O-]   | No    |
| CC[O-]   | CC[O-]   | Yes   |
| CC[O-]   | CC[OH]   | No    |
| CCOC     | CC[OH]   | No    |
| CCOC     | CCO      | Yes   |
| CCC      | CCC      | Yes   |
| CC[14C]  | CCC      | Yes   |
| CCC      | CC[14C]  | No    |
| CC[14C]  | CC[14C]  | Yes   |

This is easy to implement and should not have too much impact on substructure search speeds.

Comments? Suggestions?

-greg

------------------------------------------------------------------------------
Virtualization & Cloud Management Using Capacity Planning
Cloud computing makes use of virtualization - but cloud computing
also focuses on allowing computing to be delivered as a service.
http://www.accelacomm.com/jaw/sfnl/114/51521223/
_______________________________________________
Rdkit-discuss mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
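
To make the three sets of matching semantics in the quoted message concrete, here is a minimal pure-Python sketch of a single atom-atom comparison. It does not use or reproduce the RDKit API: the `Atom` class and the three `match_*` functions are hypothetical names invented for illustration. It also assumes that only atomic number, formal charge, and isotope matter, and that an unspecified charge or isotope behaves like 0; hydrogen counts such as `[OH]` are left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Atom:
    """Hypothetical stand-in for an atom; None means "not specified"."""
    atomic_num: int
    charge: Optional[int] = None
    isotope: Optional[int] = None


def _value(atom: Atom, field: str) -> int:
    # Assumption: an unspecified property behaves like the default value 0.
    value = getattr(atom, field)
    return 0 if value is None else value


def match_current(mol_atom: Atom, query_atom: Atom) -> bool:
    # Current behaviour described in the message: atomic number only.
    return mol_atom.atomic_num == query_atom.atomic_num


def match_strict(mol_atom: Atom, query_atom: Atom) -> bool:
    # "Very specific" alternative: every atomic property must agree.
    return mol_atom.atomic_num == query_atom.atomic_num and all(
        _value(mol_atom, f) == _value(query_atom, f)
        for f in ("charge", "isotope")
    )


def match_proposed(mol_atom: Atom, query_atom: Atom) -> bool:
    # SMARTS-like proposal: a property is only compared when the
    # query explicitly specifies it.
    if mol_atom.atomic_num != query_atom.atomic_num:
        return False
    for field in ("charge", "isotope"):
        wanted = getattr(query_atom, field)
        if wanted is not None and _value(mol_atom, field) != wanted:
            return False
    return True


# The final atom of the SMILES used in the tables.
O = Atom(8)                 # O in CCO
O_MINUS = Atom(8, charge=-1)  # [O-] in CC[O-]
C = Atom(6)                 # C in CCC
C14 = Atom(6, isotope=14)   # [14C] in CC[14C]
```

Under these assumptions, `match_proposed(O_MINUS, O)` is true while `match_proposed(O, O_MINUS)` is false, which mirrors the asymmetry between the `CC[O-]`/`CCO` and `CCO`/`CC[O-]` rows of the proposed table.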

