Greg,

Thanks for this. Your proposal makes sense. One further question, if I
may.

How are Hs handled? Are implicit and explicit Hs handled in the same way?

E.g.
Does molecule CCC (defined without Hs) match query [CH3][CH2][CH3] (or
[CH3])? (Do you take implicit Hs into consideration?)
Does molecule [CH3][CH2][CH3] match query CCC? (This should match, as you
can find CCC in the molecule.)
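
To make the question concrete, here is how I read the proposed rule, as a toy sketch in plain Python (not RDKit code). The dict layout, the `atom_matches` name, the `total_h` key and the use of 0 for "no isotope given" are my own inventions for illustration. Treating the H count like charge and isotope is an assumption on my part, and it is exactly what I am asking about.

```python
# Toy model, not RDKit code: atoms are plain dicts.
# A molecule atom carries a concrete value for every property.
# A query atom lists only the properties written in the query text.

def atom_matches(mol_atom, query_atom):
    """True if the molecule atom agrees with every property the query specifies."""
    return all(mol_atom[key] == wanted for key, wanted in query_atom.items())


# Molecule-side atoms. total_h counts implicit and explicit Hs together
# (an assumption of this sketch); isotope 0 means "not specified".
O_IN_CCO = {"element": "O", "charge": 0, "isotope": 0, "total_h": 1}
O_IN_ANION = {"element": "O", "charge": -1, "isotope": 0, "total_h": 0}
O_IN_ETHER = {"element": "O", "charge": 0, "isotope": 0, "total_h": 0}
C_IN_CCC = {"element": "C", "charge": 0, "isotope": 0, "total_h": 3}
C14_IN_CC14C = {"element": "C", "charge": 0, "isotope": 14, "total_h": 0}

# Query-side atoms: only what the query states.
Q_O = {"element": "O"}                       # O
Q_O_MINUS = {"element": "O", "charge": -1}   # [O-]
Q_OH = {"element": "O", "total_h": 1}        # [OH]
Q_C = {"element": "C"}                       # C
Q_14C = {"element": "C", "isotope": 14}      # [14C]
Q_CH3 = {"element": "C", "total_h": 3}       # [CH3]

cases = [
    ("CC[O-] vs CCO", O_IN_ANION, Q_O),
    ("CCO vs CC[O-]", O_IN_CCO, Q_O_MINUS),
    ("CC[O-] vs CC[OH]", O_IN_ANION, Q_OH),
    ("CCOC vs CC[OH]", O_IN_ETHER, Q_OH),
    ("CC[14C] vs CCC", C14_IN_CC14C, Q_C),
    ("CCC vs CC[14C]", C_IN_CCC, Q_14C),
    ("CCC vs [CH3]", C_IN_CCC, Q_CH3),
]
for label, mol_atom, query_atom in cases:
    print(label, "->", "Yes" if atom_matches(mol_atom, query_atom) else "No")
```

The sketch compares only the last atom of each pair, not the whole graph. Under this reading, the [CH3] query matches a terminal carbon of CCC only if implicit Hs are counted on the molecule side.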

-
Jean-Paul Ebejer
Early Stage Researcher


On 1 March 2012 08:34, Stiefl, Nikolaus <[email protected]> wrote:

> Hi Greg,
>
> Personally I like your suggestion of the behaviour similar to SMARTS. That
> way one can decide what level of granularity one wants. Obviously it
> also means that we have to think a bit more about our queries and database
> preparations - I am sure though this will only improve our data pool.
>
> My 2 pence
> Nik
>
>
> -----Original Message-----
> From: Greg Landrum [mailto:[email protected]]
> Sent: Thursday, March 01, 2012 8:38 AM
> To: RDKit Discuss
> Subject: [Rdkit-discuss] improving substructure search behavior with real
> molecules
>
> Dear all,
>
> Jean-Paul has recently posted a couple of bugs related to the way the
> RDKit handles substructure matches between molecules that are not
> built from SMARTS.
> A short, general summary of the problem is shown here:
>
> In [1]: from rdkit import Chem
>
> In [2]:
> Chem.MolFromSmiles('CCC').HasSubstructMatch(Chem.MolFromSmiles('CC[14C]'))
> Out[2]: True
>
> In [3]:
> Chem.MolFromSmiles('CCO').HasSubstructMatch(Chem.MolFromSmiles('CC[O-]'))
> Out[3]: True
>
> The reason this happens is that the atom-atom matching code at the
> moment only considers atomic number, so any O matches any other O.
> Here's a table showing some examples of the current behavior (easier
> to see in a fixed-width font):
> | Molecule | Query   | Match |
> | CCO      | CCO     | Yes   |
> | CC[O-]   | CCO     | Yes   |
> | CCO      | CC[O-]  | Yes   |
> | CC[O-]   | CC[O-]  | Yes   |
> | CC[O-]   | CC[OH]  | Yes   |
> | CCOC     | CC[OH]  | Yes   |
> | CCOC     | CCO     | Yes   |
> | CCC      | CCC     | Yes   |
> | CC[14C]  | CCC     | Yes   |
> | CCC      | CC[14C] | Yes   |
> | CC[14C]  | CC[14C] | Yes   |
>
> It is quite simple to change this behavior so that it's somewhat more
> intuitive, but doing so requires making some decisions about what the
> semantics of these searches should be.
>
> The easiest thing would be to go from the current overly general
> definition to something that is very specific where all atomic
> properties in the molecule and query must match. This gives the
> following table:
> | Molecule | Query   | Match |
> | CCO      | CCO     | Yes   |
> | CC[O-]   | CCO     | No    |
> | CCO      | CC[O-]  | No    |
> | CC[O-]   | CC[O-]  | Yes   |
> | CC[O-]   | CC[OH]  | No    |
> | CCOC     | CC[OH]  | No    |
> | CCOC     | CCO     | Yes   |
> | CCC      | CCC     | Yes   |
> | CC[14C]  | CCC     | No    |
> | CCC      | CC[14C] | No    |
> | CC[14C]  | CC[14C] | Yes   |
>
> I think this would also provide unexpected results. Particularly
> things like this:
> | CC[O-]   | CCO     | No    |
>
> My proposal for a fix is to adopt semantics similar to SMARTS: if you
> don't specify something in the query, then it's not used as part of
> the matching criteria. This gives the following table:
> | Molecule | Query   | Match |
> | CCO      | CCO     | Yes   |
> | CC[O-]   | CCO     | Yes   |
> | CCO      | CC[O-]  | No    |
> | CC[O-]   | CC[O-]  | Yes   |
> | CC[O-]   | CC[OH]  | No    |
> | CCOC     | CC[OH]  | No    |
> | CCOC     | CCO     | Yes   |
> | CCC      | CCC     | Yes   |
> | CC[14C]  | CCC     | Yes   |
> | CCC      | CC[14C] | No    |
> | CC[14C]  | CC[14C] | Yes   |
>
> This is easy to implement and should not have too much impact on
> substructure search speeds.
>
> Comments? Suggestions?
>
> -greg
>
>
> ------------------------------------------------------------------------------
> Virtualization & Cloud Management Using Capacity Planning
> Cloud computing makes use of virtualization - but cloud computing
> also focuses on allowing computing to be delivered as a service.
> http://www.accelacomm.com/jaw/sfnl/114/51521223/
> _______________________________________________
> Rdkit-discuss mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>