Re: [Rdkit-discuss] how to use structure as substructure query

Greg Landrum Mon, 03 Dec 2012 20:33:36 -0800

On Mon, Dec 3, 2012 at 9:58 PM, Andrew Dalke <[email protected]> wrote:


> On Dec 3, 2012, at 4:55 PM, Greg Landrum wrote:
> > Yes, it's here:
> >
> > http://www.rdkit.org/docs/RDKit_Book.html#atom-atom-matching-in-substructure-queries
>
> Thanks.
>
> It's incomplete though - it doesn't show how bonds are matched nor
> how aromaticity is handled for atoms. Does a SMILES with a "C" mean
> that aromaticity is specified, and so that "c" is not matched? I
> can't determine that from the docs.
>

Aromaticity is not used in the matching criteria for atoms.
Bonds are matched purely using bond type, with the one exception that a
bond of unspecified type matches anything and is matched by anything.


> I suspect the following shows an incorrect implementation:
>
> >>> query = Chem.MolFromSmiles("CC")
> >>> target = Chem.MolFromSmiles("c1ccccc1C")
> >>> target.HasSubstructMatch(query)
> True
> >>> target = Chem.MolFromSmiles("c1ccccc1")
> >>> target.HasSubstructMatch(query)
> False
>
>   I did not expect a "CC" to match the "cC".
>

Aromaticity is ignored, so this is correct.
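
To see which target atoms a SMILES-derived query actually picks up, it can help to look at the match and at the aromatic flags of the matched atoms. A minimal sketch (the variable names are mine, not from the thread):

```python
from rdkit import Chem

query = Chem.MolFromSmiles("CC")
target = Chem.MolFromSmiles("c1ccccc1C")

# Indices of the target atoms matched by the query atoms.
match = target.GetSubstructMatch(query)

# Aromatic flag of each matched target atom, keyed by atom index.
matched_flags = {idx: target.GetAtomWithIdx(idx).GetIsAromatic() for idx in match}
print(match, matched_flags)
```

The only single C-C bond in the target joins ring atom 5 to the methyl carbon 6, so the match pairs one aromatic and one aliphatic atom.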


> There's also a strangeness in the following, where
> a single bond can match a double:
>
> >>> query = Chem.MolFromSmiles("CC")
> >>> target = Chem.MolFromSmiles("c1cccc1=C")
> >>> target.HasSubstructMatch(query)
> True
> >>> target = Chem.MolFromSmiles("c1ccccc1")
> >>> target.HasSubstructMatch(query)
> False
>

The single isn't matching a double; it's matching the ring bonds:

In [9]: p = Chem.MolFromSmiles("CC")
In [10]: m = Chem.MolFromSmiles("c1cccc1=C")
In [12]: m.GetSubstructMatch(p)
Out[12]: (0, 4)

Because that ring is not aromatic:

In [13]: Chem.MolToSmiles(m)
Out[13]: 'C=C1C=CC=C1'
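
Counting bond types and aromatic atoms is another quick way to check this. A small sketch, assuming the same input SMILES as above (the counter names are only for illustration):

```python
from rdkit import Chem

m = Chem.MolFromSmiles("c1cccc1=C")

# Count bonds by type after RDKit's own aromaticity perception.
n_single = sum(b.GetBondType() == Chem.BondType.SINGLE for b in m.GetBonds())
n_double = sum(b.GetBondType() == Chem.BondType.DOUBLE for b in m.GetBonds())
n_aromatic_bonds = sum(b.GetIsAromatic() for b in m.GetBonds())

# The bond between atoms 0 and 4 is the one reported by GetSubstructMatch.
ring_bond = m.GetBondBetweenAtoms(0, 4).GetBondType()
print(n_single, n_double, n_aromatic_bonds, ring_bond)
```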

> ... even when I explicitly give a single bond:
>
> >>> query = Chem.MolFromSmiles("C-C")
> >>> target = Chem.MolFromSmiles("c1cccc1=C")
> >>> target.HasSubstructMatch(query)
> True
> >>> target = Chem.MolFromSmiles("c1ccccc1")
> >>> target.HasSubstructMatch(query)
> False
>

Same story. The query molecules are exactly the same in each case.
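
A quick way to convince yourself of that, as a sketch (q1 and q2 are just illustrative names):

```python
from rdkit import Chem

q1 = Chem.MolFromSmiles("CC")
q2 = Chem.MolFromSmiles("C-C")

# Both SMILES parse to the same molecule: two carbons joined by a single bond.
same_smiles = Chem.MolToSmiles(q1) == Chem.MolToSmiles(q2)
bond_type_1 = q1.GetBondWithIdx(0).GetBondType()
bond_type_2 = q2.GetBondWithIdx(0).GetBondType()
print(same_smiles, bond_type_1, bond_type_2)
```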


> The reason this is important to what I'm doing is that I
> am developing new SMARTS patterns for screening. One of my
> patterns is "CC". Consider the following case:
>
> My query is:  CC1=Cc2ccccc2CN1
> My target is: c1ccc2c(c1)C=C3c4ccccc4C(=O)N3[C@@H]2O
>
> >>> query = Chem.MolFromSmiles("CC1=Cc2ccccc2CN1")
> >>> target = Chem.MolFromSmiles("c1ccc2c(c1)C=C3c4ccccc4C(=O)N3[C@@H]2O")
>
> Here's the code which does the screening.
>
> >>> screen = Chem.MolFromSmarts("CC")
> >>> query.HasSubstructMatch(screen)
> True
> >>> target.HasSubstructMatch(screen)
> False
> >>>
>
> This should mean that the target is screened out. However,
> RDKit says that the query is actually a substructure of the target:
>
> >>> target.HasSubstructMatch(query)
> True
>
> This means the SMARTS pattern "CC" is a false screen.
>

That's correct behavior. The SMARTS pattern "CC" means [Aliphatic
Carbon](single or aromatic bond)[Aliphatic Carbon]. That pattern does not
exist in the target, but it does exist in the query.
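
A few small molecules make the SMARTS semantics concrete. This is only a sketch; the example molecules are my own choice, not from the thread:

```python
from rdkit import Chem

screen = Chem.MolFromSmarts("CC")  # aliphatic C, single-or-aromatic bond, aliphatic C

examples = {
    "ethane": "CC",
    "ethylbenzene": "CCc1ccccc1",
    "toluene": "Cc1ccccc1",
    "benzene": "c1ccccc1",
}

# True where the molecule contains two bonded aliphatic carbons.
hits = {name: Chem.MolFromSmiles(smi).HasSubstructMatch(screen)
        for name, smi in examples.items()}
print(hits)
```

Ethane and ethylbenzene contain two bonded aliphatic carbons and match; toluene and benzene do not, because every C-C bond in them involves at least one aromatic atom.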

> Based on this, it seems that I can't use SMARTS patterns to define
> a screen which is easily compatible with the molecule-based substructure
> matcher.
>

Sure you can, but you'll need to use atomic numbers in the SMARTS instead
of letters in order to avoid the aromatic/aliphatic queries.
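
For example, something along these lines. This is a sketch: the pattern "[#6]-[#6]" is my own illustration of an atomic-number screen, with an explicit single bond chosen to mirror the bond-type matching used for molecule queries:

```python
from rdkit import Chem

toluene = Chem.MolFromSmiles("Cc1ccccc1")

mol_query = Chem.MolFromSmiles("CC")            # molecule used as a query
letter_screen = Chem.MolFromSmarts("CC")        # aliphatic carbons only
number_screen = Chem.MolFromSmarts("[#6]-[#6]")  # any carbons, single bond

results = {
    "mol_query": toluene.HasSubstructMatch(mol_query),
    "letter_screen": toluene.HasSubstructMatch(letter_screen),
    "number_screen": toluene.HasSubstructMatch(number_screen),
}
print(results)
```

Here the atomic-number screen agrees with the molecule query on toluene, while the letter-based screen rejects it.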


> What I think I can do is:
>   1) parse the SMILES for the query
>   2) remove any explicit hydrogens
>   3) use Chem.MolFragmentToSmiles to turn the de-hydrogenated molecule
>       into a SMARTS string
>   4) convert the SMARTS into the actual query
>

I think you can just use Chem.MolToSmiles(dhmol,canonical=False), but
otherwise this flow looks ok. If you could "trust" your users to always
provide aromatic SMILES, you could just skip this whole mess and use
MolFromSmarts at the beginning. I guess you're trying to avoid that though.
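
Roughly, that flow could look like the sketch below. The helper name smiles_to_smarts_query is hypothetical, and the code assumes the input parses as SMILES; note that bracket atoms and chirality marks are then interpreted with SMARTS semantics:

```python
from rdkit import Chem

def smiles_to_smarts_query(smiles):
    """Turn a user SMILES into a SMARTS-based query (hypothetical helper)."""
    mol = Chem.MolFromSmiles(smiles)  # parsing also perceives aromaticity
    if mol is None:
        raise ValueError("could not parse SMILES: %r" % smiles)
    dhmol = Chem.RemoveHs(mol)  # drop explicit hydrogens
    # Write out aromatic SMILES without canonicalizing, then read it as SMARTS.
    aromatic_smiles = Chem.MolToSmiles(dhmol, canonical=False)
    return Chem.MolFromSmarts(aromatic_smiles)

toluene = Chem.MolFromSmiles("Cc1ccccc1")
kekule_benzene_query = smiles_to_smarts_query("C1=CC=CC=C1")
print(toluene.HasSubstructMatch(kekule_benzene_query))
```

A Kekule benzene input ends up as an aromatic ring query, so it matches toluene but not cyclohexane.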


> But I know that MolFragmentToSmiles is a new API function, and I'm
> pretty certain that you do something else in your Postgres cartridge.
> We even had an exchange about a year ago on improving the SMARTS
> patterns which you use for screening.
>
> So, how do you screen so that you can use an input molecule as a query?
>

The PostgreSQL cartridge uses LayeredFingerprint2, which has a set of
extremely generic SMARTS queries:[1]
                      "[*]~[*]",
                      "[*]~[*]~[*]",
                      "[R]~1~[R]~[R]~1",
                      "[*]~[*]~[*]~[*]",
                      "[*]~[*](~[*])~[*]",
                      "[*]~[R]~1[R]~[R]~1",
                      "[R]~1[R]~[R]~[R]~1",
                      "[*]~[*]~[*]~[*]~[*]",
                      "[*]~[*]~[*](~[*])~[*]",
                      "[*]~[R]~1[R]~[R]~1[*]",
                      "[R]~1[R]~[R]~[R]~[R]~1",
                      "[R]~1[R]~[R]~[R]~[R]~[R]~1",

These don't suffer from the aromatic/aliphatic problem.
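
As a rough illustration of how patterns like these can act as a screen: the sketch below is not the cartridge or LayeredFingerprint2 code, and pattern_bits and passes_screen are hypothetical helper names. The idea is that a target can be ruled out whenever the query matches a pattern that the target does not.

```python
from rdkit import Chem

GENERIC_SMARTS = [
    "[*]~[*]",
    "[*]~[*]~[*]",
    "[R]~1~[R]~[R]~1",
    "[*]~[*]~[*]~[*]",
    "[*]~[*](~[*])~[*]",
    "[*]~[R]~1[R]~[R]~1",
    "[R]~1[R]~[R]~[R]~1",
    "[*]~[*]~[*]~[*]~[*]",
    "[*]~[*]~[*](~[*])~[*]",
    "[*]~[R]~1[R]~[R]~1[*]",
    "[R]~1[R]~[R]~[R]~[R]~1",
    "[R]~1[R]~[R]~[R]~[R]~[R]~1",
]
PATTERNS = [Chem.MolFromSmarts(s) for s in GENERIC_SMARTS]

def pattern_bits(mol):
    """One boolean per pattern: does the molecule contain it?"""
    return [mol.HasSubstructMatch(p) for p in PATTERNS]

def passes_screen(query, target):
    """False only if the query has a pattern that the target lacks."""
    return all(t or not q
               for q, t in zip(pattern_bits(query), pattern_bits(target)))

toluene = Chem.MolFromSmiles("Cc1ccccc1")
print(passes_screen(Chem.MolFromSmiles("CC"), toluene))
```

Passing the screen does not prove a match; the full substructure search still has to run on whatever survives.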

-greg
[1] and looking at these reminds me that I need to go back at some point
and finish the tuning work... <sigh>


