On Mon, Dec 3, 2012 at 9:58 PM, Andrew Dalke <[email protected]> wrote:
> On Dec 3, 2012, at 4:55 PM, Greg Landrum wrote:
> > Yes, it's here:
> >
> http://www.rdkit.org/docs/RDKit_Book.html#atom-atom-matching-in-substructure-queries
>
> Thanks.
>
> It's incomplete though - it doesn't show how bonds are matched nor
> how aromaticity is handled for atoms. Does a SMILES with a "C" mean
> that aromaticity is specified, and so that "c" is not matched? I
> can't determine that from the docs.
>
When a molecule (rather than a SMARTS pattern) is used as the query,
aromaticity is not part of the matching criteria for atoms.
Bonds are matched purely on bond type, with one exception: a bond of
unspecified type matches anything and is matched by anything.
> I suspect the following shows an incorrect implementation:
>
> >>> query = Chem.MolFromSmiles("CC")
> >>> target = Chem.MolFromSmiles("c1ccccc1C")
> >>> target.HasSubstructMatch(query)
> True
> >>> target = Chem.MolFromSmiles("c1ccccc1")
> >>> target.HasSubstructMatch(query)
> False
>
> I did not expect a "CC" to match the "cC".
>
Aromaticity is ignored, so this is correct.
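
A minimal sketch of the toluene case, assuming a standard RDKit install. It looks at which atoms the "CC" query lands on: the only candidate is the ring carbon/methyl pair, since that is the only carbon-carbon bond of type SINGLE (the ring bonds have type AROMATIC).

```python
from rdkit import Chem

query = Chem.MolFromSmiles("CC")
toluene = Chem.MolFromSmiles("c1ccccc1C")

# Atom indices in the target that the two query atoms map onto.
match = toluene.GetSubstructMatch(query)

# The bond joining the matched atoms, and the aromatic flags of its end atoms.
bond = toluene.GetBondBetweenAtoms(match[0], match[1])
bond_type = bond.GetBondType()
aromatic_flags = sorted(toluene.GetAtomWithIdx(i).GetIsAromatic() for i in match)

print(sorted(match), bond_type, aromatic_flags)
```

One matched atom is aromatic and the other is not, yet both are accepted for an aliphatic "C" in the query, because only the element and the bond type are compared.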
> There's also a strangeness in the following, where
> a single bond can match a double:
>
> >>> query = Chem.MolFromSmiles("CC")
> >>> target = Chem.MolFromSmiles("c1cccc1=C")
> >>> target.HasSubstructMatch(query)
> True
> >>> target = Chem.MolFromSmiles("c1ccccc1")
> >>> target.HasSubstructMatch(query)
> False
>
The single isn't matching a double; it's matching the ring bonds:
In [9]: p = Chem.MolFromSmiles("CC")
In [10]: m = Chem.MolFromSmiles("c1cccc1=C")
In [12]: m.GetSubstructMatch(p)
Out[12]: (0, 4)
Because that ring is not aromatic:
In [13]: Chem.MolToSmiles(m)
Out[13]: 'C=C1C=CC=C1'
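
To see the same thing bond by bond, here is a small sketch (again assuming a standard RDKit install) that lists the bond types of the sanitized molecule. The dictionary layout is just for illustration.

```python
from rdkit import Chem

m = Chem.MolFromSmiles("c1cccc1=C")

# After sanitization no atom is flagged aromatic, so the ring is kekulized.
any_aromatic = any(atom.GetIsAromatic() for atom in m.GetAtoms())

# Bond types keyed by the sorted (begin, end) atom index pair.
bond_types = {
    tuple(sorted((b.GetBeginAtomIdx(), b.GetEndAtomIdx()))): b.GetBondType()
    for b in m.GetBonds()
}
for pair, btype in sorted(bond_types.items()):
    print(pair, btype)
```

The ring-closure bond between atoms 0 and 4 is a single bond, which is what the "CC" query matches; the exocyclic bond between atoms 4 and 5 stays double.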
> ... even when I explicitly give a single bond:
>
> >>> query = Chem.MolFromSmiles("C-C")
> >>> target = Chem.MolFromSmiles("c1cccc1=C")
> >>> target.HasSubstructMatch(query)
> True
> >>> target = Chem.MolFromSmiles("c1ccccc1")
> >>> target.HasSubstructMatch(query)
> False
>
Same story. The query molecules are exactly the same in each case.
> The reason this is important to what I'm doing is that I
> am developing new SMARTS patterns for screening. One of my
> patterns is "CC". Consider the following case:
>
> My query is: CC1=Cc2ccccc2CN1
> My target is: c1ccc2c(c1)C=C3c4ccccc4C(=O)N3[C@@H]2O
>
> >>> query = Chem.MolFromSmiles("CC1=Cc2ccccc2CN1")
> >>> target = Chem.MolFromSmiles("c1ccc2c(c1)C=C3c4ccccc4C(=O)N3[C@@H]2O")
>
> Here's the code which does the screening.
>
> >>> screen = Chem.MolFromSmarts("CC")
> >>> query.HasSubstructMatch(screen)
> True
> >>> target.HasSubstructMatch(screen)
> False
> >>>
>
> This should mean that the target is screened out. However,
> RDKit says that the query is actually a substructure of the target:
>
> >>> target.HasSubstructMatch(query)
> True
>
> This means that the SMARTS pattern "CC" is a false screen.
>
That's correct behavior. The SMARTS pattern "CC" means [Aliphatic
Carbon](single or aromatic bond)[Aliphatic Carbon]. That pattern does not
exist in the target, but it does exist in the query.
> Based on this, it seems that I can't use SMARTS patterns to define
> a screen which is easily compatible with the molecule-based substructure
> matcher.
>
Sure you can, but you'll need to use atomic numbers in the SMARTS (e.g. [#6])
instead of letters in order to avoid the aromatic/aliphatic queries.
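
A short sketch of the difference, using the target from the example above and assuming a standard RDKit install. The variable names are only for illustration.

```python
from rdkit import Chem

target = Chem.MolFromSmiles("c1ccc2c(c1)C=C3c4ccccc4C(=O)N3[C@@H]2O")

# "C" in SMARTS means aliphatic carbon; "[#6]" means any carbon.
aliphatic_screen = Chem.MolFromSmarts("CC")
any_carbon_screen = Chem.MolFromSmarts("[#6][#6]")

aliphatic_hit = target.HasSubstructMatch(aliphatic_screen)
any_carbon_hit = target.HasSubstructMatch(any_carbon_screen)
print(aliphatic_hit, any_carbon_hit)
```

Note that the unspecified bond in "[#6][#6]" still means "single or aromatic" in SMARTS; writing "[#6]~[#6]" would accept any bond type.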
> What I think I can do is:
> 1) parse the SMILES for the query
> 2) remove any explicit hydrogens
> 3) use Chem.MolFragmentToSmiles to turn the de-hydrogenated molecule
> into a SMARTS string
> 4) convert the SMARTS into the actual query
>
I think you can just use Chem.MolToSmiles(dhmol,canonical=False), but
otherwise this flow looks ok. If you could "trust" your users to always
provide aromatic SMILES, you could just skip this whole mess and use
MolFromSmarts at the beginning. I guess you're trying to avoid that though.
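
The flow above could be sketched roughly like this, assuming a standard RDKit install. The helper name smiles_to_smarts_query is hypothetical, and MolToSmiles(..., canonical=False) stands in for MolFragmentToSmiles.

```python
from rdkit import Chem

def smiles_to_smarts_query(smiles):
    """Hypothetical helper: reinterpret a SMILES string as a SMARTS query."""
    mol = Chem.MolFromSmiles(smiles)                 # 1) parse the SMILES
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    dhmol = Chem.RemoveHs(mol)                       # 2) drop explicit hydrogens
    text = Chem.MolToSmiles(dhmol, canonical=False)  # 3) write it back out
    return Chem.MolFromSmarts(text)                  # 4) parse that text as SMARTS

query_smiles = "CC1=Cc2ccccc2CN1"
target = Chem.MolFromSmiles("c1ccc2c(c1)C=C3c4ccccc4C(=O)N3[C@@H]2O")

as_molecule = Chem.MolFromSmiles(query_smiles)
as_smarts = smiles_to_smarts_query(query_smiles)

molecule_hit = target.HasSubstructMatch(as_molecule)  # element + bond type only
smarts_hit = target.HasSubstructMatch(as_smarts)      # aromatic/aliphatic aware
print(molecule_hit, smarts_hit)
```

With the SMARTS form of the query, the aliphatic methyl carbon can no longer land on an aromatic atom of the target, so the query and a "CC" SMARTS screen agree with each other.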
> But I know that MolFragmentToSmiles is a new API function, and I'm
> pretty certain that you do something else in your Postgres cartridge.
> We even had an exchange about a year ago on improving the SMARTS
> patterns which you use for screening.
>
> So, how do you screen so that you can use an input molecule as a query?
>
The PostgreSQL cartridge uses LayeredFingerprint2, which has a set of
extremely generic SMARTS queries:[1]
"[*]~[*]",
"[*]~[*]~[*]",
"[R]~1~[R]~[R]~1",
"[*]~[*]~[*]~[*]",
"[*]~[*](~[*])~[*]",
"[*]~[R]~1[R]~[R]~1",
"[R]~1[R]~[R]~[R]~1",
"[*]~[*]~[*]~[*]~[*]",
"[*]~[*]~[*](~[*])~[*]",
"[*]~[R]~1[R]~[R]~1[*]",
"[R]~1[R]~[R]~[R]~[R]~1",
"[R]~1[R]~[R]~[R]~[R]~[R]~1",
These don't suffer from the aromatic/aliphatic problem.
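
As a rough illustration (not the fingerprint code itself), the patterns above can be applied directly as a screen. This sketch assumes a standard RDKit install; screen_bits is a hypothetical helper, and the real fingerprint hashes matches into bits rather than keeping a set of pattern indices.

```python
from rdkit import Chem

GENERIC_PATTERNS = [
    "[*]~[*]",
    "[*]~[*]~[*]",
    "[R]~1~[R]~[R]~1",
    "[*]~[*]~[*]~[*]",
    "[*]~[*](~[*])~[*]",
    "[*]~[R]~1[R]~[R]~1",
    "[R]~1[R]~[R]~[R]~1",
    "[*]~[*]~[*]~[*]~[*]",
    "[*]~[*]~[*](~[*])~[*]",
    "[*]~[R]~1[R]~[R]~1[*]",
    "[R]~1[R]~[R]~[R]~[R]~1",
    "[R]~1[R]~[R]~[R]~[R]~[R]~1",
]
COMPILED = [Chem.MolFromSmarts(s) for s in GENERIC_PATTERNS]

def screen_bits(mol):
    """Hypothetical helper: indices of the generic patterns found in mol."""
    return {i for i, patt in enumerate(COMPILED) if mol.HasSubstructMatch(patt)}

query = Chem.MolFromSmiles("CC1=Cc2ccccc2CN1")
target = Chem.MolFromSmiles("c1ccc2c(c1)C=C3c4ccccc4C(=O)N3[C@@H]2O")

query_bits = screen_bits(query)
target_bits = screen_bits(target)

# The target survives the screen only if it has every pattern the query has.
passes_screen = query_bits <= target_bits
print(sorted(query_bits), sorted(target_bits), passes_screen)
```

Because the patterns constrain only connectivity and ring membership, a molecule query that matches the target cannot set a bit that the target lacks.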
-greg
[1] and looking at these reminds me that I need to go back at some point
and finish the tuning work... <sigh>
_______________________________________________
Rdkit-discuss mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss