Re: [Rdkit-discuss] how to use structure as substructure query

Andrew Dalke Tue, 04 Dec 2012 02:39:12 -0800

I am beginning to realize the error of my ways.

This is the same issue which occurred in fmcs. Suppose
you have c1ccccc1C and CC. The MCS between those two is
[#6]-[#6]. Atom aromaticity is not useful when doing
a comparison.
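
As a rough sketch of that case, assuming an RDKit build that ships the
rdkit.Chem.rdFMCS module (an assumption; it is not necessarily the same
code as the fmcs mentioned above), the MCS of the two structures can be
computed with the default element-based atom comparison:

```python
from rdkit import Chem
from rdkit.Chem import rdFMCS

toluene = Chem.MolFromSmiles("c1ccccc1C")
ethane = Chem.MolFromSmiles("CC")

# By default FindMCS compares atoms by element only, so an aromatic
# carbon and an aliphatic carbon are treated as the same kind of atom.
result = rdFMCS.FindMCS([toluene, ethane])

# The common substructure is two carbons joined by one bond.
print(result.numAtoms, result.numBonds, result.smartsString)
```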

On Dec 4, 2012, at 5:32 AM, Greg Landrum wrote:
> Aromaticity is ignored, so this is correct.

Yes. Atom aromaticity is ignored. Bond aromaticity is not.

>>> target = Chem.MolFromSmiles("c1ccccc1")
>>> query = Chem.MolFromSmiles("C1CCCCC1")
>>> target.HasSubstructMatch(query)
False
>>> query.HasSubstructMatch(target)
False
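
A small sketch of the two halves of that statement, using toluene and
ethane as illustrative structures (they are my choice here, not part of
the original test case):

```python
from rdkit import Chem

toluene = Chem.MolFromSmiles("c1ccccc1C")
ethane = Chem.MolFromSmiles("CC")
benzene = Chem.MolFromSmiles("c1ccccc1")
cyclohexane = Chem.MolFromSmiles("C1CCCCC1")

# The ring-to-methyl bond in toluene is SINGLE, like the bond in ethane.
# One end of it is an aromatic atom, but atom aromaticity is not compared.
atom_aromaticity_ignored = toluene.HasSubstructMatch(ethane)

# Benzene has only AROMATIC bonds and cyclohexane only SINGLE bonds,
# so the bond comparison fails even though every atom is a carbon.
ring_bonds_match = benzene.HasSubstructMatch(cyclohexane)

print(atom_aromaticity_ignored, ring_bonds_match)
```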

It comes down to if the query bonds are marked as SINGLE
or AROMATIC. In the following, the bonds of the first
are all SINGLE, and in the second, AROMATIC:

>>> query = Chem.MolFromSmiles("[C]1[C][C][C][C][C]1")
>>> query.HasSubstructMatch(target)
False
>>> target.HasSubstructMatch(query)
False
>>> query = Chem.MolFromSmiles("[C]:1:[C]:[C]:[C]:[C]:[C]:1")
>>> target.HasSubstructMatch(query)
True
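
To check how the bonds of a parsed structure ended up being marked, the
bond types can be listed directly. This is only a diagnostic sketch;
bond_types is a hypothetical helper name, not an RDKit function:

```python
from rdkit import Chem

def bond_types(mol):
    """Return the set of RDKit bond types present in the molecule."""
    return {bond.GetBondType() for bond in mol.GetBonds()}

benzene = Chem.MolFromSmiles("c1ccccc1")
cyclohexane = Chem.MolFromSmiles("C1CCCCC1")

# Print the bond type names found in each structure.
for name, mol in (("benzene", benzene), ("cyclohexane", cyclohexane)):
    print(name, sorted(str(t) for t in bond_types(mol)))
```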

> The single isn't matching a double; it's matching the ring bonds:

D'oh! Yes, I forgot about the c1...c1 as really being "single
or aromatic", and where it's up to aromaticity perception to
mark it as really being single or aromatic. My downstream analysis
assumed that the rings would stay aromatic, and I didn't look
far enough into the details.

>> Based on this, it seems that I can't use SMARTS patterns to define
>> a screen which is easily compatible with the molecule-based substructure
>> matcher.
>> 
> Sure you can, but you'll need to use atomic numbers in the SMARTS instead of
> letters in order to avoid the aromatic/aliphatic queries.

Very good point. I'll have to go back and redo how I build my SMARTS fragments
in the first place. But then again, if I use aromaticity-free SMARTS then I'll
end up with a toolkit independent set of patterns, which is what you have:

>                       "[R]~1[R]~[R]~[R]~[R]~[R]~1",
> 
>      These don't suffer from the aromatic/aliphatic problem.
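
A quick sketch of the difference between letter-based and atomic-number
SMARTS for a six-membered carbon ring. The three patterns are
illustrative examples, not the fragments from my screen:

```python
from rdkit import Chem

benzene = Chem.MolFromSmiles("c1ccccc1")
cyclohexane = Chem.MolFromSmiles("C1CCCCC1")

patterns = {
    "aromatic letters": "c1ccccc1",
    "aliphatic letters": "C1CCCCC1",
    # [#6] matches any carbon, and an unspecified SMARTS bond means
    # "single or aromatic", so this pattern is aromaticity-free.
    "atomic numbers": "[#6]1[#6][#6][#6][#6][#6]1",
}

results = {}
for label, smarts in patterns.items():
    query = Chem.MolFromSmarts(smarts)
    results[label] = (benzene.HasSubstructMatch(query),
                      cyclohexane.HasSubstructMatch(query))
    print(label, results[label])
```

Only the atomic-number pattern matches both structures.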

However, this then places a larger burden on the substructure matcher,
since it will say that cyclohexane doesn't match benzene and vice versa,
even though the above screen does not distinguish between the two.

It seems that having some aromatic bond-based screens would help.
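
The gap between the screen and the matcher can be sketched like this.
passes_screen is a hypothetical helper that applies the usual screening
rule (reject only when the query has a feature the target lacks), and
the ring pattern is the corrected form of the one quoted above:

```python
from rdkit import Chem

ring_screen = Chem.MolFromSmarts("[R]~1~[R]~[R]~[R]~[R]~[R]~1")

def passes_screen(query, target, screen=ring_screen):
    """True unless the query has the screened feature and the target lacks it."""
    return (not query.HasSubstructMatch(screen)) or target.HasSubstructMatch(screen)

query = Chem.MolFromSmiles("C1CCCCC1")   # cyclohexane
target = Chem.MolFromSmiles("c1ccccc1")  # benzene

screened_in = passes_screen(query, target)
matched = target.HasSubstructMatch(query)

# The screen lets the pair through, and the matcher then rejects it.
print(screened_in, matched)
```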

BTW, there's a typo in the above. It should be "[R]~1~[R]~[R]~[R]~[R]~[R]~1".
The pattern is missing a '~' between the first and second [R], so that one
bond defaults to "single or aromatic". The pattern as written therefore only
misses six-membered rings made up entirely of double and triple bonds, and
those are .. explosive? At the very least, quite unlikely.
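
A sketch of the one case where the two patterns differ. The all-double-bond
carbon ring is a made-up test structure, parsed without sanitization so the
bonds stay exactly as written:

```python
from rdkit import Chem

typo = Chem.MolFromSmarts("[R]~1[R]~[R]~[R]~[R]~[R]~1")
fixed = Chem.MolFromSmarts("[R]~1~[R]~[R]~[R]~[R]~[R]~1")

# Hypothetical ring in which every bond is a double bond.
odd_ring = Chem.MolFromSmiles("C1=C=C=C=C=C=1", sanitize=False)
odd_ring.UpdatePropertyCache(strict=False)
Chem.FastFindRings(odd_ring)  # ring membership is needed for the [R] queries

benzene = Chem.MolFromSmiles("c1ccccc1")

# For ordinary rings both patterns behave the same way.
print(benzene.HasSubstructMatch(typo), benzene.HasSubstructMatch(fixed))

# The typo version needs one "single or aromatic" ring bond, which the
# all-double-bond ring does not have.
print(odd_ring.HasSubstructMatch(typo), odd_ring.HasSubstructMatch(fixed))
```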

Searching for *=1=*=*=*=*=*=1 finds 29 highly unusual structures
in PubChem:

Most of them are like
  http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=517944&loc=ec_rcs
  http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=5231534&loc=ec_rcs
where the bond type cannot be represented in SMILES.

There are a few tri-boronic rings
  http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=10170877&loc=ec_rcs
  http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=21598447&loc=ec_rcs
  http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=21598448&loc=ec_rcs
  http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=21732501&loc=ec_rcs
where the [B-]-[N+] or [B-]-[P+] is normalized to [B]=[N] or [B]=[P] for
the match. For example, one of the structures is:

  [B-]1([P+]([B-]([P+]([B-]([P+]1(C)C)(Cl)Cl)(C)C)(Cl)Cl)(C)C)(Cl)Cl

And then there's
  http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=19917582&loc=ec_rcs
which also has a charge-separated core of [N+]1([P-][N+]([P-][N+]([P-]1)
and where I could see the justification of saying that those match double bonds.

In other words, the missing '~' shouldn't affect any of the queries
you or anyone else has done with the cartridge.

>>  
>> What I think I can do is:
>>   1) parse the SMILES for the query
>>   2) remove any explicit hydrogens
>>   3) use Chem.MolFragmentToSmiles to turn the de-hydrogenated molecule
>>       into a SMARTS string
>>   4) convert the SMARTS into the actual query
>> 
> I think you can just use Chem.MolToSmiles(dhmol,canonical=False), but 
> otherwise this flow looks ok. If you could "trust" your users to always 
> provide aromatic SMILES, you could just skip this whole mess and use 
> MolFromSmarts at the beginning. I guess you're trying to avoid that though.

Ahh, yes, that would also produce a viable SMARTS.

I can't use MolFromSmarts from the beginning because "n1nc[nH]c1" has the
explicit hydrogen, which was not a user-specified constraint but only added
by the sketcher.
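
A sketch of why that matters. The N-methyl triazole target is an
illustrative choice, picked so that the ring nitrogen carries a
substituent instead of a hydrogen:

```python
from rdkit import Chem

sketch = "n1nc[nH]c1"                    # as produced by the sketcher
target = Chem.MolFromSmiles("Cn1cnnc1")  # 4-methyl-1,2,4-triazole

as_smiles = Chem.MolFromSmiles(sketch)
as_smarts = Chem.MolFromSmarts(sketch)

# As a molecule query the hydrogen count is not part of the match.
smiles_hit = target.HasSubstructMatch(as_smiles)

# As SMARTS, [nH] requires an aromatic nitrogen with exactly one hydrogen,
# which the methylated target does not have.
smarts_hit = target.HasSubstructMatch(as_smarts)

print(smiles_hit, smarts_hit)
```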

> [1] and looking at these reminds me that I need to go back at some point and 
> finish the tuning work...

Indeed. That is what I'm working on now. :)

                                Andrew

