Re: [Rdkit-discuss] possible SMARTS translating mistake?

Bodle, Christopher R Thu, 17 Sep 2015 08:18:44 -0700

All (and Greg),

After responding to Greg's email I read the email from Andrew Dalke for my 
other thread ("trouble with SMARTS interpretation of 'not hydrogen'") who 
informed me that it is not appropriate to to a sanitization of a molecule that 
comes from a SMARTS query, because this converts a query-based molecule in to a 
more chemical molecule and the query molecule loses some of it's query 
properties.  For example I had several molecule in the SMARTS with the first 
carbon atom labeled as [c,C].  During my sanitization it only kept c, which 
then threw up a sanitization error saying a non-aromatic molecule was labeled 
as aromatic.


I now believe that my initial PAINS filtration worked properly, and I just do 
not have very many compounds that were flagged as PAINS in this screen.

I would like to test this against the RDKit in house PAINS filters, but I ran 
in to a problem trying to implement them.  When I tried to run:
from rdkit.Chem import FilterCatalog
I got the error message:

ImportError: cannot import name FilterCatalog


Is there another package that I need to download in order to run the 
FliterCatalog functionality?  I do not see mention of it on this page.  
https://github.com/rdkit/rdkit/pull/536


Additionally I am using python, not C++

Thank you all very much for your help.


Christopher R. Bodle

PhD Candidate, University of Iowa

College of Pharmacy

Division of Medicinal and Natural Products Chemistry

115 S. Grand Avenue-Rm. S338

Iowa City, Iowa 52242

(319) 335-7845



________________________________
From: Bodle, Christopher R [christopher-bo...@uiowa.edu]
Sent: Thursday, September 17, 2015 8:47 AM
To: Greg Landrum
Cc: rdkit-discuss@lists.sourceforge.net
Subject: Re: [Rdkit-discuss] possible SMARTS translating mistake?

Greg,

Thanks for the reply.  I will clarify a little bit.

The example provided is one of the SMARTS representations of one of the PAINS 
compounds from Rjarshi Guha's blog.  My goal is to filter my list of hit 
compounds from an HTS campaign against these PAINS filters, primarily by using 
the .HasSubstructMatch function in RDKit.  I had already tested the filtering 
code with additional lists of problematic substructures found in the 
supplemental of Lagorce,Beall et.al. (FAF-Drugs3: a web server for compound 
property calculation and chemical library design), and those worked fine.  For 
example when I ran a filter with the Toxicophore subset, 122 of my 131 hit 
compounds were identified as having one or more toxicophore moieties.  When I 
ran the filtering code with non-standardized PAINS compounds I only got 
substructure matches with 3 of the 516 filter compounds.  It was then suggested 
to me that I should try and standardize the PAINS library.  To do this I found 
a standardizing function using the MolVS package, which is outlined here: 
http://molvs.readthedocs.org/en/latest/guide/intro.html

the standardization process that is utilized by the function s.standardize in 
my code below is outlined lower on that page.

When I filtered using the PAINS library after standardization, I now had 
matches with 10 of the 516 filter compounds, and 42 flagged compounds from the 
hit compound list (vs 21 flagged compounds with a non-standardized filter 
list), but I also had 201 compounds of the 516 that did not produce a 
standardized mol structure.

So I guess what I am trying to accomplish by standardizing the queries is put 
them in a standardized conformation that would allow for better results with 
.HasSubstructMatch.  I see now that one main reason behind the standardization 
not working is because I take a SMARTS string containing query features and try 
to make it a SMILES string for the standardization.  I only did this because 
the examples using MolVS uses a .MolFromSmiles.  So I will first try to simply 
use .MolFromSmarts format to see if that rectifies my problem.  I don't see why 
it wouldn't, since the input for s.standardize is a mol_file.  However if the 
standardization code is based on SMILES format then there may be an issue.  I 
will try today and report back to let the RDKit community know how it goes.

One last question, are there plans to have a new rendering code for python 
based RDKit users as well?

Thank you again Greg,



Christopher R. Bodle

PhD Candidate, University of Iowa

College of Pharmacy

Division of Medicinal and Natural Products Chemistry

115 S. Grand Avenue-Rm. S338

Iowa City, Iowa 52242

(319) 335-7845



________________________________
From: Greg Landrum [greg.land...@gmail.com]
Sent: Wednesday, September 16, 2015 8:03 PM
To: Bodle, Christopher R
Cc: rdkit-discuss@lists.sourceforge.net
Subject: Re: [Rdkit-discuss] possible SMARTS translating mistake?



On Tue, Sep 15, 2015 at 6:48 PM, Bodle, Christopher R 
<christopher-bo...@uiowa.edu<mailto:christopher-bo...@uiowa.edu>> wrote:

I am working on a filtering code in python to search for substructure matches 
against my hit list (in SMILES) and my filter lists (in SMARTS).  My current 
filter lists were copied from Rajarshi Guha's blog at 
http://blog.rguha.net/?p=850.

The topic of the new version of these SMARTS and the associated blog post came 
up elsewhere in the thread, so I don't need to raise that again.

While working on this I was working with the following SMARTS string from the 
p_l150 collection, filter purrole_A(118):


n2(-[#6]:1:[!#1]:[#6]:[#6]:[#6]:[#6]:1)c(cc(c2-[#6;X4])-[#1])-[#6;X4]

I have highlighted the problem area in the string.  Although this should be 
interpreted as 'not H', the rendering generated from Chem.MolFromSmarts does 
indeed result in a hydrogen in this position,

The rendering of molecules is not necessarily the best way to discover what the 
RDKit actually thinks they are. The rendering code uses the atomic number to 
determine what to use for atom labels, this doesn't make a lot of sense for 
queries. The C++-based rendering in the new version will do a more expliciti 
job here by indicating that there is a query feature. There's an example in 
"Out [22]" here: 
https://github.com/rdkit/UGM_2015/blob/master/Notebooks/Whats_new.ipynb

For molecules with query features: the best approach to determine what the 
RDKit is doing is to use SMARTS. In your case this gives:

In [2]: m = 
Chem.MolFromSmarts('n2(-[#6]:1:[!#1]:[#6]:[#6]:[#6]:[#6]:1)c(cc(c2-[#6;X4])-[#1])-[#6;X4]')

In [3]: Chem.MolToSmarts(m)
Out[3]: 
'n1(-[#6]2:[!#1]:[#6]:[#6]:[#6]:[#6]:2):,-c(-,:c:,-c(-,:c:,-1-[#6&X4])-[#1])-[#6&X4]'

In [4]: nm = Chem.MergeQueryHs(m)

In [5]: Chem.MolToSmarts(nm)
Out[5]: 
'n1(-[#6]2:[!#1]:[#6]:[#6]:[#6]:[#6]:2):,-c(-,:c:,-[c&!H0]-,:c:,-1-[#6&X4])-[#6&X4]'

You can see that the [!#1] is preserved both before and after the 
MergeQueryHs() call.

You can also test that the query is working properly by running it against 
molecules that you know should work. Here's a simple example of that which 
includes [!#1]:

In [6]: sm  = Chem.MolFromSmarts('[#6][!#1][#6]')

In [7]: Chem.MolToSmarts(sm)
Out[7]: '[#6]-,:[!#1]-,:[#6]'

In [8]: Chem.MolFromSmiles('CNC').GetSubstructMatch(sm)
Out[8]: (0, 1, 2)

In [9]: Chem.MolFromSmiles('CCC').GetSubstructMatch(sm)
Out[9]: (0, 1, 2)

In [10]: Chem.MolFromSmiles('C[H-]C').GetSubstructMatch(sm)
Out[10]: ()

which is in the middle of an aromatic ring and results in a valency issue and 
as such I can't standardize the mol for filtering purposes.

This "standardization" is what's causing the problems I think.

I confirmed this by making the following edit to the SMILES string:
n2(-[#6]:1:[!#6]:[#6]:[#6]:[#6]:[#6]:1)c(cc(c2-[#6;X4])-[#1])-[#6;X4]

Which results in a carbon in the position of the hydrogen from the original 
SMARTS.  Is this a problem with the SMARTS translator?  Or is there something 
that I am missing?

The above string is a SMARTS. It is not a valid SMILES since it contains query 
features. A molecule constructed from it using MolFromSmarts() also contains 
query features and so when you try to convert it to SMILES (as your code below 
does), you don't get an accurate representation.

What are you trying to accomplish by standardizing the queries and what are you 
actually doing when you standardize them?

Best,
-greg


I believe this happens quite frequently.  When running a standardization code 
for the filter p_l150 (55 compounds) using:

p_l150['standardized mol']=''
imax,jmax = p_l150.shape
for i in range(imax):
    mol_file =mf= p_l150.loc[i,'mol file']
    s = Standardizer()
    try:
        m = Chem.MolToSmiles(mf)
        m2 = standardize_smiles(m)
        m3 = Chem.MolFromSmiles(m2)
        smol = s.standardize(m3)
        p_l150.loc[i,'standardized mol'] = smol
    except Exception as e:
        print p_l150.loc[i,'filter'], e
p_l150

I return 11 errors, 8 of which are valency (7 of those involve hydrogens):


<regId="pyrrole_A(118)"> Sanitization error: Explicit valence for atom # 8 H, 
3, is greater than permitted
<regId="imine_one_fives(89)"> Sanitization error: Explicit valence for atom # 3 
H, 3, is greater than permitted
<regId="hzone_pipzn(79)"> Sanitization error: Explicit valence for atom # 3 H, 
2, is greater than permitted
<regId="hzone_pyrrol(64)"> Sanitization error: Can't kekulize mol

<regId="cyano_pyridone_A(54)"> Sanitization error: Explicit valence for atom # 
1 H, 3, is greater than permitted
<regId="het_pyridiniums_A(39)"> Sanitization error: Explicit valence for atom # 
5 H, 3, is greater than permitted
<regId="diazox_sulfon_A(36)"> Sanitization error: Explicit valence for atom # 
14 C, 5, is greater than permitted
<regId="pyrrole_B(29)"> Sanitization error: Explicit valence for atom # 9 H, 3, 
is greater than permitted
<regId="thiophene_hydroxy(28)"> Sanitization error: Can't kekulize mol

<regId="imidazole_A(19)"> Sanitization error: Explicit valence for atom # 4 H, 
2, is greater than permitted
<regId="het_6_tetrazine(18)"> Sanitization error: Aromatic bonds on non 
aromatic atom 1


Any insight would be greatly appreciated.


Thank you


Christopher R. Bodle

PhD Candidate, University of Iowa

College of Pharmacy

Division of Medicinal and Natural Products Chemistry

115 S. Grand Avenue-Rm. S338

Iowa City, Iowa 52242

(319) 335-7845<tel:%28319%29%20335-7845>



------------------------------------------------------------------------------

_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net<mailto:Rdkit-discuss@lists.sourceforge.net>
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

------------------------------------------------------------------------------
Monitor Your Dynamic Infrastructure at Any Scale With Datadog!
Get real-time metrics from all of your servers, apps and tools
in one place.
SourceForge users - Click here to start your Free Trial of Datadog now!
http://pubads.g.doubleclick.net/gampad/clk?id=241902991&iu=/4140

_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

Re: [Rdkit-discuss] possible SMARTS translating mistake?

Reply via email to