All (and Greg),
After responding to Greg's email I read the email from Andrew Dalke for my
other thread ("trouble with SMARTS interpretation of 'not hydrogen'") who
informed me that it is not appropriate to to a sanitization of a molecule that
comes from a SMARTS query, because this converts a query-based molecule in to a
more chemical molecule and the query molecule loses some of it's query
properties. For example I had several molecule in the SMARTS with the first
carbon atom labeled as [c,C]. During my sanitization it only kept c, which
then threw up a sanitization error saying a non-aromatic molecule was labeled
as aromatic.
I now believe that my initial PAINS filtration worked properly, and I just do
not have very many compounds that were flagged as PAINS in this screen.
I would like to test this against the RDKit in house PAINS filters, but I ran
in to a problem trying to implement them. When I tried to run:
from rdkit.Chem import FilterCatalog
I got the error message:
ImportError: cannot import name FilterCatalog
Is there another package that I need to download in order to run the
FliterCatalog functionality? I do not see mention of it on this page.
https://github.com/rdkit/rdkit/pull/536
Additionally I am using python, not C++
Thank you all very much for your help.
Christopher R. Bodle
PhD Candidate, University of Iowa
College of Pharmacy
Division of Medicinal and Natural Products Chemistry
115 S. Grand Avenue-Rm. S338
Iowa City, Iowa 52242
(319) 335-7845
________________________________
From: Bodle, Christopher R [christopher-bo...@uiowa.edu]
Sent: Thursday, September 17, 2015 8:47 AM
To: Greg Landrum
Cc: rdkit-discuss@lists.sourceforge.net
Subject: Re: [Rdkit-discuss] possible SMARTS translating mistake?
Greg,
Thanks for the reply. I will clarify a little bit.
The example provided is one of the SMARTS representations of one of the PAINS
compounds from Rjarshi Guha's blog. My goal is to filter my list of hit
compounds from an HTS campaign against these PAINS filters, primarily by using
the .HasSubstructMatch function in RDKit. I had already tested the filtering
code with additional lists of problematic substructures found in the
supplemental of Lagorce,Beall et.al. (FAF-Drugs3: a web server for compound
property calculation and chemical library design), and those worked fine. For
example when I ran a filter with the Toxicophore subset, 122 of my 131 hit
compounds were identified as having one or more toxicophore moieties. When I
ran the filtering code with non-standardized PAINS compounds I only got
substructure matches with 3 of the 516 filter compounds. It was then suggested
to me that I should try and standardize the PAINS library. To do this I found
a standardizing function using the MolVS package, which is outlined here:
http://molvs.readthedocs.org/en/latest/guide/intro.html
the standardization process that is utilized by the function s.standardize in
my code below is outlined lower on that page.
When I filtered using the PAINS library after standardization, I now had
matches with 10 of the 516 filter compounds, and 42 flagged compounds from the
hit compound list (vs 21 flagged compounds with a non-standardized filter
list), but I also had 201 compounds of the 516 that did not produce a
standardized mol structure.
So I guess what I am trying to accomplish by standardizing the queries is put
them in a standardized conformation that would allow for better results with
.HasSubstructMatch. I see now that one main reason behind the standardization
not working is because I take a SMARTS string containing query features and try
to make it a SMILES string for the standardization. I only did this because
the examples using MolVS uses a .MolFromSmiles. So I will first try to simply
use .MolFromSmarts format to see if that rectifies my problem. I don't see why
it wouldn't, since the input for s.standardize is a mol_file. However if the
standardization code is based on SMILES format then there may be an issue. I
will try today and report back to let the RDKit community know how it goes.
One last question, are there plans to have a new rendering code for python
based RDKit users as well?
Thank you again Greg,
Christopher R. Bodle
PhD Candidate, University of Iowa
College of Pharmacy
Division of Medicinal and Natural Products Chemistry
115 S. Grand Avenue-Rm. S338
Iowa City, Iowa 52242
(319) 335-7845
________________________________
From: Greg Landrum [greg.land...@gmail.com]
Sent: Wednesday, September 16, 2015 8:03 PM
To: Bodle, Christopher R
Cc: rdkit-discuss@lists.sourceforge.net
Subject: Re: [Rdkit-discuss] possible SMARTS translating mistake?
On Tue, Sep 15, 2015 at 6:48 PM, Bodle, Christopher R
<christopher-bo...@uiowa.edu<mailto:christopher-bo...@uiowa.edu>> wrote:
I am working on a filtering code in python to search for substructure matches
against my hit list (in SMILES) and my filter lists (in SMARTS). My current
filter lists were copied from Rajarshi Guha's blog at
http://blog.rguha.net/?p=850.
The topic of the new version of these SMARTS and the associated blog post came
up elsewhere in the thread, so I don't need to raise that again.
While working on this I was working with the following SMARTS string from the
p_l150 collection, filter purrole_A(118):
n2(-[#6]:1:[!#1]:[#6]:[#6]:[#6]:[#6]:1)c(cc(c2-[#6;X4])-[#1])-[#6;X4]
I have highlighted the problem area in the string. Although this should be
interpreted as 'not H', the rendering generated from Chem.MolFromSmarts does
indeed result in a hydrogen in this position,
The rendering of molecules is not necessarily the best way to discover what the
RDKit actually thinks they are. The rendering code uses the atomic number to
determine what to use for atom labels, this doesn't make a lot of sense for
queries. The C++-based rendering in the new version will do a more expliciti
job here by indicating that there is a query feature. There's an example in
"Out [22]" here:
https://github.com/rdkit/UGM_2015/blob/master/Notebooks/Whats_new.ipynb
For molecules with query features: the best approach to determine what the
RDKit is doing is to use SMARTS. In your case this gives:
In [2]: m =
Chem.MolFromSmarts('n2(-[#6]:1:[!#1]:[#6]:[#6]:[#6]:[#6]:1)c(cc(c2-[#6;X4])-[#1])-[#6;X4]')
In [3]: Chem.MolToSmarts(m)
Out[3]:
'n1(-[#6]2:[!#1]:[#6]:[#6]:[#6]:[#6]:2):,-c(-,:c:,-c(-,:c:,-1-[#6&X4])-[#1])-[#6&X4]'
In [4]: nm = Chem.MergeQueryHs(m)
In [5]: Chem.MolToSmarts(nm)
Out[5]:
'n1(-[#6]2:[!#1]:[#6]:[#6]:[#6]:[#6]:2):,-c(-,:c:,-[c&!H0]-,:c:,-1-[#6&X4])-[#6&X4]'
You can see that the [!#1] is preserved both before and after the
MergeQueryHs() call.
You can also test that the query is working properly by running it against
molecules that you know should work. Here's a simple example of that which
includes [!#1]:
In [6]: sm = Chem.MolFromSmarts('[#6][!#1][#6]')
In [7]: Chem.MolToSmarts(sm)
Out[7]: '[#6]-,:[!#1]-,:[#6]'
In [8]: Chem.MolFromSmiles('CNC').GetSubstructMatch(sm)
Out[8]: (0, 1, 2)
In [9]: Chem.MolFromSmiles('CCC').GetSubstructMatch(sm)
Out[9]: (0, 1, 2)
In [10]: Chem.MolFromSmiles('C[H-]C').GetSubstructMatch(sm)
Out[10]: ()
which is in the middle of an aromatic ring and results in a valency issue and
as such I can't standardize the mol for filtering purposes.
This "standardization" is what's causing the problems I think.
I confirmed this by making the following edit to the SMILES string:
n2(-[#6]:1:[!#6]:[#6]:[#6]:[#6]:[#6]:1)c(cc(c2-[#6;X4])-[#1])-[#6;X4]
Which results in a carbon in the position of the hydrogen from the original
SMARTS. Is this a problem with the SMARTS translator? Or is there something
that I am missing?
The above string is a SMARTS. It is not a valid SMILES since it contains query
features. A molecule constructed from it using MolFromSmarts() also contains
query features and so when you try to convert it to SMILES (as your code below
does), you don't get an accurate representation.
What are you trying to accomplish by standardizing the queries and what are you
actually doing when you standardize them?
Best,
-greg
I believe this happens quite frequently. When running a standardization code
for the filter p_l150 (55 compounds) using:
p_l150['standardized mol']=''
imax,jmax = p_l150.shape
for i in range(imax):
mol_file =mf= p_l150.loc[i,'mol file']
s = Standardizer()
try:
m = Chem.MolToSmiles(mf)
m2 = standardize_smiles(m)
m3 = Chem.MolFromSmiles(m2)
smol = s.standardize(m3)
p_l150.loc[i,'standardized mol'] = smol
except Exception as e:
print p_l150.loc[i,'filter'], e
p_l150
I return 11 errors, 8 of which are valency (7 of those involve hydrogens):
<regId="pyrrole_A(118)"> Sanitization error: Explicit valence for atom # 8 H,
3, is greater than permitted
<regId="imine_one_fives(89)"> Sanitization error: Explicit valence for atom # 3
H, 3, is greater than permitted
<regId="hzone_pipzn(79)"> Sanitization error: Explicit valence for atom # 3 H,
2, is greater than permitted
<regId="hzone_pyrrol(64)"> Sanitization error: Can't kekulize mol
<regId="cyano_pyridone_A(54)"> Sanitization error: Explicit valence for atom #
1 H, 3, is greater than permitted
<regId="het_pyridiniums_A(39)"> Sanitization error: Explicit valence for atom #
5 H, 3, is greater than permitted
<regId="diazox_sulfon_A(36)"> Sanitization error: Explicit valence for atom #
14 C, 5, is greater than permitted
<regId="pyrrole_B(29)"> Sanitization error: Explicit valence for atom # 9 H, 3,
is greater than permitted
<regId="thiophene_hydroxy(28)"> Sanitization error: Can't kekulize mol
<regId="imidazole_A(19)"> Sanitization error: Explicit valence for atom # 4 H,
2, is greater than permitted
<regId="het_6_tetrazine(18)"> Sanitization error: Aromatic bonds on non
aromatic atom 1
Any insight would be greatly appreciated.
Thank you
Christopher R. Bodle
PhD Candidate, University of Iowa
College of Pharmacy
Division of Medicinal and Natural Products Chemistry
115 S. Grand Avenue-Rm. S338
Iowa City, Iowa 52242
(319) 335-7845<tel:%28319%29%20335-7845>
------------------------------------------------------------------------------
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net<mailto:Rdkit-discuss@lists.sourceforge.net>
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
------------------------------------------------------------------------------
Monitor Your Dynamic Infrastructure at Any Scale With Datadog!
Get real-time metrics from all of your servers, apps and tools
in one place.
SourceForge users - Click here to start your Free Trial of Datadog now!
http://pubads.g.doubleclick.net/gampad/clk?id=241902991&iu=/4140
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss