Re: [Rdkit-discuss] trouble with SMARTs interpretation of 'not hydrogen'

2015-09-17 Thread Bodle, Christopher R
Andrew,

Thank you for the input.  When I went looking for a full example after you 
asked for one, I tried to find a hit compound that had escaped a PAINS flag 
because of an incorrect interpretation of !#n, and I couldn't find any.  In 
fact, when I looked more closely at my sanitized PAINS flags, I found that the 
new sanitized filter queries were the ones incorrectly flagging molecules, for 
example flagging a dimethoxybenzene moiety as a catechol.

Thank you for your help in this, and I will keep in mind in the future that it 
is inappropriate to try and sanitize SMARTS queries.
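
One way a cleaned-up query can start over-matching is by losing its hydrogen 
constraints. The sketch below is an illustration only: the two SMARTS strings 
are hypothetical stand-ins, not the actual PAINS catechol filter or the 
sanitized query from this thread. It compares a catechol-like query that keeps 
its explicit [#1] atoms with the same query after the hydrogens are dropped.

```python
from rdkit import Chem

catechol = Chem.MolFromSmiles("Oc1ccccc1O")
dimethoxybenzene = Chem.MolFromSmiles("COc1ccccc1OC")

# Hypothetical catechol-like query: each oxygen must carry a hydrogen.
# MergeQueryHs folds the [#1] atoms into H-count constraints on the oxygens.
with_h = Chem.MergeQueryHs(Chem.MolFromSmarts("c(-[#8]-[#1])c(-[#8]-[#1])"))

# The same query after the hydrogen atoms have been lost.
without_h = Chem.MolFromSmarts("c(-[#8])c(-[#8])")

for name, mol in [("catechol", catechol), ("dimethoxybenzene", dimethoxybenzene)]:
    print(name, mol.HasSubstructMatch(with_h), mol.HasSubstructMatch(without_h))
```

The query that keeps its hydrogens should match only the catechol, while the 
stripped query should match both molecules.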

Thanks again


Christopher R. Bodle
PhD Candidate, University of Iowa
College of Pharmacy
Division of Medicinal and Natural Products Chemistry
115 S. Grand Avenue-Rm. S338
Iowa City, Iowa 52242
(319) 335-7845




From: Andrew Dalke [da...@dalkescientific.com]
Sent: Wednesday, September 16, 2015 5:23 PM
Cc: rdkit-discuss@lists.sourceforge.net
Subject: Re: [Rdkit-discuss] trouble with SMARTs interpretation of 'not 
hydrogen'

On Sep 16, 2015, at 9:57 PM, Bodle, Christopher R wrote:
> I am having trouble with RDKit correctly interpreting the SMARTS primitive 
> [!#1], which should be interpreted as "any atom not hydrogen".

I've been looking at your emails but it's difficult for me to figure out what 
you are doing. Can you generate a smaller reproducible example?

My guess is that you are looking at the RDKit depiction of a molecule generated 
from a SMARTS string. This is a query molecule. As I recall, the depiction of 
query molecules is incomplete, and there is an open call for someone interested 
in writing a better query depiction. If that's the case, then what you see is 
the inability of the renderer to display a "not". This shouldn't affect the 
ability to match a molecule.

I also don't understand this:

> My SMARTS input:
> [#6]-1(=[!#1]-[!#1]=[!#1]-[#7](-[#6]-1=[#16])-[#1])-[#6]#[#7]
>
> Now when I do Chem.MolFromSmarts, my mol representation has hydrogens at 
> those three positions, so I can't sanitize the molecule: with hydrogens in 
> the !#1 positions, there is a valency conflict.

It doesn't make sense to me to sanitize a molecule that came from a SMARTS 
query.

It looks like you have tried to convert a query-based molecule into a more 
chemical molecule. That is, I can reproduce some of what you report by using:

  >>> from rdkit import Chem
  >>> mol = Chem.MolFromSmarts(
  ...     "[#6]-1(=[!#1]-[!#1]=[!#1]-[#7](-[#6]-1=[#16])-[#1])-[#6]#[#7]")
  >>> Chem.MolToSmiles(mol)
  '[H]N1[H]=[H][H]=C(C#N)C1=S'

This produces a nearly meaningless conversion. For example, consider:

  >>> mol = Chem.MolFromSmarts("[#92,#93][$(N=N)]")
  >>> Chem.MolToSmiles(mol)
  '[*][U]'
  >>> mol = Chem.MolFromSmarts("[#93,#92][$(N=N)]")
  >>> Chem.MolToSmiles(mol)
  '[*][Np]'

When there is a choice of atoms, it picks the first, giving 'U' and 'Np' when I 
swap the two element numbers. And it shows a recursive SMARTS as a '*'.
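
To inspect what a query molecule actually encodes, a SMARTS writer is a better 
fit than a SMILES writer, because it keeps the query features. A minimal sketch 
using Chem.MolToSmarts and the per-atom GetSmarts method follows. The exact 
output strings may vary between RDKit releases, so none are shown here.

```python
from rdkit import Chem

query = Chem.MolFromSmarts("[#92,#93][$(N=N)]")

# Write the query back out as SMARTS rather than SMILES.
smarts = Chem.MolToSmarts(query)
print(smarts)

# Per-atom view of the same information.
for atom in query.GetAtoms():
    print(atom.GetIdx(), atom.GetSmarts())

# The round-tripped string should still parse as a query.
reparsed = Chem.MolFromSmarts(smarts)
print(reparsed is not None)
```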

As far as I can tell, the "[!#1]" works correctly. Here's a case where it 
matches an 'N':

  >>> pat = Chem.MolFromSmarts("C-[!#1]-C")

  >>> mol = Chem.MolFromSmiles("CNC")
  >>> mol.HasSubstructMatch(pat)
  True

RDKit won't parse a 2-valent hydrogen by default:

  >>> mol = Chem.MolFromSmiles("C[H]C")
  [00:15:07] Explicit valence for atom # 1 H, 2, is greater than permitted

but if I disable sanitization, I can show that the pattern doesn't match this 
molecule:

  >>> mol = Chem.MolFromSmiles("C[H]C", sanitize=False)
  >>> mol.HasSubstructMatch(pat)
  False

And to double-check that the sanitize flag isn't doing something odd:

  >>> mol = Chem.MolFromSmiles("C[N]C", sanitize=False)
  >>> mol.HasSubstructMatch(pat)
  True

Since the SMARTS pattern doesn't work for you, but does seem to work for me, 
could you give a test case which is just the SMILES/SMARTS or molfile/SMARTS 
combination which gives the failure? That is, without the incomplete 
scaffolding that you showed.
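
A test case of the kind requested here can be as small as one SMILES string, 
one SMARTS string, and the answer you expected. A minimal sketch follows; the 
helper name check_match is made up for illustration and is not part of RDKit.

```python
from rdkit import Chem

def check_match(smiles, smarts, sanitize=True):
    """Return True if the SMARTS pattern matches the molecule built from SMILES."""
    mol = Chem.MolFromSmiles(smiles, sanitize=sanitize)
    pat = Chem.MolFromSmarts(smarts)
    if mol is None or pat is None:
        raise ValueError("could not parse %r / %r" % (smiles, smarts))
    return mol.HasSubstructMatch(pat)

# The two cases from the session above.
print(check_match("CNC", "C-[!#1]-C"))
print(check_match("C[H]C", "C-[!#1]-C", sanitize=False))
```

Posting the two strings and the observed result is usually enough for someone 
else to reproduce the behavior.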


Cheers,

Andrew
da...@dalkescientific.com



___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] possible SMARTS translating mistake?

2015-09-17 Thread Bodle, Christopher R
All (and Greg),

After responding to Greg's email I read the email from Andrew Dalke in my 
other thread ("trouble with SMARTs interpretation of 'not hydrogen'"), who 
informed me that it is not appropriate to sanitize a molecule that comes from 
a SMARTS query, because this converts a query-based molecule into a more 
chemical molecule and the query molecule loses some of its query properties.  
For example, I had several SMARTS patterns with the first carbon atom labeled 
as [c,C].  During my sanitization only the c was kept, which then threw a 
sanitization error saying a non-aromatic molecule was labeled as aromatic.

I now believe that my initial PAINS filtration worked properly, and I just do 
not have very many compounds that were flagged as PAINS in this screen.

I would like to test this against the RDKit in-house PAINS filters, but I ran 
into a problem trying to use them.  When I tried to run:

  from rdkit.Chem import FilterCatalog

I got the error message:

  ImportError: cannot import name FilterCatalog

Is there another package that I need to download in order to use the 
FilterCatalog functionality?  I do not see it mentioned on this page:
https://github.com/rdkit/rdkit/pull/536

Additionally, I am using Python, not C++.
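
For reference, in RDKit builds that include the FilterCatalog module, the 
built-in PAINS filters can be used roughly as follows. An ImportError like the 
one above is what one would expect from a build that predates the module, 
though that is an assumption about this particular installation. The test 
molecule is an arbitrary catechol chosen for illustration.

```python
from rdkit import Chem
from rdkit.Chem.FilterCatalog import FilterCatalog, FilterCatalogParams

# Build a catalog holding the PAINS filter sets shipped with RDKit.
params = FilterCatalogParams()
params.AddCatalog(FilterCatalogParams.FilterCatalogs.PAINS)
catalog = FilterCatalog(params)

mol = Chem.MolFromSmiles("Oc1ccccc1O")  # arbitrary example molecule

# Print the description of every catalog entry the molecule matches.
for entry in catalog.GetMatches(mol):
    print(entry.GetDescription())
```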

Thank you all very much for your help.




____
From: Bodle, Christopher R [christopher-bo...@uiowa.edu]
Sent: Thursday, September 17, 2015 8:47 AM
To: Greg Landrum
Cc: rdkit-discuss@lists.sourceforge.net
Subject: Re: [Rdkit-discuss] possible SMARTS translating mistake?

Greg,

Thanks for the reply.  I will clarify a little bit.

The example provided is one of the SMARTS representations of one of the PAINS 
compounds from Rajarshi Guha's blog.  My goal is to filter my list of hit 
compounds from an HTS campaign against these PAINS filters, primarily by using 
the .HasSubstructMatch function in RDKit.  I had already tested the filtering 
code with additional lists of problematic substructures found in the 
supplementary material of Lagorce, Baell et al. (FAF-Drugs3: a web server for 
compound property calculation and chemical library design), and those worked 
fine.  For example, when I ran a filter with the Toxicophore subset, 122 of my 
131 hit compounds were identified as having one or more toxicophore moieties.  
When I ran the filtering code with non-standardized PAINS compounds, I only got 
substructure matches with 3 of the 516 filter compounds.  It was then suggested 
to me that I should try to standardize the PAINS library.  To do this I found 
a standardizing function in the MolVS package, which is outlined here: 
http://molvs.readthedocs.org/en/latest/guide/intro.html

The standardization process used by the function s.standardize in my code 
below is outlined lower on that page.

When I filtered using the PAINS library after standardization, I now had 
matches with 10 of the 516 filter compounds, and 42 flagged compounds from the 
hit compound list (vs 21 flagged compounds with a non-standardized filter 
list), but I also had 201 compounds of the 516 that did not produce a 
standardized mol structure.

So I guess what I am trying to accomplish by standardizing the queries is to 
put them in a standardized form that would allow for better results with 
.HasSubstructMatch.  I see now that one main reason the standardization is not 
working is that I take a SMARTS string containing query features and try to 
make it a SMILES string for the standardization.  I only did this because the 
examples using MolVS use .MolFromSmiles.  So I will first try simply using 
.MolFromSmarts to see if that rectifies my problem.  I don't see why it 
wouldn't, since the input for s.standardize is a mol object.  However, if the 
standardization code is based on the SMILES format then there may be an issue.  
I will try today and report back to let the RDKit community know how it goes.

One last question: are there plans to have new rendering code for Python-based 
RDKit users as well?

Thank you again Greg,





From: Greg Landrum [greg.land...@gmail.com]
Sent: Wednesday, September 16, 2015 8:03 PM
To: Bodle, Christopher R
Cc: rdkit-discuss@lists.sourceforge.net
Subject: Re: [Rdkit-discuss] possible SMARTS translating mistake?



On Tue, Sep 15, 2015 at 6:48 PM, Bodle, Christopher R 
<christopher-bo...@uiowa.edu<mailto:christopher-bo...@uiowa.edu>> wrote:

I am working on a filtering code in python to search for substructure matches 
against my hit list (in SMILE

Re: [Rdkit-discuss] possible SMARTS translating mistake?

2015-09-16 Thread Bodle, Christopher R
Maciek,

Thank you for the resource.  I actually had based my initial troubleshooting 
efforts on that blog post.  In retrospect I should have included that 
information in my original post.  Here is the basic code for how I filter my 
hit list against a filter list.

import numpy as np
from rdkit import Chem

def get_compound_molfile(Compound_ID):
    # Look up the mol object (column 21) for a compound ID in the global
    # `inhibitors` DataFrame; returns [] if the ID is not found.
    imax, jmax = inhibitors.shape
    mol_file = []
    for i in range(imax):
        compound_data = inhibitors.iloc[i, :]
        if Compound_ID in compound_data.values:
            mol_file = inhibitors.iloc[i, 21]
    return mol_file

def filter_hits(mol_file, filter_list):
    # Return the names (column 1) of all filter queries (column 2) that
    # match the hit compound, or NaN if none match.
    imax, jmax = filter_list.shape
    filter_matches = []
    mfh = Chem.AddHs(mol_file)  # explicit Hs on the hit compound
    for i in range(imax):
        fcm = filter_list.iloc[i, 2]
        fcmh = Chem.MergeQueryHs(fcm)  # merge query Hs into their neighbors
        if mfh.HasSubstructMatch(fcmh):
            filter_matches.append(filter_list.iloc[i, 1])
    if len(filter_matches) > 0:
        return str(filter_matches)
    return np.nan

def filter_hit_list(hit_list, filter_list):
    # Write the matched filter names into the last column of a copy of
    # the hit list.
    filtered_list = hit_list.copy()
    imax, jmax = hit_list.shape
    for i in range(imax):
        Compound_ID = hit_list.iloc[i, 0]
        m = get_compound_molfile(Compound_ID)
        p = filter_hits(m, filter_list)
        filtered_list.iloc[i, jmax - 1] = str(p)
    return filtered_list

In the second function (filter_hits) I add Hs to the hit compound mol_file with 
Chem.AddHs, and I merge the Hs of the filter_list compound mol_file with 
Chem.MergeQueryHs.  Since the blog mentioned in your email showed that the 
HasSubstructMatch function works when both inputs have their respective 
hydrogens in the structure representation, I decided to cover my bases and make 
sure I wasn't missing any hydrogens from either species.
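
The same matching logic can be sketched without the DataFrame bookkeeping. In 
the example below the filter names and SMARTS strings are placeholders invented 
for illustration, not entries from the PAINS lists, and flag_hits is a 
hypothetical helper name.

```python
from rdkit import Chem

# Placeholder filters (name -> SMARTS); not taken from any published list.
FILTERS = {
    "nitrile": "[#6]#[#7]",
    "phenol_OH": "c-[#8]-[#1]",
}

def flag_hits(smiles_list, filters):
    """Map each SMILES to the names of the filters it matches."""
    # Parse each SMARTS once; MergeQueryHs turns query [#1] atoms into
    # hydrogen-count constraints on the atoms they are attached to.
    queries = {name: Chem.MergeQueryHs(Chem.MolFromSmarts(s))
               for name, s in filters.items()}
    flags = {}
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            flags[smi] = None  # unparsable input
            continue
        flags[smi] = [n for n, q in queries.items() if mol.HasSubstructMatch(q)]
    return flags

result = flag_hits(["CC#N", "Oc1ccccc1", "COc1ccccc1"], FILTERS)
print(result)
```

Because the merged query constrains hydrogen counts rather than requiring 
hydrogen atoms, this sketch does not call Chem.AddHs on the target molecules.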





From: Maciek Wójcikowski [mac...@wojcikowski.pl]
Sent: Wednesday, September 16, 2015 3:22 AM
To: Bodle, Christopher R
Cc: rdkit-discuss@lists.sourceforge.net
Subject: Re: [Rdkit-discuss] possible SMARTS translating mistake?

Hi Christopher,

Since you're mentioning Rajarshi's SMARTS, I guess that you haven't seen Greg's 
latest revision of the PAINS filters (see 
http://rdkit.blogspot.com.es/2015/08/curating-pains-filters.html). On the other 
hand, I remember Greg saying during the RDKit UGM that some of the filters 
would require changes to RDKit's aromaticity model, and this one seems to be 
such a case (Greg might confirm/check?).

Best,
Maciej

2015-09-15 18:48 GMT+02:00 Bodle, Christopher R 
<christopher-bo...@uiowa.edu<mailto:christopher-bo...@uiowa.edu>>:
All,

I am working on a filtering code in python to search for substructure matches 
against my hit list (in SMILES) and my filter lists (in SMARTS).  My current 
filter lists were copied from Rajarshi Guha's blog at 
http://blog.rguha.net/?p=850.

While working on this I ran into a problem with the following SMARTS string 
from the p_l150 collection, filter pyrrole_A(118):


n2(-[#6]:1:[!#1]:[#6]:[#6]:[#6]:[#6]:1)c(cc(c2-[#6;X4])-[#1])-[#6;X4]

The problem area in the string is the [!#1] atom.  Although this should be 
interpreted as 'not H', the rendering generated from Chem.MolFromSmarts does 
indeed show a hydrogen in this position, which is in the middle of an 
aromatic ring and results in a valency issue, and as such I can't standardize 
the mol for filtering purposes.

I confirmed this by making the following edit to the SMARTS string:
n2(-[#6]:1:[!#6]:[#6]:[#6]:[#6]:[#6]:1)c(cc(c2-[#6;X4])-[#1])-[#6;X4]

Which results in a carbon in the position of the hydrogen from the original 
SMARTS.  Is this a problem with the SMARTS translator?  Or is there something 
that I am missing?

I believe this happens quite frequently.  When running a standardization code 
for the filter p_l150 (55 compounds) using:

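from rdkit import Chem
from molvs import Standardizer, standardize_smiles

p_l150['standardized mol'] = ''
imax, jmax = p_l150.shape
s = Standardizer()
for i in range(imax):
    mf = p_l150.loc[i, 'mol file']
    try:
        # Round-trip the query mol through SMILES, then standardize it
        m = Chem.MolToSmiles(mf)
        m2 = standardize_smiles(m)
        m3 = Chem.MolFromSmiles(m2)
        smol = s.standardize(m3)
        p_l150.loc[i, 'standardized mol'] = smol
    except Exception as e:
        # Report the filter name together with the error it raised
        print(p_l150.loc[i, 'filter'], e)
p_l150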

I return 11 errors, 8 of which are valency (7 of those involve hydrogens):



[Rdkit-discuss] trouble with SMARTs interpretation of 'not hydrogen'

2015-09-16 Thread Bodle, Christopher R
All,

I touched on this subject yesterday, but wanted to add some more information 
today as I haven't received a response yet.  I am having trouble with RDKit 
correctly interpreting the SMARTS primitive [!#1], which should be interpreted 
as "any atom not hydrogen".  Let me give you an example:

My SMARTS input:

[#6]-1(=[!#1]-[!#1]=[!#1]-[#7](-[#6]-1=[#16])-[#1])-[#6]#[#7]


Now when I do Chem.MolFromSmarts, my mol representation has hydrogens at those 
three positions, so I can't sanitize the molecule: with hydrogens in the !#1 
positions, there is a valency conflict.


I confirmed that it does indeed insert hydrogens into the formula by running 
Chem.MolToSmiles on the mol_file generated previously, which returns:

[H]N1[H]=[H][H]=C(C#N)C1=S


Interestingly, changing the original SMARTS string to use * (wild card, 
any atom) in those three !#1 positions returns None.


Has anyone else encountered this problem with !#n?


Thank you





[Rdkit-discuss] possible SMARTS translating mistake?

2015-09-15 Thread Bodle, Christopher R
All,

I am working on a filtering code in python to search for substructure matches 
against my hit list (in SMILES) and my filter lists (in SMARTS).  My current 
filter lists were copied from Rajarshi Guha's blog at 
http://blog.rguha.net/?p=850.

While working on this I ran into a problem with the following SMARTS string 
from the p_l150 collection, filter pyrrole_A(118):


n2(-[#6]:1:[!#1]:[#6]:[#6]:[#6]:[#6]:1)c(cc(c2-[#6;X4])-[#1])-[#6;X4]

The problem area in the string is the [!#1] atom.  Although this should be 
interpreted as 'not H', the rendering generated from Chem.MolFromSmarts does 
indeed show a hydrogen in this position, which is in the middle of an 
aromatic ring and results in a valency issue, and as such I can't standardize 
the mol for filtering purposes.

I confirmed this by making the following edit to the SMARTS string:
n2(-[#6]:1:[!#6]:[#6]:[#6]:[#6]:[#6]:1)c(cc(c2-[#6;X4])-[#1])-[#6;X4]

Which results in a carbon in the position of the hydrogen from the original 
SMARTS.  Is this a problem with the SMARTS translator?  Or is there something 
that I am missing?

I believe this happens quite frequently.  When running a standardization code 
for the filter p_l150 (55 compounds) using:

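from rdkit import Chem
from molvs import Standardizer, standardize_smiles

p_l150['standardized mol'] = ''
imax, jmax = p_l150.shape
s = Standardizer()
for i in range(imax):
    mf = p_l150.loc[i, 'mol file']
    try:
        # Round-trip the query mol through SMILES, then standardize it
        m = Chem.MolToSmiles(mf)
        m2 = standardize_smiles(m)
        m3 = Chem.MolFromSmiles(m2)
        smol = s.standardize(m3)
        p_l150.loc[i, 'standardized mol'] = smol
    except Exception as e:
        # Report the filter name together with the error it raised
        print(p_l150.loc[i, 'filter'], e)
p_l150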

I return 11 errors, 8 of which are valency (7 of those involve hydrogens):