Re: [Rdkit-discuss] trouble with SMARTs interpretation of 'not hydrogen'
Andrew,

Thank you for the input. After you asked for a full example, I went looking for a hit compound that had escaped a PAINS flag because of an incorrect interpretation of !#n, and I couldn't find one. When I looked more closely at my sanitized PAINS flags, I found that the new sanitized filter queries were in fact flagging molecules incorrectly, for example flagging a dimethoxybenzene moiety as a catechol.

Thank you for your help with this. I will keep in mind in the future that it is inappropriate to try to sanitize SMARTS queries.

Thanks again,

Christopher R. Bodle
PhD Candidate, University of Iowa College of Pharmacy
Division of Medicinal and Natural Products Chemistry
115 S. Grand Avenue-Rm. S338
Iowa City, Iowa 52242
(319) 335-7845

From: Andrew Dalke [da...@dalkescientific.com]
Sent: Wednesday, September 16, 2015 5:23 PM
Cc: rdkit-discuss@lists.sourceforge.net
Subject: Re: [Rdkit-discuss] trouble with SMARTs interpretation of 'not hydrogen'

On Sep 16, 2015, at 9:57 PM, Bodle, Christopher R wrote:
> I am having trouble with RDKit correctly interpreting the SMARTS character
> [!#1], which should be interpreted as "any atom not hydrogen".

I've been looking at your emails but it's difficult for me to figure out what you are doing. Can you generate a smaller reproducible example?

My guess is that you are looking at the RDKit depiction of a molecule generated from a SMARTS string. This is a query molecule. As I recall, the depiction of query molecules is incomplete, and there is an open call for someone interested in generating a better query depiction.

If that's the case, then what you see is the inability of the renderer to display a "not". This shouldn't affect the ability to match a molecule.

I also don't understand this:

> My SMARTS input:
> [#6]-1(=[!#1]-[!#1]=[!#1]-[#7](-[#6]-1=[#16])-[#1])-[#6]#[#7]
>
> Now when I do Chem.MolFromSmarts, my mol representation has hydrogens at
> those three positions, and as such I can't do sanitization of the molecule
> because since it has hydrogens in the !#1 positions, there is a valency
> conflict.

It doesn't make sense to me to sanitize a molecule that came from a SMARTS query. It looks like you have tried to convert a query-based molecule into a more chemical molecule. That is, I can reproduce some of what you report by using:

>>> from rdkit import Chem
>>> mol = Chem.MolFromSmarts("[#6]-1(=[!#1]-[!#1]=[!#1]-[#7](-[#6]-1=[#16])-[#1])-[#6]#[#7]")
>>> Chem.MolToSmiles(mol)
'[H]N1[H]=[H][H]=C(C#N)C1=S'

This produces a nearly meaningless conversion. For example, consider:

>>> mol = Chem.MolFromSmarts("[#92,#93][$(N=N)]")
>>> Chem.MolToSmiles(mol)
'[*][U]'
>>> mol = Chem.MolFromSmarts("[#93,#92][$(N=N)]")
>>> Chem.MolToSmiles(mol)
'[*][Np]'

When there is a choice of atoms, it picks the first, giving 'U' and 'Np' when I swap the two element numbers. And it shows a recursive SMARTS as a '*'.

As far as I can tell, the "[!#1]" works correctly.

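The point about lossy conversion can be checked from the other direction. The following is a minimal sketch, assuming a reasonably current RDKit; the variable names are illustrative and not from the thread. It writes the query back out with Chem.MolToSmarts instead of Chem.MolToSmiles, so the "not hydrogen" logic is kept, and then confirms that the rebuilt query matches the same molecule as the original.

```python
from rdkit import Chem

# Build a query molecule from SMARTS; no sanitization is involved.
query = Chem.MolFromSmarts("C-[!#1]-C")

# MolToSmarts serializes the query logic itself, unlike MolToSmiles,
# which has to pick a single concrete atom for every query atom.
round_trip = Chem.MolToSmarts(query)

# A query rebuilt from the round-tripped string should behave the same way.
query2 = Chem.MolFromSmarts(round_trip)

amine = Chem.MolFromSmiles("CNC")   # three heavy atoms, middle one is N
ethane = Chem.MolFromSmiles("CC")   # only two heavy atoms, cannot match

print(round_trip)
print(amine.HasSubstructMatch(query), amine.HasSubstructMatch(query2))
print(ethane.HasSubstructMatch(query))
```

If a query needs to be stored or passed around as text, keeping it as SMARTS rather than SMILES avoids the kind of conversion shown in the session above.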
Here's a case where it matches an 'N':

>>> pat = Chem.MolFromSmarts("C-[!#1]-C")
>>> mol = Chem.MolFromSmiles("CNC")
>>> mol.HasSubstructMatch(pat)
True

RDKit won't parse a 2-valent hydrogen by default:

>>> mol = Chem.MolFromSmiles("C[H]C")
[00:15:07] Explicit valence for atom # 1 H, 2, is greater than permitted

but if I disable sanitization, I can show that the pattern doesn't match this molecule:

>>> mol = Chem.MolFromSmiles("C[H]C", sanitize=False)
>>> mol.HasSubstructMatch(pat)
False

And to double-check that the sanitize flag isn't doing something odd:

>>> mol = Chem.MolFromSmiles("C[N]C", sanitize=False)
>>> mol.HasSubstructMatch(pat)
True

Since the SMARTS pattern doesn't work for you, but does seem to work for me, could you give a test case which is just the SMILES/SMARTS or molfile/SMARTS combination which gives the failure? That is, without the incomplete scaffolding that you showed.

Cheers,

Andrew
da...@dalkescientific.com

___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
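The catechol false positive described at the top of this message can be illustrated with a small sketch. It rests on an assumption: that converting a query into an ordinary molecule drops the explicit [#1] requirements. The two SMARTS below are simplified stand-ins written for this example, not the actual PAINS catechol filter, and the variable names are hypothetical.

```python
from rdkit import Chem

# Illustrative query with explicit hydrogens on both oxygens.
strict = Chem.MolFromSmarts("c(-[#8]-[#1]):c-[#8]-[#1]")

# The same query with the [#1] atoms deleted: any oxygen substituent now fits.
loose = Chem.MolFromSmarts("c(-[#8]):c-[#8]")

# MergeQueryHs folds each [#1] into a hydrogen-count requirement on its
# neighbor, so the query can be used on molecules without explicit Hs.
merged = Chem.MergeQueryHs(strict)

catechol = Chem.MolFromSmiles("Oc1ccccc1O")
veratrole = Chem.MolFromSmiles("COc1ccccc1OC")  # 1,2-dimethoxybenzene

# The explicit-H query needs explicit hydrogens on the target molecule.
catechol_h = Chem.AddHs(catechol)
veratrole_h = Chem.AddHs(veratrole)

print("strict:", catechol_h.HasSubstructMatch(strict), veratrole_h.HasSubstructMatch(strict))
print("loose: ", catechol.HasSubstructMatch(loose), veratrole.HasSubstructMatch(loose))
print("merged:", catechol.HasSubstructMatch(merged), veratrole.HasSubstructMatch(merged))
```

Only the query that lost its hydrogen constraints hits the dimethoxybenzene, which is consistent with the misflagging reported above.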
Re: [Rdkit-discuss] possible SMARTS translating mistake?
All (and Greg),

After responding to Greg's email I read the email from Andrew Dalke in my other thread ("trouble with SMARTS interpretation of 'not hydrogen'"), who informed me that it is not appropriate to sanitize a molecule that comes from a SMARTS query, because this converts a query-based molecule into a more chemical molecule and the query molecule loses some of its query properties. For example, I had several molecules in the SMARTS list with the first carbon atom labeled as [c,C]. During my sanitization only c was kept, which then threw a sanitization error saying a non-aromatic molecule was labeled as aromatic.

I now believe that my initial PAINS filtration worked properly, and I just do not have very many compounds that were flagged as PAINS in this screen. I would like to test this against the RDKit in-house PAINS filters, but I ran into a problem trying to implement them. When I tried to run:

from rdkit.Chem import FilterCatalog

I got the error message:

ImportError: cannot import name FilterCatalog

Is there another package that I need to download in order to run the FilterCatalog functionality? I do not see mention of it on this page:

https://github.com/rdkit/rdkit/pull/536

Additionally, I am using Python, not C++.

Thank you all very much for your help.

Christopher R. Bodle
PhD Candidate, University of Iowa College of Pharmacy
Division of Medicinal and Natural Products Chemistry
115 S. Grand Avenue-Rm. S338
Iowa City, Iowa 52242
(319) 335-7845

____
From: Bodle, Christopher R [christopher-bo...@uiowa.edu]
Sent: Thursday, September 17, 2015 8:47 AM
To: Greg Landrum
Cc: rdkit-discuss@lists.sourceforge.net
Subject: Re: [Rdkit-discuss] possible SMARTS translating mistake?

Greg,

Thanks for the reply. I will clarify a little bit. The example provided is one of the SMARTS representations of one of the PAINS compounds from Rajarshi Guha's blog.

My goal is to filter my list of hit compounds from an HTS campaign against these PAINS filters, primarily by using the .HasSubstructMatch function in RDKit. I had already tested the filtering code with additional lists of problematic substructures found in the supplemental information of Lagorce, Beall et al. (FAF-Drugs3: a web server for compound property calculation and chemical library design), and those worked fine. For example, when I ran a filter with the Toxicophore subset, 122 of my 131 hit compounds were identified as having one or more toxicophore moieties.

When I ran the filtering code with non-standardized PAINS compounds I only got substructure matches with 3 of the 516 filter compounds. It was then suggested to me that I should try to standardize the PAINS library. To do this I found a standardizing function in the MolVS package, which is outlined here:

http://molvs.readthedocs.org/en/latest/guide/intro.html

The standardization process used by the function s.standardize in my code below is outlined lower on that page. When I filtered using the PAINS library after standardization, I had matches with 10 of the 516 filter compounds and 42 flagged compounds from the hit compound list (vs. 21 flagged compounds with the non-standardized filter list), but 201 of the 516 filter compounds did not produce a standardized mol structure.

So what I am trying to accomplish by standardizing the queries is to put them in a standardized form that would allow for better results with .HasSubstructMatch. I see now that one main reason the standardization does not work is that I take a SMARTS string containing query features and try to make it a SMILES string for the standardization. I only did this because the MolVS examples use .MolFromSmiles. So I will first try simply using .MolFromSmarts to see if that rectifies my problem. I don't see why it wouldn't, since the input for s.standardize is a mol_file.

However, if the standardization code is based on the SMILES format then there may be an issue. I will try today and report back to let the RDKit community know how it goes.

One last question: are there plans to have new rendering code for Python-based RDKit users as well?

Thank you again Greg,

Christopher R. Bodle
PhD Candidate, University of Iowa College of Pharmacy
Division of Medicinal and Natural Products Chemistry
115 S. Grand Avenue-Rm. S338
Iowa City, Iowa 52242
(319) 335-7845

From: Greg Landrum [greg.land...@gmail.com]
Sent: Wednesday, September 16, 2015 8:03 PM
To: Bodle, Christopher R
Cc: rdkit-discuss@lists.sourceforge.net
Subject: Re: [Rdkit-discuss] possible SMARTS translating mistake?

On Tue, Sep 15, 2015 at 6:48 PM, Bodle, Christopher R <christopher-bo...@uiowa.edu> wrote:

I am working on a filtering code in python to search for substructure matches against my hit list (in SMILE
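For the FilterCatalog question raised in this message, the following is a minimal sketch of how the built-in PAINS catalog can be queried from Python. It assumes an RDKit build that ships the rdkit.Chem.FilterCatalog module; the ImportError quoted above suggests the installed build did not include it, though the thread does not confirm that. The helper name pains_flags is hypothetical.

```python
from rdkit import Chem
from rdkit.Chem.FilterCatalog import FilterCatalog, FilterCatalogParams

# Build a catalog holding the PAINS filters bundled with RDKit.
params = FilterCatalogParams()
params.AddCatalog(FilterCatalogParams.FilterCatalogs.PAINS)
catalog = FilterCatalog(params)


def pains_flags(smiles):
    """Return the descriptions of all PAINS entries matching a SMILES.

    Returns None if the SMILES cannot be parsed (hypothetical helper).
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return [entry.GetDescription() for entry in catalog.GetMatches(mol)]


# Benzene is not expected to trigger any PAINS filter.
print(pains_flags("c1ccccc1"))
```

The hit molecules are parsed from SMILES and sanitized as usual; the filter queries live inside the catalog and are never converted into ordinary molecules.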
Re: [Rdkit-discuss] possible SMARTS translating mistake?
Maciek,

Thank you for the resource. I actually had based my initial troubleshooting efforts on that blog post. In retrospect I should have included that information in my original post. Here is the basic code for how I filter my hit list against a filter list.

    import numpy as np
    from rdkit import Chem

    # 'inhibitors' is a pandas DataFrame of hit compounds defined elsewhere.

    def get_compound_molfile(Compound_ID):
        # Return the mol object (column 21) of the row containing Compound_ID.
        imax, jmax = inhibitors.shape
        mol_file = []
        for i in range(imax):
            compound_data = inhibitors.iloc[i, :]
            if Compound_ID in compound_data.ravel():
                mol_file = inhibitors.iloc[i, 21]
        return mol_file

    def filter_hits(mol_file, filter_list):
        # Return the names of all filters whose query matches the hit compound.
        imax, jmax = filter_list.shape
        filter_matches = []
        mfh = Chem.AddHs(mol_file)  # explicit Hs on the hit compound
        for i in range(imax):
            fcm = filter_list.iloc[i, 2]
            fcmh = Chem.MergeQueryHs(fcm)  # merge query Hs into their neighbors
            if mfh.HasSubstructMatch(fcmh):
                filter_matches.append(filter_list.iloc[i, 1])
        if len(filter_matches) > 0:
            return str(filter_matches)
        return np.nan

    def filter_hit_list(hit_list, filter_list):
        # Write the matched filter names into the last column of a copy.
        filterd_list = hit_list.copy()
        imax, jmax = hit_list.shape
        for i in range(imax):
            Compound_ID = hit_list.iloc[i, 0]
            m = get_compound_molfile(Compound_ID)
            p = filter_hits(m, filter_list)
            filterd_list.iloc[i, jmax - 1] = str(p)
        return filterd_list

In the second function (filter_hits) I add Hs to the hit compound mol_file with Chem.AddHs, and I merge the Hs of the filter_list compound mol_file with Chem.MergeQueryHs. Since the blog mentioned in your email showed that the HasSubstructMatch function works when both inputs have their respective hydrogens in the structure representation, I decided to cover my bases and make sure I wasn't missing any hydrogens from either species.

Christopher R. Bodle
PhD Candidate, University of Iowa College of Pharmacy
Division of Medicinal and Natural Products Chemistry
115 S. Grand Avenue-Rm. S338
Iowa City, Iowa 52242
(319) 335-7845

From: Maciek Wójcikowski [mac...@wojcikowski.pl]
Sent: Wednesday, September 16, 2015 3:22 AM
To: Bodle, Christopher R
Cc: rdkit-discuss@lists.sourceforge.net
Subject: Re: [Rdkit-discuss] possible SMARTS translating mistake?

Hi Christopher,

Since you're mentioning Rajarshi's SMARTS, I guess that you haven't seen Greg's latest revision of PAINS filters (see http://rdkit.blogspot.com.es/2015/08/curating-pains-filters.html).

On the other hand, during the RDKit UGM I remember Greg saying that some of the filters would require changes to RDKit's aromatic model, and this one seems to be the case (Greg might confirm/check?).

Best,
Maciej

2015-09-15 18:48 GMT+02:00 Bodle, Christopher R <christopher-bo...@uiowa.edu>:

All,

I am working on a filtering code in python to search for substructure matches against my hit list (in SMILES) and my filter lists (in SMARTS). My current filter lists were copied from Rajarshi Guha's blog at http://blog.rguha.net/?p=850. While working on this I was working with the following SMARTS string from the p_l150 collection, filter purrole_A(118):

n2(-[#6]:1:[!#1]:[#6]:[#6]:[#6]:[#6]:1)c(cc(c2-[#6;X4])-[#1])-[#6;X4]

I have highlighted the problem area in the string (the [!#1] atom). Although this should be interpreted as 'not H', the rendering generated from Chem.MolFromSmarts does indeed result in a hydrogen in this position, which is in the middle of an aromatic ring and results in a valency issue, and as such I can't standardize the mol for filtering purposes. I confirmed this by making the following edit to the SMARTS string:

n2(-[#6]:1:[!#6]:[#6]:[#6]:[#6]:[#6]:1)c(cc(c2-[#6;X4])-[#1])-[#6;X4]

which results in a carbon in the position of the hydrogen from the original SMARTS. Is this a problem with the SMARTS translator? Or is there something that I am missing?

I believe this happens quite frequently. When running a standardization code for the filter p_l150 (55 compounds) using:

    from rdkit import Chem
    from molvs import Standardizer, standardize_smiles

    p_l150['standardized mol'] = ''
    imax, jmax = p_l150.shape
    s = Standardizer()
    for i in range(imax):
        mf = p_l150.loc[i, 'mol file']
        try:
            m = Chem.MolToSmiles(mf)  # query mol written out as SMILES
            m2 = standardize_smiles(m)
            m3 = Chem.MolFromSmiles(m2)
            smol = s.standardize(m3)
            p_l150.loc[i, 'standardized mol'] = smol
        except Exception as e:
            print(p_l150.loc[i, 'filter'], e)
    p_l150

I return 11 errors, 8 of which are valency (7 of those involve hydrogens):
[Rdkit-discuss] trouble with SMARTs interpretation of 'not hydrogen'
All,

I touched on this subject yesterday, but wanted to add some more information today as I haven't received a response yet. I am having trouble with RDKit correctly interpreting the SMARTS character [!#1], which should be interpreted as "any atom not hydrogen". Let me give you an example.

My SMARTS input:

[#6]-1(=[!#1]-[!#1]=[!#1]-[#7](-[#6]-1=[#16])-[#1])-[#6]#[#7]

Now when I do Chem.MolFromSmarts, my mol representation has hydrogens at those three positions, and as such I can't sanitize the molecule: because it has hydrogens in the !#1 positions, there is a valency conflict. I confirmed that it does indeed insert hydrogens into the formula by performing Chem.MolToSmiles on the mol_file generated previously, which returns:

[H]N1[H]=[H][H]=C(C#N)C1=S

Interestingly, changing the original SMARTS string to use * (wildcard, any atom) in those three !#1 positions returns None.

Has anyone else encountered this problem with !#n?

Thank you,

Christopher R. Bodle
PhD Candidate, University of Iowa College of Pharmacy
Division of Medicinal and Natural Products Chemistry
115 S. Grand Avenue-Rm. S338
Iowa City, Iowa 52242
(319) 335-7845
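A None return from Chem.MolFromSmarts, as mentioned for the wildcard variant above, means the string could not be parsed at all; it is a different situation from a query that parses but fails to match. The sketch below shows one way to separate the two when loading a filter list. The helper parse_filters and the three example entries are hypothetical, and the broken entry is deliberately malformed for illustration.

```python
from rdkit import Chem


def parse_filters(named_smarts):
    """Split (name, SMARTS) pairs into parsed queries and parse failures."""
    queries, failed = {}, []
    for name, smarts in named_smarts:
        query = Chem.MolFromSmarts(smarts)
        if query is None:
            # MolFromSmarts signals a syntax problem by returning None.
            failed.append(name)
        else:
            queries[name] = query
    return queries, failed


# Hypothetical entries: two valid patterns and one with an unclosed bracket.
filters = [
    ("not_hydrogen", "C-[!#1]-C"),
    ("wildcard", "C-*-C"),
    ("broken", "C-[!#1"),
]

queries, failed = parse_filters(filters)
print(sorted(queries), failed)

# The parsed queries are used directly for matching, with no sanitization.
mol = Chem.MolFromSmiles("CNC")
print({name: mol.HasSubstructMatch(q) for name, q in queries.items()})
```

Checking for None at load time keeps syntax errors in a filter list from being mistaken for genuine non-matches later on.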
[Rdkit-discuss] possible SMARTS translating mistake?
All,

I am working on a filtering code in python to search for substructure matches against my hit list (in SMILES) and my filter lists (in SMARTS). My current filter lists were copied from Rajarshi Guha's blog at http://blog.rguha.net/?p=850. While working on this I was working with the following SMARTS string from the p_l150 collection, filter purrole_A(118):

n2(-[#6]:1:[!#1]:[#6]:[#6]:[#6]:[#6]:1)c(cc(c2-[#6;X4])-[#1])-[#6;X4]

I have highlighted the problem area in the string (the [!#1] atom). Although this should be interpreted as 'not H', the rendering generated from Chem.MolFromSmarts does indeed result in a hydrogen in this position, which is in the middle of an aromatic ring and results in a valency issue, and as such I can't standardize the mol for filtering purposes. I confirmed this by making the following edit to the SMARTS string:

n2(-[#6]:1:[!#6]:[#6]:[#6]:[#6]:[#6]:1)c(cc(c2-[#6;X4])-[#1])-[#6;X4]

which results in a carbon in the position of the hydrogen from the original SMARTS. Is this a problem with the SMARTS translator? Or is there something that I am missing?

I believe this happens quite frequently. When running a standardization code for the filter p_l150 (55 compounds) using:

    from rdkit import Chem
    from molvs import Standardizer, standardize_smiles

    p_l150['standardized mol'] = ''
    imax, jmax = p_l150.shape
    s = Standardizer()
    for i in range(imax):
        mf = p_l150.loc[i, 'mol file']
        try:
            m = Chem.MolToSmiles(mf)  # query mol written out as SMILES
            m2 = standardize_smiles(m)
            m3 = Chem.MolFromSmiles(m2)
            smol = s.standardize(m3)
            p_l150.loc[i, 'standardized mol'] = smol
        except Exception as e:
            print(p_l150.loc[i, 'filter'], e)
    p_l150

I return 11 errors, 8 of which are valency (7 of those involve hydrogens):