On Tue, Sep 15, 2015 at 6:48 PM, Bodle, Christopher R <
christopher-bo...@uiowa.edu> wrote:

>
> I am working on a filtering code in python to search for substructure
> matches against my hit list (in SMILES) and my filter lists (in SMARTS).
> My current filter lists were copied from Rajarshi Guha's blog at
> http://blog.rguha.net/?p=850.
>

The topic of the new version of these SMARTS and the associated blog post
came up elsewhere in the thread, so I don't need to raise that again.

> While working on this I was working with the following SMARTS string from
> the p_l150 collection, filter pyrrole_A(118):
>
> n2(-[#6]:1:[!#1]:[#6]:[#6]:[#6]:[#6]:1)c(cc(c2-[#6;X4])-[#1])-[#6;X4]
>
>
> I have highlighted the problem area in the string.  Although this should
> be interpreted as 'not H', the rendering generated from Chem.MolFromSmarts
> does indeed result in a hydrogen in this position,
>

The rendering of molecules is not necessarily the best way to discover what
the RDKit actually thinks they are. The rendering code uses the atomic
number to choose atom labels, which doesn't make a lot of sense for
queries. The C++-based rendering in the new version will do a more explicit
job here by indicating that there is a query feature. There's an example in
"Out [22]" here:
https://github.com/rdkit/UGM_2015/blob/master/Notebooks/Whats_new.ipynb
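
To look at this without drawing anything, here is a minimal sketch that
prints what each atom of a small query carries. The example SMARTS is only
for illustration, and I make no claim here about which atomic number ends up
stored for a negated query like [!#1]; the point is just to compare it with
the query itself:

```python
from rdkit import Chem

# Small illustrative query containing the [!#1] feature.
patt = Chem.MolFromSmarts('[#6][!#1][#6]')

atom_smarts = []
for atom in patt.GetAtoms():
    # GetAtomicNum() is the value an atom label would be derived from;
    # GetSmarts() shows the query the atom actually carries.
    print(atom.GetIdx(), atom.HasQuery(), atom.GetAtomicNum(), atom.GetSmarts())
    atom_smarts.append(atom.GetSmarts())
```

The last column is what gets used for matching, whatever label a depiction
shows.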

For molecules with query features, the best way to see what the RDKit is
doing is to convert them back to SMARTS. In your case this gives:

In [2]: m = Chem.MolFromSmarts('n2(-[#6]:1:[!#1]:[#6]:[#6]:[#6]:[#6]:1)c(cc(c2-[#6;X4])-[#1])-[#6;X4]')

In [3]: Chem.MolToSmarts(m)
Out[3]: 'n1(-[#6]2:[!#1]:[#6]:[#6]:[#6]:[#6]:2):,-c(-,:c:,-c(-,:c:,-1-[#6&X4])-[#1])-[#6&X4]'

In [4]: nm = Chem.MergeQueryHs(m)

In [5]: Chem.MolToSmarts(nm)
Out[5]: 'n1(-[#6]2:[!#1]:[#6]:[#6]:[#6]:[#6]:2):,-c(-,:c:,-[c&!H0]-,:c:,-1-[#6&X4])-[#6&X4]'

You can see that the [!#1] is preserved both before and after the
MergeQueryHs() call.

You can also test that the query is working properly by running it against
molecules that you know should and should not match. Here's a simple
example that includes [!#1]:

In [6]: sm  = Chem.MolFromSmarts('[#6][!#1][#6]')

In [7]: Chem.MolToSmarts(sm)
Out[7]: '[#6]-,:[!#1]-,:[#6]'

In [8]: Chem.MolFromSmiles('CNC').GetSubstructMatch(sm)
Out[8]: (0, 1, 2)

In [9]: Chem.MolFromSmiles('CCC').GetSubstructMatch(sm)
Out[9]: (0, 1, 2)

In [10]: Chem.MolFromSmiles('C[H-]C').GetSubstructMatch(sm)
Out[10]: ()

> which is in the middle of an aromatic ring and results in a valency issue
> and as such I can't standardize the mol for filtering purposes.
>

I think this "standardization" is what's causing the problems.


> I confirmed this by making the following edit to the SMILES string:
> n2(-[#6]:1:[!#6]:[#6]:[#6]:[#6]:[#6]:1)c(cc(c2-[#6;X4])-[#1])-[#6;X4]
>
> Which results in a carbon in the position of the hydrogen from the
> original SMARTS.  Is this a problem with the SMARTS translator?  Or is
> there something that I am missing?
>

The above string is a SMARTS. It is not a valid SMILES since it contains
query features. A molecule constructed from it using MolFromSmarts() also
contains query features and so when you try to convert it to SMILES (as
your code below does), you don't get an accurate representation.
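
If the goal is simply to flag hits that match a filter, the query molecule
can be used directly, with no SMILES round trip. Here is a minimal sketch,
assuming a hypothetical hit (2,5-dimethyl-1-phenylpyrrole, chosen only for
illustration) and using MergeQueryHs() so that the explicit [#1] in the
pattern can match an implicit hydrogen on the hit:

```python
from rdkit import Chem

# The pyrrole_A(118) SMARTS quoted earlier in this thread.
sma = 'n2(-[#6]:1:[!#1]:[#6]:[#6]:[#6]:[#6]:1)c(cc(c2-[#6;X4])-[#1])-[#6;X4]'
patt = Chem.MolFromSmarts(sma)

# Fold the explicit query hydrogen into its neighbor (it becomes [c&!H0]).
merged = Chem.MergeQueryHs(patt)

# Hypothetical hit, for illustration only: 2,5-dimethyl-1-phenylpyrrole.
hit = Chem.MolFromSmiles('Cc1ccc(C)n1-c1ccccc1')

# The raw pattern needs a real H atom, which the hit does not have.
match_raw = hit.HasSubstructMatch(patt)       # False
# The merged pattern only requires at least one H on that ring carbon.
match_merged = hit.HasSubstructMatch(merged)  # True
print(match_raw, match_merged)
```

The unmerged pattern fails only because molecules built from SMILES normally
carry their hydrogens implicitly; neither version needs to be sanitized or
written out as SMILES.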

What are you trying to accomplish by standardizing the queries and what are
you actually doing when you standardize them?

Best,
-greg



> I believe this happens quite frequently.  When running a standardization
> code for the filter p_l150 (55 compounds) using:
>
> p_l150['standardized mol']=''
> imax,jmax = p_l150.shape
> for i in range(imax):
>     mol_file =mf= p_l150.loc[i,'mol file']
>     s = Standardizer()
>     try:
>         m = Chem.MolToSmiles(mf)
>         m2 = standardize_smiles(m)
>         m3 = Chem.MolFromSmiles(m2)
>         smol = s.standardize(m3)
>         p_l150.loc[i,'standardized mol'] = smol
>     except Exception as e:
>         print p_l150.loc[i,'filter'], e
> p_l150
>
> I return 11 errors, 8 of which are valency (7 of those involve hydrogens):
>
> <regId="pyrrole_A(118)"> Sanitization error: Explicit valence for atom # 8 H, 
> 3, is greater than permitted
> <regId="imine_one_fives(89)"> Sanitization error: Explicit valence for atom # 
> 3 H, 3, is greater than permitted
> <regId="hzone_pipzn(79)"> Sanitization error: Explicit valence for atom # 3 
> H, 2, is greater than permitted
> <regId="hzone_pyrrol(64)"> Sanitization error: Can't kekulize mol
>
> <regId="cyano_pyridone_A(54)"> Sanitization error: Explicit valence for atom 
> # 1 H, 3, is greater than permitted
> <regId="het_pyridiniums_A(39)"> Sanitization error: Explicit valence for atom 
> # 5 H, 3, is greater than permitted
> <regId="diazox_sulfon_A(36)"> Sanitization error: Explicit valence for atom # 
> 14 C, 5, is greater than permitted
> <regId="pyrrole_B(29)"> Sanitization error: Explicit valence for atom # 9 H, 
> 3, is greater than permitted
> <regId="thiophene_hydroxy(28)"> Sanitization error: Can't kekulize mol
>
> <regId="imidazole_A(19)"> Sanitization error: Explicit valence for atom # 4 
> H, 2, is greater than permitted
> <regId="het_6_tetrazine(18)"> Sanitization error: Aromatic bonds on non 
> aromatic atom 1
>
>
> Any insight would be greatly appreciated.
>
>
> Thank you
>
>
> Christopher R. Bodle
>
> PhD Candidate, University of Iowa
>
> College of Pharmacy
>
> Division of Medicinal and Natural Products Chemistry
>
> 115 S. Grand Avenue-Rm. S338
>
> Iowa City, Iowa 52242
>
> (319) 335-7845
>
>
>
>
> ------------------------------------------------------------------------------
>
> _______________________________________________
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
>