Re: [Rdkit-discuss] smarts vs smiles database queries and explicit hydrogens

Markus Sitzmann Wed, 23 Nov 2016 06:40:58 -0800

If I understood Greg correctly, it will be in 2016.09, which isn't in conda just 
yet; they are currently working on putting it there.
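
For anyone scripting this check, here is a minimal Python sketch that compares an installed RDKit release string against the release assumed to contain the function. The minimum release "2016.09" is taken from the statement above, not verified independently, and the helper names `parse_release` and `has_adjust_query_properties` are hypothetical.

```python
import re

# Assumption taken from this thread: the function first ships with 2016.09.
MIN_RELEASE = "2016.09"


def parse_release(version):
    """Turn '2016.03.4' into (2016, 3, 4), stopping at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def has_adjust_query_properties(installed, minimum=MIN_RELEASE):
    """True if the installed release is at least the assumed minimum release."""
    return parse_release(installed) >= parse_release(minimum)


# The release that conda reported in this thread, then a later one.
print(has_adjust_query_properties("2016.03.4"))
print(has_adjust_query_properties("2016.09.1"))
```

Tuples compare element by element, so the month and patch components are ordered numerically rather than as strings.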


Markus

-------------------------------------
|  Markus Sitzmann
|  markus.sitzm...@gmail.com

> On 23 Nov 2016, at 15:29, Alexander Klenner-Bajaja <aklen...@epo.org> wrote:
> 
> Dear Greg,
>  
> Thank you very much. Looking at the results, that function is exactly what I 
> was looking for; I just can’t find it in my updated Anaconda installation.
>  
> “conda update rdkit” tells me I have the latest version, 2016.03.4, and 
> PostgreSQL tells me I have version 3.4 of the RDKit extension.
>  
> If I understand your blog post correctly, it should be in the 2016.03 version. 
> What am I missing?
>  
>  
> Best,
>  
> Alex
>  
>  
>  
> From: Greg Landrum [mailto:greg.land...@gmail.com] 
> Sent: Wednesday, November 23, 2016 11:42 AM
> To: Alexander Klenner-Bajaja
> Cc: rdkit-discuss@lists.sourceforge.net
> Subject: Re: [Rdkit-discuss] smarts vs smiles database queries and explicit 
> hydrogens
>  
> Hi Alex,
>  
> The new version of the cartridge has some capabilities that, I think, address 
> this.
>  
> There's a blog post about this: 
> http://rdkit.blogspot.com/2016/07/tuning-substructure-queries-ii.html
> but the short version is that you can do the kind of queries it seems like 
> you want to do quite simply:
>  
> chembl_21=# select * from rdk.mols where m@>mol_adjust_query_properties('*c1ncccn1') limit 3;
>  molregno |                                           m
> ----------+---------------------------------------------------------------------------------------
>    601707 | CCCc1nc(-c2ccc(F)cc2)oc1C(=O)NC(CC)CN1CCN(c2ncccn2)CC1
>    289103 | CC1C(=N)/C(=N/Nc2ccc(S(=O)(=O)Nc3ncccn3)cc2)C(=O)C(C)C1=O
>    607646 | CCNC(=O)[C@@H]1OC(n2cnc3c(NC(=O)Nc4ccc(S(=O)(=O)Nc5ncccn5)cc4)ncnc32)[C@@H](O)[C@H]1O
> (3 rows)
>  
> chembl_21=# select * from rdk.mols where m@>mol_adjust_query_properties('*c1nc(*)ccn1') limit 3;
>  molregno |                           m                           
> ----------+-------------------------------------------------------
>    158659 | CCNc1nccc(-c2c(-c3ccc(F)cc3)ncn2C2CCN(C)CC2)n1
>    158743 | Nc1nccc(-c2c(-c3ccc(F)cc3)ncn2C2CCN(Cc3ccccc3)CC2)n1
>    158843 | CC1(C)CC(n2cnc(-c3ccc(F)cc3)c2-c2ccnc(N)n2)CC(C)(C)N1
> (3 rows)
>  
> chembl_21=# select * from rdk.mols where m@>mol_adjust_query_properties('*c1nc(*)cc(*)n1') limit 3;
>  molregno |                                    m
> ----------+--------------------------------------------------------------------------
>    726443 | CN=C(S)NNc1nc(C)cc(C)n1
>    561136 | C[C@H](Nc1cc(NC2CCCCCC2)nc(C(F)(F)F)n1)[C@@H](Cc1ccc(Cl)cc1)c1cccc(Br)c1
>    205784 | CCN(CC)C(=O)CSc1nc(N)cc(Cl)n1
> (3 rows)
>  
> There's more detail in the blog post, but the default behavior is to convert 
> dummies into generic query atoms and to constrain the substitution at any 
> other *ring* position.
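
When a query like the ones above is issued from application code with a user-drawn structure, the SMILES is better passed as a bound parameter than pasted into the SQL text (the PHP query quoted further down in this thread concatenates `$_POST['smiles']` directly). The following is a minimal Python sketch of that idea. It only composes the SQL string and its parameters and does not open a connection. The helper name `build_adjusted_query` is hypothetical, and the `%s` placeholder style assumes a driver such as psycopg2.

```python
# SQL text modelled on the psql examples above; %s marks a bound parameter
# (assumed placeholder style, as accepted by psycopg2).
ADJUSTED_QUERY = (
    "select * from rdk.mols "
    "where m@>mol_adjust_query_properties(%s) "
    "limit %s"
)


def build_adjusted_query(smiles, limit=3):
    """Return (sql, params) for a substructure search with an adjusted query."""
    # Drop surrounding whitespace that a web form or sketcher may add.
    smiles = smiles.strip()
    if not smiles:
        raise ValueError("empty SMILES")
    return ADJUSTED_QUERY, (smiles, int(limit))


sql, params = build_adjusted_query("*c1nc(*)ccn1")
print(sql)
print(params)
# With a driver such as psycopg2 this pair would be passed as
# cursor.execute(sql, params).
```

Keeping the user input out of the SQL text means quote characters in the submitted string cannot change the statement that is executed.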
>  
> Best Regards,
> -greg
>  
>  
> On Wed, Nov 23, 2016 at 9:20 AM, Alexander Klenner-Bajaja <aklen...@epo.org> 
> wrote:
> Hi all,
>  
> I am currently exploring the possibilities of the RDKit database cartridge 
> for substructure search. I installed everything following the tutorial at 
> http://www.rdkit.org/docs/Install.html
>  
> Very nice tutorial; it worked perfectly.
>  
> Since we are exploring solutions for browser-based GUI searches, I created a 
> test page using Ketcher (http://lifescience.opensource.epam.com/ketcher/) 
> which communicates with the database through PHP.
>  
> Ketcher returns a SMILES representation of the drawn molecule. The molecules 
> in the database are stored as canonical SMILES generated by the RDKit KNIME 
> node (the structures are text-mined from patents).
>  
> When doing substructure searches, as long as we query for well-defined 
> compounds the results make sense. However, with R-groups (R1, …) things get 
> a little odd.
>  
> I found a very old discussion of this on the mailing list, from 2009. I 
> understood from it that a “*” in a SMILES is interpreted as a dummy atom, and 
> that the same dummy atom must be present in the search space to produce a 
> hit, whereas in a SMARTS representation of the same string “any atom” is 
> matched at that position.
>  
> I ended up with a very cumbersome query. I am sure there are more elegant 
> ways of doing this using the ::qmol notation, but as I said I am currently 
> exploring :)
>  
> This is the query in question (in PHP, for PostgreSQL):
>  
> $search_result = pg_query($dbconn, "select m from pat.mols where 
> m@>mol_from_smarts(mol_to_smiles(mol_from_smiles('".$_POST['smiles']. "'))) 
> LIMIT 20;");
>  
> Extracting rdkit functionality leaves me with:
>  
> m@>mol_from_smarts(mol_to_smiles(mol_from_smiles('".$_POST['smiles']. "')))
> and adding a smiles string to make it more readable:
>  
> m@>mol_from_smarts(mol_to_smiles(mol_from_smiles(' C([*])1=CC=CC=C1')))   
> (This is how Ketcher creates the smiles string, using explicit double bonds)
>  
> This query does actually work and returns structures that are correct 
> (I visually inspected a few examples).
>  
> The same query without all the molecule conversion methods does not return 
> anything:
>  
> m@>' C([*])1=CC=CC=C1'
>  
> I guess the reason for this is that the default interpretation is SMILES, so 
> it looks for actual dummy atoms in the database (there are none).
>  
> That’s my first question: Is this assumption correct?
>  
> My next issue is a query with explicit hydrogens:
>  
> Using
>  
> “C([*])1=C([H])C([H])=C([H])C([H])=C1[H]”
>  
> as a query, with all the molecule conversions shown above to turn it into 
> SMARTS, returns among others:
>  
> “C(C)1=CC=C(C)C=C1”
>  
> This is correct for implicit hydrogens but not for explicit ones, so my guess 
> is that the explicit hydrogens are lost.
>  
> Can I force the cartridge at query time to work with explicit hydrogens, so 
> that the only molecules found are those that differ in their substituents at 
> the “*” position?
>  
> I could not find a pre-defined function for that.
>  
> Thank you very much for any hints or solutions,
>  
> Best regards,
>  
> Alex
>  
>  
>  
> Best regards / Mit freundlichen Grüßen / Sincères salutations
>  
> Dr. Alexander Garvin Klenner-Bajaja
> Administrator Requirements Engineering-Solution Design | Dir. 2.8.3.3
> European Patent Office
> Patentlaan 3-9 | 2288 EE Rijswijk | The Netherlands
> Tel. +31(0)70340-1991
> aklen...@epo.org
> www.epo.org
>  
> Please consider the environment before printing this email.
>  
> 
> ------------------------------------------------------------------------------
> 
> _______________________________________________
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

