Re: [Rdkit-discuss] smarts vs smiles database queries and explicit hydrogens

Greg Landrum Wed, 23 Nov 2016 02:43:56 -0800

Hi Alex,

The new version of the cartridge has some capabilities that, I think,
address this.


There's a blog post about this: http://rdkit.blogspot.com/
2016/07/tuning-substructure-queries-ii.html
but the short version is that you can do the kind of queries it seems like
you want to do quite simply:

chembl_21=# select * from rdk.mols where
m@>mol_adjust_query_properties('*c1ncccn1')
limit 3;
 molregno |                                           m

----------+-------------------------------------------------
--------------------------------------
   601707 | CCCc1nc(-c2ccc(F)cc2)oc1C(=O)NC(CC)CN1CCN(c2ncccn2)CC1
   289103 | CC1C(=N)/C(=N/Nc2ccc(S(=O)(=O)Nc3ncccn3)cc2)C(=O)C(C)C1=O
   607646 | CCNC(=O)[C@@H]1OC(n2cnc3c(NC(=O)Nc4ccc(S(=O)(=O)Nc5ncccn5)
cc4)ncnc32)[C@@H](O)[C@H]1O
(3 rows)

chembl_21=# select * from rdk.mols where
m@>mol_adjust_query_properties('*c1nc(*)ccn1')
limit 3;
 molregno |                           m
----------+-------------------------------------------------------
   158659 | CCNc1nccc(-c2c(-c3ccc(F)cc3)ncn2C2CCN(C)CC2)n1
   158743 | Nc1nccc(-c2c(-c3ccc(F)cc3)ncn2C2CCN(Cc3ccccc3)CC2)n1
   158843 | CC1(C)CC(n2cnc(-c3ccc(F)cc3)c2-c2ccnc(N)n2)CC(C)(C)N1
(3 rows)

chembl_21=# select * from rdk.mols where
m@>mol_adjust_query_properties('*c1nc(*)cc(*)n1')
limit 3;
 molregno |                                    m

----------+-------------------------------------------------
-------------------------
   726443 | CN=C(S)NNc1nc(C)cc(C)n1
   561136 | C[C@H](Nc1cc(NC2CCCCCC2)nc(C(F)(F)F)n1)[C@@H](Cc1ccc(Cl)
cc1)c1cccc(Br)c1
   205784 | CCN(CC)C(=O)CSc1nc(N)cc(Cl)n1
(3 rows)

There's more detail in the blog post, but the default behavior is to
convert dummies into generic query atoms and to constrain the substitution
at any other *ring* position.

Best Regards,
-greg


On Wed, Nov 23, 2016 at 9:20 AM, Alexander Klenner-Bajaja <[email protected]>
wrote:

> Hi all,
>
>
>
> I am currently exploring the possibilities of the RDKit database cartridge
> for substructure search- I installed everything following the  tutorial
> from http://www.rdkit.org/docs/Install.html
>
>
>
> Very nice tutorial  - worked perfectly fine.
>
>
>
> Since we are exploring solutions for browser based gui searches I created
> a test page using Ketcher (http://lifescience.opensource.epam.com/ketcher/)
> which communicates with the database through PHP.
>
>
>
> Ketcher returns a SMILES representation from the drawn molecule. The raw
> data of the molecules in the database are canonical SMILES created from
> RDKIT canonical SMILES from the rdkit KNIME node (they are text-mined from
> patents).
>
>
>
> When doing substructure searches, as long as we query for well-defined
> compounds the results make sense – however looking at R1,…-groups things
> get a little odd.
>
>
>
> I found a very old discussion on the mailing list from 2009 where this has
> been discussed and I understood from that dialog that when looking at
> SMILES with a “*” representation this is interpreted as a dummy atom and
> the same dummy atom is expected in the search space to produce a hit. While
> a SMARTS representation of the same string actually leads to the behaviour
> that “any atom” is matched at that position.
>
>
>
> I ended up with the very cumbersome query, I am sure there are more
> elegant ways of doing this using ::qmol notation, but as I said I am
> currently exploring J
>
>
>
> That’s the query (in PHP) in question for PostgreSQL:
>
>
>
> *$search_result = pg_query($dbconn, "select m from pat.mols where
> m@>mol_from_smarts(mol_to_smiles(mol_from_smiles('".$_POST['smiles']. "')))
> LIMIT 20;"); *
>
>
>
> Extracting rdkit functionality leaves me with:
>
>
>
> *m@>mol_from_smarts(mol_to_smiles(mol_from_smiles('".$_POST['smiles'].
> "')))*
>
> and adding a smiles string to make it more readable:
>
>
>
> *m@>mol_from_smarts(mol_to_smiles(mol_from_smiles('* C([*])1=CC=CC=C1*')))
> (This is how Ketcher creates the smiles string, using explicit double
> bonds)*
>
>
>
> This query does actually work and returns structures that are correct
> (visually inspected a few examples)
>
>
>
> The same query without all the molecule conversion methods does not return
> anything
>
>
>
> *m@>'* C([*])1=CC=CC=C1*'*
>
>
>
> I guess the reason for this is that the default interpretation is smiles
> and it is looking for actual dummy atoms in the database (there are none).
>
>
>
> That’s my first question: Is this assumption correct?
>
>
>
> My next issue is a query with explicit hydrogens:
>
>
>
> Using
>
>
>
> *“C([*])1=C([H])C([H])=C([H])C([H])=C1[H]” *
>
>
>
> as a query with the all the molecule conversion as shown above to make
> SMARTS happen, returns among others:
>
>
>
> *“C(C)1=CC=C(C)C=C1”*
>
>
>
> Which is correct for implicit hydrogens but not for explicit – so my guess
> is they are lost.
>
>
>
> Can I enforce at query time against the cartridge to work with explicit
> hydrogens so that only molecules are found that have different substitutes
> at the “*” position?
>
>
>
> I could not find a pre-defined function for that.
>
>
>
> Thank you very much for any hints or solutions,
>
>
>
> Best regards,
>
>
>
> Alex
>
>
>
>
>
>
>
> Best regards / Mit freundlichen Grüßen / Sincères salutations
>
>
>
> Dr. Alexander Garvin Klenner-Bajaja
>
> Administrator Requirements Engineering-Solution Design | Dir. 2.8.3.3
>
> European Patent Office
>
> Patentlaan 3-9 | 2288 EE Rijswijk | The Netherlands
> Tel. +31(0)70340-1991
>
> [email protected]
>
> www.epo.org
>
>
>
> Please consider the environment before printing this email.
>
>
>
> ------------------------------------------------------------
> ------------------
>
> _______________________________________________
> Rdkit-discuss mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
>

------------------------------------------------------------------------------

_______________________________________________
Rdkit-discuss mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

Re: [Rdkit-discuss] smarts vs smiles database queries and explicit hydrogens

Reply via email to