[Rdkit-discuss] 2016.09 (Q3 2016) RDKit Release

2016-11-23 Thread Greg Landrum
Dear all,

I'm pleased to announce that the next version of the RDKit -- 2016.09
(a.k.a. Q3 2016) -- is released. This one is even later than usual since
the RDKit UGM was quite late this year. And then we hit some problems with
the Python 2.7 builds on Windows.

The release notes are below.

The release files are on the github release page:
https://github.com/rdkit/rdkit/releases/tag/Release_2016_09_2
Unless there is demand for it, I do not plan to create Windows binaries
this time. Python binaries are best installed using conda and the Java
binaries were very rarely downloaded.

We are in the process of updating the conda build scripts to reflect the
new version and uploading the binaries to anaconda.org
(https://anaconda.org/rdkit).

Some things that will be finished over the next couple of days:
- The conda build scripts will be updated to reflect the new version and
new conda builds will be available in the RDKit channel at anaconda.org
(https://anaconda.org/rdkit).
- The homebrew script
- The online version of the documentation at rdkit.org

Thanks to everyone who submitted bug reports and suggestions for
this release!

Please let me know if you find any problems with the release or have
suggestions for the next one, which is scheduled for March 2017.

Best Regards,
-greg
--
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


[Rdkit-discuss] smarts vs smiles database queries and explicit hydrogens

2016-11-23 Thread Alexander Klenner-Bajaja
Hi all,

I am currently exploring the possibilities of the RDKit database cartridge for
substructure search. I installed everything following the tutorial at
http://www.rdkit.org/docs/Install.html

Very nice tutorial - worked perfectly fine.

Since we are exploring solutions for browser-based GUI searches I created a
test page using Ketcher (http://lifescience.opensource.epam.com/ketcher/) which
communicates with the database through PHP.

Ketcher returns a SMILES representation of the drawn molecule. The raw data
for the molecules in the database are canonical SMILES generated by the RDKit
KNIME node (they are text-mined from patents).

When doing substructure searches, as long as we query for well-defined
compounds the results make sense - however, with R-groups (R1, ...) things get
a little odd.

I found a very old discussion on the mailing list from 2009 where this was
discussed. I understood from that dialog that a "*" in a SMILES is interpreted
as a dummy atom, and the same dummy atom is expected in the search space to
produce a hit, while in a SMARTS representation of the same string "any atom"
is matched at that position.

I ended up with a very cumbersome query. I am sure there are more elegant
ways of doing this using ::qmol notation, but as I said I am currently
exploring :)

That's the query (in PHP) in question for PostgreSQL:

$search_result = pg_query($dbconn, "select m from pat.mols where 
m@>mol_from_smarts(mol_to_smiles(mol_from_smiles('".$_POST['smiles']. "'))) 
LIMIT 20;");

Extracting rdkit functionality leaves me with:

m@>mol_from_smarts(mol_to_smiles(mol_from_smiles('".$_POST['smiles']. "')))
and adding a SMILES string to make it more readable:

m@>mol_from_smarts(mol_to_smiles(mol_from_smiles('C([*])1=CC=CC=C1')))
(This is how Ketcher creates the SMILES string, using explicit double bonds.)

This query does actually work and returns structures that are correct (I
visually inspected a few examples).

The same query without all the molecule conversion functions does not return
anything:

m@>'C([*])1=CC=CC=C1'

I guess the reason for this is that the default interpretation is SMILES, so it
is looking for actual dummy atoms in the database (there are none).

That's my first question: Is this assumption correct?

My next issue is a query with explicit hydrogens:

Using

"C([*])1=C([H])C([H])=C([H])C([H])=C1[H]"

as a query, with all the molecule conversions shown above to turn it into
SMARTS, returns among others:

"C(C)1=CC=C(C)C=C1"

Which is correct for implicit hydrogens but not for explicit ones - so my guess
is that they are lost.

Can I force the cartridge at query time to respect explicit hydrogens, so that
only molecules that differ in the substituent at the "*" position are found?

I could not find a pre-defined function for that.

Thank you very much for any hints or solutions,

Best regards,

Alex



Best regards / Mit freundlichen Grüßen / Sincères salutations

Dr. Alexander Garvin Klenner-Bajaja
Administrator Requirements Engineering-Solution Design | Dir. 2.8.3.3
European Patent Office
Patentlaan 3-9 | 2288 EE Rijswijk | The Netherlands
Tel. +31(0)70340-1991
aklen...@epo.org
www.epo.org



Re: [Rdkit-discuss] smarts vs smiles database queries and explicit hydrogens

2016-11-23 Thread Greg Landrum
Hi Alex,

The new version of the cartridge has some capabilities that, I think,
address this.

There's a blog post about this:
http://rdkit.blogspot.com/2016/07/tuning-substructure-queries-ii.html
but the short version is that you can do the kind of queries it seems like
you want to do quite simply:

chembl_21=# select * from rdk.mols where m@>mol_adjust_query_properties('*c1ncccn1') limit 3;
 molregno |                                           m
----------+---------------------------------------------------------------------------------------
   601707 | CCCc1nc(-c2ccc(F)cc2)oc1C(=O)NC(CC)CN1CCN(c2ncccn2)CC1
   289103 | CC1C(=N)/C(=N/Nc2ccc(S(=O)(=O)Nc3ncccn3)cc2)C(=O)C(C)C1=O
   607646 | CCNC(=O)[C@@H]1OC(n2cnc3c(NC(=O)Nc4ccc(S(=O)(=O)Nc5ncccn5)cc4)ncnc32)[C@@H](O)[C@H]1O
(3 rows)

chembl_21=# select * from rdk.mols where m@>mol_adjust_query_properties('*c1nc(*)ccn1') limit 3;
 molregno |                           m
----------+-------------------------------------------------------
   158659 | CCNc1nccc(-c2c(-c3ccc(F)cc3)ncn2C2CCN(C)CC2)n1
   158743 | Nc1nccc(-c2c(-c3ccc(F)cc3)ncn2C2CCN(Cc3c3)CC2)n1
   158843 | CC1(C)CC(n2cnc(-c3ccc(F)cc3)c2-c2ccnc(N)n2)CC(C)(C)N1
(3 rows)

chembl_21=# select * from rdk.mols where m@>mol_adjust_query_properties('*c1nc(*)cc(*)n1') limit 3;
 molregno |                                  m
----------+------------------------------------------------------------------
   726443 | CN=C(S)NNc1nc(C)cc(C)n1
   561136 | C[C@H](Nc1cc(NC2CC2)nc(C(F)(F)F)n1)[C@@H](Cc1ccc(Cl)cc1)c1(Br)c1
   205784 | CCN(CC)C(=O)CSc1nc(N)cc(Cl)n1
(3 rows)
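
For anyone issuing these queries from application code rather than psql, here is a minimal Python sketch. It only builds the SQL text and the parameter tuple, so that the user-supplied SMILES goes in as a bound parameter instead of being concatenated into the query string. The helper name build_substructure_query, the table whitelist, and the explicit %s::mol cast are assumptions made for illustration; the %s placeholder style is the one used by DB-API drivers such as psycopg2, which the thread itself does not use.

```python
def build_substructure_query(query_smiles, table="rdk.mols", limit=3):
    """Build (sql, params) for a cartridge substructure search.

    The SMILES is returned as a parameter to be bound by the driver,
    never pasted into the SQL text itself.
    """
    # Table names cannot be bound as parameters, so accept only known ones.
    # Both names appear in the thread; adjust them for your own schema.
    allowed_tables = {"rdk.mols", "pat.mols"}
    if table not in allowed_tables:
        raise ValueError("unexpected table name: %r" % (table,))
    if isinstance(limit, bool) or not isinstance(limit, int) or limit < 1:
        raise ValueError("limit must be a positive integer")

    # The ::mol cast is an assumption; it makes the argument type explicit.
    sql = (
        "select * from {table} "
        "where m@>mol_adjust_query_properties(%s::mol) "
        "limit %s"
    ).format(table=table)
    return sql, (query_smiles, limit)


sql, params = build_substructure_query("*c1nc(*)ccn1")
# With a DB-API cursor this would then be run as: cursor.execute(sql, params)
```

Keeping the SMILES out of the SQL text means a stray quote character in the drawn structure's SMILES cannot change the meaning of the query.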

There's more detail in the blog post, but the default behavior is to
convert dummies into generic query atoms and to constrain the substitution
at any other *ring* position.
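
The same behavior can be checked from Python. The sketch below assumes an RDKit build that provides Chem.AdjustQueryProperties (2016.09 or later) and uses toluene and p-xylene as test molecules; those two molecules are chosen here for illustration and do not come from the thread. It compares a query parsed as SMILES, the same string parsed as SMARTS, and the SMILES query after adjustment with the default parameters, and it also counts the atoms that survive parsing of the explicit-hydrogen query from the original question.

```python
from rdkit import Chem

# Illustrative targets (not from the thread).
toluene = Chem.MolFromSmiles("Cc1ccccc1")
p_xylene = Chem.MolFromSmiles("Cc1ccc(C)cc1")

# The same query string interpreted three ways.
queries = {
    # Parsed as SMILES: "*" is a dummy atom that only matches other dummies.
    "smiles": Chem.MolFromSmiles("*c1ccccc1"),
    # Parsed as SMARTS: "*" matches any atom.
    "smarts": Chem.MolFromSmarts("*c1ccccc1"),
    # SMILES query adjusted with the default parameters: dummies become
    # generic query atoms and the other ring positions are constrained.
    "adjusted": Chem.AdjustQueryProperties(Chem.MolFromSmiles("*c1ccccc1")),
}

results = {
    name: (toluene.HasSubstructMatch(q), p_xylene.HasSubstructMatch(q))
    for name, q in queries.items()
}

# Explicit hydrogens written in a SMILES are removed by the default parser,
# so only the six carbons and the dummy atom remain.
with_hs = Chem.MolFromSmiles("C([*])1=C([H])C([H])=C([H])C([H])=C1[H]")
n_atoms = with_hs.GetNumAtoms()

for name, (hits_toluene, hits_xylene) in results.items():
    print(name, "toluene:", hits_toluene, "p-xylene:", hits_xylene)
print("atoms after parsing explicit-H query:", n_atoms)
```

Only the adjusted query separates the two molecules: it accepts the mono-substituted ring and rejects the one with a second substituent, which is the behavior the explicit hydrogens were meant to enforce.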

Best Regards,
-greg



Re: [Rdkit-discuss] Installation of RDKit on windows 7

2016-11-23 Thread Jean-Marc Nuzillard

Dear Riccardo,

I uninstalled Anaconda2 and reinstalled it for me alone (and not for all users,
as I did initially) and the installation of rdkit completed without any
trouble.

I was then able to import Chem from rdkit.

Thank you for suggesting that this global/personal installation of Anaconda
might be a source of trouble.

Best regards,

Jean-Marc

On 22/11/2016 at 17:34, Riccardo Vianello wrote:

Hi Jean-Marc,

On Tue, Nov 22, 2016 at 11:31 AM, Jean-Marc Nuzillard wrote:


Hi,

I am currently attempting to install RDKit after a (forced)
re-installation of Windows 7.
I decided to use Anaconda Python and to get RDKit using the conda
installer:

C:\Users\jmn>conda create -y -c https://conda.anaconda.org/rdkit -n my-rdkit-env rdkit > rdkit_log.txt 2>&1

The -y flag was added in order to perform a non-interactive
(silent) installation.
Download was OK. Installation was not.
The edited version of the rdkit_log.txt file (attached) reflects
the messages in the console window
when conda is used in interactive mode.

It seems that this "lock" problem has to do with conda but not
with RDKit.


It definitely looks like a conda issue, but the message about the
failed lock creation should be a warning, and the actual error at the
end of the log file you sent seems to originate from a unicode issue
(this kind of issue could be more easily explained if there were any
accented letters in the user names or filesystem paths, but this
doesn't seem to be the case).


In general, installing Anaconda on Windows for personal use (under the
user's profile directory) doesn't require any admin rights and is
often said to work more reliably, but in this case it doesn't seem to
be a matter of privileges.


Did you try executing any other simple conda command, just to check 
that the installation is fully functional (something like 'conda list' 
or 'conda search pandas')?


I would also try to execute a 'conda update conda' (will require admin 
privileges given the current setup), just to make sure that the 
problem is still present in the latest version and not already fixed.

Best,
Riccardo


Has someone already experienced such a problem, and possibly
found a workaround?

Best regards,

Jean-Marc

PS. Sorry for the preceding empty message.


-- 
Jean-Marc Nuzillard

Institut de Chimie Moléculaire de Reims
CNRS UMR 7312
Moulin de la Housse
CPCBAI, Bâtiment 18
BP 1039
51687 REIMS Cedex 2
France

Tel : 03 26 91 82 10
Fax : 03 26 91 31 66
http://www.univ-reims.fr/ICMR

http://www.univ-reims.fr/LSD/
http://www.univ-reims.fr/LSD/JmnSoft/






Re: [Rdkit-discuss] Pandas

2016-11-23 Thread Greg Landrum
No worries. This, and Anna's question about similarity searching and clustering,
illustrate a great opportunity for a tutorial on fingerprints and similarity
searching.
-greg

On Wed, Nov 23, 2016 at 3:00 PM +0100, "Chris Swain" wrote:

Thanks for this,

As a chemist who comes from the “cut and paste” school of scripting I’m always
concerned I’m asking something blindingly obvious
;-)

Chris

On 23 Nov 2016, at 12:36, Greg Landrum wrote:

[including rdkit-discuss, because it's relevant there and I'm pretty sure Chris
won't mind and the real Pandas experts may have a better answer than me.]

On Wed, Nov 23, 2016 at 9:51 AM, Chris Swain  wrote:


I quite like storing molecules and associated data in a data frame and I’ve seen
that it is possible to use rdkit for substructure searching. Is it possible to
also do similarity searching?

It's not built in since there are many possible fingerprints that could be used.
It's not quite as convenient as the substructure search, but here's a little 
demo of what you can do to filter based on similarity:

# Start by adding a fingerprint column:
In [18]: df['mfp2'] = [rdMolDescriptors.GetMorganFingerprintAsBitVect(x,2) for x in df['ROMol']]

# and now filter:
In [21]: ndf = df[df.apply(lambda x: DataStructs.TanimotoSimilarity(x['mfp2'],qry)>=0.7, axis=1)]

In [23]: len(df)
Out[23]: 1000
In [24]: len(ndf)
Out[24]: 2
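
For readers wondering what the filter above actually computes: the Tanimoto score is the number of bits two fingerprints share divided by the number of bits set in either one. Below is a minimal pure-Python sketch of that arithmetic, with sets of on-bit indices standing in for RDKit bit vectors. The helper names (tanimoto, filter_by_similarity), the toy fingerprints, and the choice to return 0.0 for two empty fingerprints are all assumptions made for illustration; in real use DataStructs.TanimotoSimilarity works on the bit vectors directly.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        # Convention chosen for this sketch only: two empty fingerprints score 0.
        return 0.0
    return len(a & b) / union


def filter_by_similarity(fingerprints, query, threshold=0.7):
    """Return the positions of fingerprints scoring at least `threshold`."""
    return [
        i for i, fp in enumerate(fingerprints)
        if tanimoto(fp, query) >= threshold
    ]


# Toy data: each set lists the positions of the bits that are switched on.
query = {1, 4, 7, 9, 12}
library = [
    {1, 4, 7, 9, 12},   # identical to the query
    {1, 4, 7, 9},       # shares 4 of 5 bits
    {1, 4, 20, 21},     # shares 2 bits, 7 bits in the union
    {2, 3, 5},          # shares nothing
]

hits = filter_by_similarity(library, query)
print(hits)
```

The 0.7 cutoff mirrors the one in the demo; it keeps the first two toy fingerprints and drops the others.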
-greg









Re: [Rdkit-discuss] smarts vs smiles database queries and explicit hydrogens

2016-11-23 Thread Markus Sitzmann
If I understood Greg correctly, it will be in 2016.09, which isn't in conda just
yet; they are currently working on putting it there.

Markus

-
|  Markus Sitzmann
|  markus.sitzm...@gmail.com

> On 23 Nov 2016, at 15:29, Alexander Klenner-Bajaja  wrote:
> 
> Dear Greg,
>  
> Thank you very much, looking at the results that function was exactly what I 
> was looking for – only I can’t find it in my updated anaconda installation.
>  
> “conda update rdkit” tells me I have the latest version, 2016.03.4, and
> postgres tells me I have version 3.4 of the RDKit extension.
>
> If I understand your blog post correctly it should be in the 2016.03 version?
> What am I missing?
>  
>  
> Best,
>  
> Alex

Re: [Rdkit-discuss] smarts vs smiles database queries and explicit hydrogens

2016-11-23 Thread Greg Landrum
Alex,

I'm glad that looks right.
Unfortunately those changes are in the 2016.09 version of the RDKit, which
was just finalized today.
We haven't completed the anaconda builds for that yet.

-greg



Re: [Rdkit-discuss] Pandas

2016-11-23 Thread Peter Gedeck
Is it possible to use the bulk similarity searching functionality for
better performance instead of the list comprehension?

Best,

Peter




Re: [Rdkit-discuss] smarts vs smiles database queries and explicit hydrogens

2016-11-23 Thread Alexander Klenner-Bajaja
Thank you both Greg & Markus – I'll happily wait for it to appear in conda in
the near future ☺

Alex


Re: [Rdkit-discuss] Pandas

2016-11-23 Thread Brian Kelley
Peter,
  If you have chemfp and can make a chemfp arena, RDKit now supports these
structures for reading and searching.  This is, by far, the fastest way I
know of to do similarity searching.  I believe that Greg's implementation is
compatible with chemfp 1.0, which is available on PyPI:

https://pypi.python.org/pypi/chemfp/1.0

In my copious spare time, I've been trying to think of ways to embed this
directly in a pandas dataframe; however, using them side by side is
certainly doable.

Cheers,
 Brian

