[Rdkit-discuss] PAINS

2013-04-25 Thread Nicholas Firth
Hi RDKitters,

I've just been made aware that the PAINS filters have been nicely put into 
SMARTS (http://www.macinchem.org/reviews/pains/painsFilter.php). My question is: 
what do people think would be the most efficient way to implement this in RDKit?

I don't have much experience with SMARTS matching in RDKit, but I thought I 
could create a list of query molecules from the SMARTS and iterate over these 
using the 'HasSubstructMatch()' function to flag any failures. Does this sound 
sensible or is there a more efficient way?

Best,
Nick

Nicholas C. Firth | PhD Student | Cancer Therapeutics
The Institute of Cancer Research | 15 Cotswold Road | Belmont | Sutton | Surrey | SM2 5NG
T 020 8722 4033 | E nicholas.fi...@icr.ac.uk | W www.icr.ac.uk | Twitter @ICRnews
Facebook www.facebook.com/theinstituteofcancerresearch
Making the discoveries that defeat cancer


The Institute of Cancer Research: Royal Cancer Hospital, a charitable Company 
Limited by Guarantee, Registered in England under Company No. 534147 with its 
Registered Office at 123 Old Brompton Road, London SW7 3RP.

This e-mail message is confidential and for use by the addressee only.  If the 
message is received by anyone other than the addressee, please return the 
message to the sender by replying to it and then delete the message from your 
computer and network.

--
Try New Relic Now  We'll Send You this Cool Shirt
New Relic is the only SaaS-based application performance monitoring service 
that delivers powerful full stack analytics. Optimize and monitor your
browser, app,  servers with just a few lines of code. Try New Relic
and get this awesome Nerd Life shirt! http://p.sf.net/sfu/newrelic_d2d_apr
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] PAINS

2013-04-25 Thread Markus Hartenfeller

  
  
Hi Nick,

Here is some code to get you started:


In:

from rdkit import Chem

molList = ['Nc1cc(C)cc(Br)c1', 'CNc1cc(C)cc(Cl)c1', 'CCC']

# create screening dataset
mol_DB = [Chem.MolFromSmiles(m) for m in molList]

# create query mol from SMARTS (primary amine on an aromatic ring);
# MolFromSmarts returns None on a parse failure rather than raising
substr = Chem.MolFromSmarts('[#7;H2]c1ccccc1')
if substr is None:
    print('error parsing smarts query')
    raise SystemExit(1)

# screen
for m in mol_DB:
    if m is None:
        # MolFromSmiles also returns None for SMILES it cannot parse
        continue
    if m.HasSubstructMatch(substr):
        print(Chem.MolToSmiles(m))

Out:
Cc1cc(N)cc(Br)c1


Is that what you are looking for? You can find more examples
here:

http://www.rdkit.org/docs/GettingStartedInPython.html


Cheers,
Markus



Re: [Rdkit-discuss] PAINS

2013-04-25 Thread Markus Hartenfeller

  
  
Ah OK, I see ;) 

This I don't know, sorry.

I have implemented a de novo design tool based on rdkit that made
heavy use of the SMARTS matching accessed via Python. From my
experience the substructure matching will probably not be the
performance bottleneck (depending on how sophisticated your scoring
function is, of course). 

Something you might want to consider regarding the performance of
SMARTS queries is the SMARTS themselves. This is a quote from the
daylight page
http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html:

  
  Efficiency Considerations
  The Daylight 4.x SMARTS Toolkit provides a function, dt_smarts_opt(),
  which automatically optimizes a SMARTS by reordering, expanding, and/or
  consolidating atom and bond expressions. Programs which use this feature
  (e.g. the Merlin program) can be expected to be near optimal in terms of
  the time used to search typical organic structures.

  When this optimization method is not used, there are some things which
  can be done to facilitate efficient (fast) searching operations using
  SMARTS. It is important to recognize that SMARTS target strings are
  processed in strictly left-to-right order. For this reason, substantial
  gains in speed can be achieved by following these guidelines:

  - Uncommon atoms or bond arrangements should be placed early in SMARTS
    targets.
  - In an "and-expression", the less common atom or bond specifications
    should be placed early.
  - In an "or-expression", the less common atom or bond specifications
    should be placed last.




I understand that the SMARTS you want to use have already been
designed, and of course what is stated by Daylight refers to
their implementation. But (a) I think it applies to the rdkit
implementation as well (please correct me if I'm wrong here, Greg)
and (b) in case performance is really critical this could be another
point to look at.

Best,
Markus


On 04/25/2013 04:13 PM, Nicholas Firth wrote:

  Hi Markus,

  Thanks for the quick reply, I don't think I worded my question very
  well though.

  I would like to know which is the most efficient way to implement the
  SMARTS queries: the way I suggested, an alternative way in Python, or
  piping into the C++ side of things. The reason is that I work on de novo
  design and I generate a lot of molecules, so efficiency is quite
  important.

  Sorry for the confusion,
  Best,
  Nick
  



-- 
  Markus Hartenfeller
  Chemoinformatics Specialist
  Molecular Health GmbH
  Belfortstr. 2
  69115 Heidelberg
  Germany
  Tel: +49 6221 43851 209
  Fax: +49 6221 43851 100
  Email: markus.hartenfel...@molecularhealth.com
  www.molecularhealth.com
  
  --
  Molecular Health GmbH
  
  Geschaeftsfuehrer: Dr. Stephan Brock/
  Dr. Friedrich von Bohlen und Halbach
  
  Sitz der Gesellschaft: Heidelberg
  Handelsregister: Amtsgericht Mannheim - HRB 338037
  --

  


[Rdkit-discuss] Building rdkit on Ubuntu 12.10

2013-04-25 Thread hari jayaram
Hi,
I did:
export RDBASE=/home/hari/RDKit_2012_09_1
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$RDBASE/lib

Then I cd'd into the build directory, ran cmake .. and then ran make.

At around 24% I get the following error (see below).

I installed the Ubuntu-blessed libboost-python1.49-dev.

Any ideas how to get around this? On a related note, the Ubuntu Synaptic
package repository did have an rdkit library, but it does not work and
complains:

>>> from rdkit import Chem
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/rdkit/Chem/__init__.py", line 18, in <module>
    from rdkit import rdBase
ImportError: cannot import name rdBase

Any idea how to get my own build to work, or how to get the Synaptic-provided
build to work?

Thanks a tonne
Hari





Linking CXX static library libCatalogs_static.a
[ 24%] Built target Catalogs_static
Scanning dependencies of target GraphMol
[ 25%] Building CXX object Code/GraphMol/CMakeFiles/GraphMol.dir/Atom.cpp.o
[ 25%] Building CXX object Code/GraphMol/CMakeFiles/GraphMol.dir/QueryAtom.cpp.o
In file included from /usr/local/include/boost/thread/detail/platform.hpp:17:0,
                 from /usr/local/include/boost/thread/mutex.hpp:12,
                 from /home/hari/RDKit_2012_09_1/Code/GraphMol/QueryOps.h:20,
                 from /home/hari/RDKit_2012_09_1/Code/GraphMol/QueryAtom.h:15,
                 from /home/hari/RDKit_2012_09_1/Code/GraphMol/QueryAtom.cpp:11:
/usr/local/include/boost/config/requires_threads.hpp:29:4: error: #error Threading support unavaliable: it has been explicitly disabled with BOOST_DISABLE_THREADS
In file included from /usr/local/include/boost/thread/mutex.hpp:12:0,
                 from /home/hari/RDKit_2012_09_1/Code/GraphMol/QueryOps.h:20,
                 from /home/hari/RDKit_2012_09_1/Code/GraphMol/QueryAtom.h:15,
                 from /home/hari/RDKit_2012_09_1/Code/GraphMol/QueryAtom.cpp:11:
/usr/local/include/boost/thread/detail/platform.hpp:67:9: error: #error Sorry, no boost threads are available for this platform.
In file included from /home/hari/RDKit_2012_09_1/Code/GraphMol/QueryOps.h:20:0,
                 from /home/hari/RDKit_2012_09_1/Code/GraphMol/QueryAtom.h:15,
                 from /home/hari/RDKit_2012_09_1/Code/GraphMol/QueryAtom.cpp:11:
/usr/local/include/boost/thread/mutex.hpp:18:2: error: #error Boost threads unavailable on this platform
In file included from /home/hari/RDKit_2012_09_1/Code/GraphMol/QueryAtom.h:15:0,
                 from /home/hari/RDKit_2012_09_1/Code/GraphMol/QueryAtom.cpp:11:
/home/hari/RDKit_2012_09_1/Code/GraphMol/QueryOps.h:313:5: error: ‘mutex’ in namespace ‘boost’ does not name a type
make[2]: *** [Code/GraphMol/CMakeFiles/GraphMol.dir/QueryAtom.cpp.o] Error 1
make[1]: *** [Code/GraphMol/CMakeFiles/GraphMol.dir/all] Error 2
make: *** [all] Error 2


Re: [Rdkit-discuss] PAINS

2013-04-25 Thread Simon Saubern
Nicholas, this is an O(n^2) problem (many-to-many) and difficult to make 
efficient. It is, however, 'embarrassingly parallel' so you can take 
advantage of multiple cores.
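
As a rough illustration of the "embarrassingly parallel" point, here is a minimal sketch that spreads the molecules over several worker processes with Python's standard multiprocessing module. The two SMARTS are hypothetical stand-ins rather than real PAINS patterns, the names is_flagged and screen_parallel are invented for this example, and treating unparsable SMILES as flagged is just one possible choice.

```python
from multiprocessing import Pool

from rdkit import Chem

# Hypothetical stand-in patterns (not real PAINS filters). They are compiled
# at import time, so every worker process gets its own ready-made copies.
PATTERNS = [Chem.MolFromSmarts(s) for s in ("[#7;H2]c", "Brc")]


def is_flagged(smiles):
    """Return True if the molecule matches any pattern (or fails to parse)."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return True  # assumption: treat unparsable input as a failure
    return any(mol.HasSubstructMatch(p) for p in PATTERNS)


def screen_parallel(smiles_list, processes=2, chunksize=1000):
    """Split the list across worker processes; results keep the input order."""
    with Pool(processes=processes) as pool:
        return pool.map(is_flagged, smiles_list, chunksize=chunksize)
```

Each molecule is checked independently of all the others, which is why the list can simply be cut into chunks. When calling screen_parallel from a script, the call is best placed under an `if __name__ == "__main__":` guard so that worker processes do not re-run it on platforms that spawn rather than fork.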

Have a look at how these 2 KNIME workflows implement the PAINS filters 
with the RDKit nodes in KNIME:

http://www.myexperiment.org/workflows/1841.html
http://www.myexperiment.org/workflows/2485.html

-- 

Cheers,

Simon




Re: [Rdkit-discuss] Building rdkit on Ubuntu 12.10

2013-04-25 Thread Paul Emsley
On 25/04/13 23:43, hari jayaram wrote:
 Any ideas how to get around this? On a related note, the Ubuntu
 Synaptic package repository did have an rdkit library, but it does not
 work and complains:

 >>> from rdkit import Chem
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File "/usr/local/lib/python2.7/dist-packages/rdkit/Chem/__init__.py", line 18, in <module>
     from rdkit import rdBase
 ImportError: cannot import name rdBase

Hmm... AFAIR, synaptic packages should not use files in /usr/local. Do 
you still get the same problem if you have not defined PYTHONPATH, 
PYTHONHOME or LD_LIBRARY_PATH?
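
One way to check this is sketched below, using only the Python standard library: it reports whether the three variables are set and which file Python would load for a given package. The helper names report_environment and locate_package are made up for this example.

```python
import importlib.util
import os


def report_environment(var_names=("PYTHONPATH", "PYTHONHOME", "LD_LIBRARY_PATH")):
    """Map each variable name to its value, or None if it is not set."""
    return {name: os.environ.get(name) for name in var_names}


def locate_package(package_name):
    """Return the file Python would import the package from, or None."""
    spec = importlib.util.find_spec(package_name)
    if spec is None:
        return None
    return spec.origin


if __name__ == "__main__":
    for name, value in report_environment().items():
        print("%s = %r" % (name, value))
    print("rdkit would be imported from:", locate_package("rdkit"))
```

If the reported location is under /usr/local, as in the traceback above, the interpreter is picking up a copy of rdkit other than the distribution package.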








There are 2 issues here:

1) "Threading support unavaliable: it has been explicitly disabled with 
BOOST_DISABLE_THREADS" (someone at Boost Central can't spell :-). I doubt 
that compiling Boost without threads was a good idea.

2) As the above is unusual, this error: 
/home/hari/RDKit_2012_09_1/Code/GraphMol/QueryOps.h:313:5: error: 
‘mutex’ in namespace ‘boost’ does not name a type 
was not trapped before. Ideally QueryOps (or anything else) should not use 
boost mutex if threads have been disabled (this might be a pain to code up).


HTH,

Paul.






Re: [Rdkit-discuss] PAINS

2013-04-25 Thread Greg Landrum
Hi,


On Thu, Apr 25, 2013 at 4:44 PM, Markus Hartenfeller 
markus.hartenfel...@molecularhealth.com wrote:

  Ah OK, I see ;)

 This I don't know, sorry.

 I have implemented a de novo design tool based on rdkit that made heavy
 use of the smarts matching accessed via Python. From my experience the
 substructure matching will probably not be the performance bottleneck
 (depending on how sophisticated your scoring functions is, of course).


I think Markus has it right: the substructure matching is unlikely to be
the bottleneck. As long as you construct the query molecules via
MolFromSmarts in advance (outside the loop), you should be fine.
If you do find that you're spending lots of time waiting for the
substructure matcher, you can try moving the whole PAINS bit into C++. This
wouldn't be overly complex and would make taking advantage of the
embarrassingly parallel nature of the problem (mentioned by Simon in
another message) easier.
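
A minimal sketch of the "construct the queries in advance" advice follows. The filter names and SMARTS here are hypothetical stand-ins (a real run would load the published PAINS SMARTS list), and compile_filters and failed_filters are names invented for this example.

```python
from rdkit import Chem

# Hypothetical stand-ins for PAINS patterns: name -> SMARTS.
FILTER_SMARTS = {
    "aromatic_primary_amine": "[#7;H2]c",
    "aryl_bromide": "Brc",
}


def compile_filters(smarts_by_name):
    """Parse each SMARTS once; fail loudly on any that does not parse."""
    compiled = []
    for name, smarts in smarts_by_name.items():
        patt = Chem.MolFromSmarts(smarts)
        if patt is None:  # MolFromSmarts returns None on a parse failure
            raise ValueError("could not parse SMARTS for %s: %s" % (name, smarts))
        compiled.append((name, patt))
    return compiled


# Done once, outside the screening loop.
FILTERS = compile_filters(FILTER_SMARTS)


def failed_filters(mol, filters=FILTERS):
    """Return the names of all filters that match the molecule."""
    return [name for name, patt in filters if mol.HasSubstructMatch(patt)]


if __name__ == "__main__":
    for smi in ["Nc1cc(C)cc(Br)c1", "CNc1cc(C)cc(Cl)c1", "CCC"]:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue  # skip SMILES that failed to parse
        print(smi, failed_filters(mol))
```

The screening loop only ever calls HasSubstructMatch on query molecules that already exist, so the SMARTS parsing cost is paid once per filter rather than once per molecule.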

Something you might want to consider regarding the performance of SMARTS
 queries is the SMARTS themselves. This is a quote from the daylight page
 http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html:


 Efficiency Considerations
 The Daylight 4.x SMARTS Toolkit provides a function, dt_smarts_opt(),
 which automatically optimizes a SMARTS by reordering, expanding, and/or
 consolidating atom and bond expressions. Programs which use this feature
 (e.g. the Merlin program) can be expected to be near optimal in terms of
 the time used to search typical organic structures.
 When this optimization method is not used, there are some things which can
 be done to facilitate efficient (fast) searching operations using SMARTS.
 It is important to recognize that SMARTS target strings are processed in
 strictly left-to-right order. For this reason, substantial gains in speed
 can be achieved by following these guidelines:


- Uncommon atoms or bond arrangements should be placed early in SMARTS
targets.

 - In an and-expression, the less common atom or bond specifications
should be placed early.

 - In an or-expression, the less common atom or bond specifications
should be placed last.



 I understand that the SMARTS you want to use have already been designed,
 and of course what is stated by Daylight refers to their
 implementation. But (a) I think it applies to the rdkit implementation as
 well (please correct me if I'm wrong here, Greg) and (b) in case
 performance is really critical this could be another point to look at.


The form of the SMARTS definitely matters. Neither SmartsToMol nor the
substructure matcher makes any attempt to optimize queries (aside from
recognizing repeated recursive SMARTS expressions), so the advice above is
quite good.
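
To see whether ordering matters for a particular query, one option is to time two equivalent forms of it. The sketch below is only an illustration under stated assumptions: the bromobenzene query and the tiny molecule set are hypothetical, count_hits and time_query are names invented here, and no particular speed-up is claimed, since that depends on the data and the RDKit version.

```python
import timeit

from rdkit import Chem

# Two equivalent, hypothetical queries for a bromobenzene fragment.
RARE_FIRST = Chem.MolFromSmarts("Brc1ccccc1")  # uncommon atom (Br) leads
RARE_LAST = Chem.MolFromSmarts("c1ccccc1Br")   # uncommon atom comes last

MOLS = [
    Chem.MolFromSmiles(smi)
    for smi in ("Nc1cc(C)cc(Br)c1", "CNc1cc(C)cc(Cl)c1", "c1ccc2ccccc2c1", "CCC")
]


def count_hits(patt, mols=MOLS):
    """Count how many molecules contain the query substructure."""
    return sum(1 for m in mols if m.HasSubstructMatch(patt))


def time_query(patt, repeats=200):
    """Total wall time in seconds for `repeats` passes over the molecule set."""
    return timeit.timeit(lambda: count_hits(patt), number=repeats)


if __name__ == "__main__":
    # Both forms must find the same molecules; only the timing may differ.
    print("hits:", count_hits(RARE_FIRST), count_hits(RARE_LAST))
    print("rare atom first: %.4f s" % time_query(RARE_FIRST))
    print("rare atom last:  %.4f s" % time_query(RARE_LAST))
```

A realistic comparison would use a representative sample of the molecules being screened rather than four toy structures.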

-greg