[Rdkit-discuss] PAINS
Hi RDKitters,

I've just been made aware that the PAINS filters have been nicely put into SMARTS (http://www.macinchem.org/reviews/pains/painsFilter.php). My question is: what do people think would be the most efficient way to implement this in RDKit? I don't have much experience with SMARTS matching in RDKit, but I thought I could create a list of query molecules from the SMARTS and iterate over these using the 'HasSubstructMatch()' function to show any failures. Does this sound sensible, or is there a more efficient way?

Best,
Nick

Nicholas C. Firth | PhD Student | Cancer Therapeutics
The Institute of Cancer Research | 15 Cotswold Road | Belmont | Sutton | Surrey | SM2 5NG
T 020 8722 4033 | E nicholas.fi...@icr.ac.uk | W www.icr.ac.uk | Twitter @ICRnews
Facebook www.facebook.com/theinstituteofcancerresearch
Making the discoveries that defeat cancer

The Institute of Cancer Research: Royal Cancer Hospital, a charitable Company Limited by Guarantee, Registered in England under Company No. 534147 with its Registered Office at 123 Old Brompton Road, London SW7 3RP. This e-mail message is confidential and for use by the addressee only. If the message is received by anyone other than the addressee, please return the message to the sender by replying to it and then delete the message from your computer and network.

--
Try New Relic Now We'll Send You this Cool Shirt
New Relic is the only SaaS-based application performance monitoring service that delivers powerful full stack analytics. Optimize and monitor your browser, app, servers with just a few lines of code. Try New Relic and get this awesome Nerd Life shirt! http://p.sf.net/sfu/newrelic_d2d_apr
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] PAINS
Hi Nick,

Here is some code to get you started:

In:

    from rdkit import Chem

    molList = ['Nc1cc(C)cc(Br)c1', 'CNc1cc(C)cc(Cl)c1', 'CCC']

    # create screening dataset
    mol_DB = [Chem.MolFromSmiles(m) for m in molList]

    # create query mol from SMARTS (primary amine on a benzene ring);
    # MolFromSmarts returns None on a parse failure instead of raising
    substr = Chem.MolFromSmarts('[#7;H2]c1ccccc1')
    if substr is None:
        raise SystemExit('error parsing smarts query')

    # screen
    for m in mol_DB:
        if m is None:
            continue  # SMILES that failed to parse
        if m.HasSubstructMatch(substr):
            print(Chem.MolToSmiles(m))

Out:

    Cc1cc(N)cc(Br)c1

Is that what you are looking for? You can find more examples here: http://www.rdkit.org/docs/GettingStartedInPython.html

Cheers,
Markus

On 04/25/2013 02:36 PM, Nicholas Firth wrote:
> Hi RDKitters,
>
> I've just been made aware that the PAINS filters have been nicely put into SMARTS (http://www.macinchem.org/reviews/pains/painsFilter.php). My question is: what do people think would be the most efficient way to implement this in RDKit? I don't have much experience with SMARTS matching in RDKit, but I thought I could create a list of query molecules from the SMARTS and iterate over these using the 'HasSubstructMatch()' function to show any failures. Does this sound sensible, or is there a more efficient way?
>
> Best,
> Nick
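The approach Nick describes (build the query molecules from SMARTS once, then test each molecule with HasSubstructMatch) might be sketched as follows. This is only an illustration: the two patterns and their names are placeholders and are not taken from the PAINS set, and `compile_filters` and `failed_filters` are hypothetical helper names.

```python
from rdkit import Chem

# Placeholder patterns for illustration only; these are NOT PAINS filters.
EXAMPLE_SMARTS = {
    "primary_aromatic_amine": "[#7;H2]c",
    "aryl_bromide": "c[Br]",
}


def compile_filters(smarts_by_name):
    """Parse each SMARTS once; MolFromSmarts returns None on a parse failure."""
    compiled = {}
    for name, smarts in smarts_by_name.items():
        query = Chem.MolFromSmarts(smarts)
        if query is None:
            raise ValueError("could not parse SMARTS for %s: %s" % (name, smarts))
        compiled[name] = query
    return compiled


def failed_filters(mol, compiled):
    """Return the sorted names of all filters that match the molecule."""
    return sorted(name for name, query in compiled.items()
                  if mol.HasSubstructMatch(query))


# Build the query molecules a single time, before any screening loop.
FILTERS = compile_filters(EXAMPLE_SMARTS)

if __name__ == "__main__":
    for smi in ["Nc1cc(C)cc(Br)c1", "CNc1cc(C)cc(Cl)c1", "CCC"]:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue  # skip SMILES that fail to parse
        print(smi, failed_filters(mol, FILTERS))
```

Returning the names of the matched filters, rather than a bare pass/fail flag, makes it easier to see which pattern rejected a given molecule.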
Re: [Rdkit-discuss] PAINS
Ah OK, I see ;) This I don't know, sorry.

I have implemented a de novo design tool based on RDKit that made heavy use of SMARTS matching accessed via Python. From my experience, the substructure matching will probably not be the performance bottleneck (depending on how sophisticated your scoring function is, of course).

Something you might want to consider regarding the performance of SMARTS queries is the SMARTS themselves. This is a quote from the Daylight page http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html:

    Efficiency Considerations

    The Daylight 4.x SMARTS Toolkit provides a function, dt_smarts_opt(), which automatically optimizes a SMARTS by reordering, expanding, and/or consolidating atom and bond expressions. Programs which use this feature (e.g. the Merlin program) can be expected to be near optimal in terms of the time used to search typical organic structures. When this optimization method is not used, there are some things which can be done to facilitate efficient (fast) searching operations using SMARTS. It is important to recognize that SMARTS target strings are processed in strictly left-to-right order. For this reason, substantial gains in speed can be achieved by following these guidelines:

    - Uncommon atoms or bond arrangements should be placed early in SMARTS targets.
    - In an "and-expression", the less common atom or bond specifications should be placed early.
    - In an "or-expression", the less common atom or bond specifications should be placed last.

I understand that the SMARTS you want to use have already been designed, and of course what Daylight states refers to their implementation. But (a) I think it applies to the RDKit implementation as well (please correct me if I'm wrong here, Greg) and (b) in case performance is really critical, this could be another point to look at.

Best,
Markus

On 04/25/2013 04:13 PM, Nicholas Firth wrote:
> Hi Markus,
>
> Thanks for the quick reply. I don't think I worded my question very well, though. I would like to know which is the most efficient way to implement the SMARTS queries: the way I suggested, an alternative way in Python, or piping into the C++ side of things. The reason is that I work on de novo design and I generate a lot of molecules, so the most efficient way is quite important.
>
> Sorry for the confusion,
>
> Best,
> Nick
>
> On 25 Apr 2013, at 14:52, Markus Hartenfeller markus.hartenfel...@molecularhealth.com wrote:
>> m.HasSubstructMatch

--
Markus Hartenfeller
Chemoinformatics Specialist
Molecular Health GmbH
Belfortstr. 2
69115 Heidelberg
Germany
Tel: +49 6221 43851 209
Fax: +49 6221 43851 100
Email: markus.hartenfel...@molecularhealth.com
www.molecularhealth.com
--
Molecular Health GmbH
Geschaeftsfuehrer: Dr. Stephan Brock / Dr. Friedrich von Bohlen und Halbach
Sitz der Gesellschaft: Heidelberg
Handelsregister: Amtsgericht Mannheim - HRB 338037
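To see why the ordering guidelines quoted above can help, here is a toy model in plain Python. It uses neither RDKit nor Daylight, and the atom counts are made-up numbers. It assumes, as the Daylight text states for their own implementation, that primitives are tested strictly left to right and that evaluation stops as soon as the result is known. `count_tests`, `not_nitrogen` and `is_bromine` are hypothetical names.

```python
def count_tests(atoms, predicates, mode):
    """Count primitive tests done when evaluating predicates left to right.

    mode="and": stop at the first predicate that fails.
    mode="or":  stop at the first predicate that succeeds.
    """
    tests = 0
    for atom in atoms:
        for pred in predicates:
            tests += 1
            result = pred(atom)
            if mode == "and" and not result:
                break
            if mode == "or" and result:
                break
    return tests


def not_nitrogen(atom):
    """A common specification: true for 91 of the 100 atoms below."""
    return atom != "N"


def is_bromine(atom):
    """An uncommon specification: true for 1 of the 100 atoms below."""
    return atom == "Br"


# Hypothetical atom population: carbon is common, bromine is rare.
ATOMS = ["C"] * 90 + ["N"] * 9 + ["Br"] * 1

if __name__ == "__main__":
    print("and, uncommon first:", count_tests(ATOMS, [is_bromine, not_nitrogen], "and"))
    print("and, common first:  ", count_tests(ATOMS, [not_nitrogen, is_bromine], "and"))
    print("or, uncommon last:  ", count_tests(ATOMS, [not_nitrogen, is_bromine], "or"))
    print("or, uncommon first: ", count_tests(ATOMS, [is_bromine, not_nitrogen], "or"))
```

In this model an and-expression is cheapest when the test most likely to fail comes first, and an or-expression is cheapest when the test most likely to succeed comes first, which is the same advice as in the quoted guidelines.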
[Rdkit-discuss] Building rdkit on Ubuntu 12.10
Hi,

I did:

    export RDBASE=/home/hari/RDKit_2012_09_1
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$RDBASE/lib

Then I cd into the build directory, run "cmake ..", and then run "make". At around 24% I get the error shown below. I installed the Ubuntu-blessed libboost-python1.49-dev. Any ideas how to get around this?

On a related note, the Ubuntu Synaptic package repository did have an rdkit library, but it does not work and complains:

    >>> from rdkit import Chem
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/local/lib/python2.7/dist-packages/rdkit/Chem/__init__.py", line 18, in <module>
        from rdkit import rdBase
    ImportError: cannot import name rdBase

Any idea how to get my own build to work, or how to get the Synaptic-provided build to work?

Thanks a tonne,
Hari

    Linking CXX static library libCatalogs_static.a
    [ 24%] Built target Catalogs_static
    Scanning dependencies of target GraphMol
    [ 25%] Building CXX object Code/GraphMol/CMakeFiles/GraphMol.dir/Atom.cpp.o
    [ 25%] Building CXX object Code/GraphMol/CMakeFiles/GraphMol.dir/QueryAtom.cpp.o
    In file included from /usr/local/include/boost/thread/detail/platform.hpp:17:0,
                     from /usr/local/include/boost/thread/mutex.hpp:12,
                     from /home/hari/RDKit_2012_09_1/Code/GraphMol/QueryOps.h:20,
                     from /home/hari/RDKit_2012_09_1/Code/GraphMol/QueryAtom.h:15,
                     from /home/hari/RDKit_2012_09_1/Code/GraphMol/QueryAtom.cpp:11:
    /usr/local/include/boost/config/requires_threads.hpp:29:4: error: #error Threading support unavaliable: it has been explicitly disabled with BOOST_DISABLE_THREADS
    In file included from /usr/local/include/boost/thread/mutex.hpp:12:0,
                     from /home/hari/RDKit_2012_09_1/Code/GraphMol/QueryOps.h:20,
                     from /home/hari/RDKit_2012_09_1/Code/GraphMol/QueryAtom.h:15,
                     from /home/hari/RDKit_2012_09_1/Code/GraphMol/QueryAtom.cpp:11:
    /usr/local/include/boost/thread/detail/platform.hpp:67:9: error: #error Sorry, no boost threads are available for this platform.
    In file included from /home/hari/RDKit_2012_09_1/Code/GraphMol/QueryOps.h:20:0,
                     from /home/hari/RDKit_2012_09_1/Code/GraphMol/QueryAtom.h:15,
                     from /home/hari/RDKit_2012_09_1/Code/GraphMol/QueryAtom.cpp:11:
    /usr/local/include/boost/thread/mutex.hpp:18:2: error: #error Boost threads unavailable on this platform
    In file included from /home/hari/RDKit_2012_09_1/Code/GraphMol/QueryAtom.h:15:0,
                     from /home/hari/RDKit_2012_09_1/Code/GraphMol/QueryAtom.cpp:11:
    /home/hari/RDKit_2012_09_1/Code/GraphMol/QueryOps.h:313:5: error: ‘mutex’ in namespace ‘boost’ does not name a type
    make[2]: *** [Code/GraphMol/CMakeFiles/GraphMol.dir/QueryAtom.cpp.o] Error 1
    make[1]: *** [Code/GraphMol/CMakeFiles/GraphMol.dir/all] Error 2
    make: *** [all] Error 2
Re: [Rdkit-discuss] PAINS
Nicholas,

this is an O(n^2) problem (many-to-many) and difficult to make efficient. It is, however, 'embarrassingly parallel', so you can take advantage of multiple cores.

Have a look at how these two KNIME workflows implement the PAINS filters with the RDKit nodes in KNIME:

http://www.myexperiment.org/workflows/1841.html
http://www.myexperiment.org/workflows/2485.html

--
Cheers,
Simon
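Because each molecule can be screened independently, the work splits cleanly across processes. A minimal sketch using Python's multiprocessing module is given below; it is not how the KNIME workflows are implemented. The SMARTS pattern is a placeholder rather than a real PAINS filter, and `screen_chunk` and `chunked` are hypothetical helper names. SMILES strings, rather than molecule objects, are handed to the workers to keep the transferred data small.

```python
from multiprocessing import Pool

from rdkit import Chem

# Placeholder pattern for illustration only; NOT a PAINS filter.
QUERY_SMARTS = ["[#7;H2]c"]
_QUERIES = None  # per-process cache of compiled query molecules


def _get_queries():
    """Compile the SMARTS once per worker process."""
    global _QUERIES
    if _QUERIES is None:
        _QUERIES = [Chem.MolFromSmarts(s) for s in QUERY_SMARTS]
    return _QUERIES


def screen_chunk(smiles_chunk):
    """Return the SMILES in the chunk that match at least one query."""
    queries = _get_queries()
    hits = []
    for smi in smiles_chunk:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue  # skip SMILES that fail to parse
        if any(mol.HasSubstructMatch(q) for q in queries):
            hits.append(smi)
    return hits


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


if __name__ == "__main__":
    smiles = ["Nc1cc(C)cc(Br)c1", "CNc1cc(C)cc(Cl)c1", "CCC", "Nc1ccccc1"]
    with Pool(processes=2) as pool:
        results = pool.map(screen_chunk, chunked(smiles, 2))
    flagged = [smi for chunk in results for smi in chunk]
    print(flagged)
```

Sending chunks rather than single molecules keeps the inter-process overhead low; a sensible chunk size depends on the data set and would need to be measured.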
Re: [Rdkit-discuss] Building rdkit on Ubuntu 12.10
On 25/04/13 23:43, hari jayaram wrote:
> Hi I did a
>
>     export RDBASE=/home/hari/RDKit_2012_09_1
>     export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$RDBASE/lib
>
> Then cd into the build directory. Run "cmake ..". Then run "make". At around 24% I get the following error (see below). I installed the Ubuntu-blessed libboost-python1.49-dev. Any ideas how to get around this?
>
> On a related note, the Ubuntu Synaptic package repository did have an rdkit library, but it does not work and complains:
>
>     >>> from rdkit import Chem
>     Traceback (most recent call last):
>       File "<stdin>", line 1, in <module>
>       File "/usr/local/lib/python2.7/dist-packages/rdkit/Chem/__init__.py", line 18, in <module>
>         from rdkit import rdBase
>     ImportError: cannot import name rdBase

Hmm... AFAIR, Synaptic packages should not use files in /usr/local. Do you still get the same problem if you have not defined PYTHONPATH, PYTHONHOME or LD_LIBRARY_PATH?

>     Linking CXX static library libCatalogs_static.a
>     [ 24%] Built target Catalogs_static
>     Scanning dependencies of target GraphMol
>     [ 25%] Building CXX object Code/GraphMol/CMakeFiles/GraphMol.dir/Atom.cpp.o
>     [ 25%] Building CXX object Code/GraphMol/CMakeFiles/GraphMol.dir/QueryAtom.cpp.o
>     In file included from /usr/local/include/boost/thread/detail/platform.hpp:17:0,
>                      from /usr/local/include/boost/thread/mutex.hpp:12,
>                      from /home/hari/RDKit_2012_09_1/Code/GraphMol/QueryOps.h:20,
>                      from /home/hari/RDKit_2012_09_1/Code/GraphMol/QueryAtom.h:15,
>                      from /home/hari/RDKit_2012_09_1/Code/GraphMol/QueryAtom.cpp:11:
>     /usr/local/include/boost/config/requires_threads.hpp:29:4: error: #error Threading support unavaliable: it has been explicitly disabled with BOOST_DISABLE_THREADS
>     In file included from /usr/local/include/boost/thread/mutex.hpp:12:0,
>                      from /home/hari/RDKit_2012_09_1/Code/GraphMol/QueryOps.h:20,
>                      from /home/hari/RDKit_2012_09_1/Code/GraphMol/QueryAtom.h:15,
>                      from /home/hari/RDKit_2012_09_1/Code/GraphMol/QueryAtom.cpp:11:
>     /usr/local/include/boost/thread/detail/platform.hpp:67:9: error: #error Sorry, no boost threads are available for this platform.
>     In file included from /home/hari/RDKit_2012_09_1/Code/GraphMol/QueryOps.h:20:0,
>                      from /home/hari/RDKit_2012_09_1/Code/GraphMol/QueryAtom.h:15,
>                      from /home/hari/RDKit_2012_09_1/Code/GraphMol/QueryAtom.cpp:11:
>     /usr/local/include/boost/thread/mutex.hpp:18:2: error: #error Boost threads unavailable on this platform
>     In file included from /home/hari/RDKit_2012_09_1/Code/GraphMol/QueryAtom.h:15:0,
>                      from /home/hari/RDKit_2012_09_1/Code/GraphMol/QueryAtom.cpp:11:
>     /home/hari/RDKit_2012_09_1/Code/GraphMol/QueryOps.h:313:5: error: ‘mutex’ in namespace ‘boost’ does not name a type
>     make[2]: *** [Code/GraphMol/CMakeFiles/GraphMol.dir/QueryAtom.cpp.o] Error 1
>     make[1]: *** [Code/GraphMol/CMakeFiles/GraphMol.dir/all] Error 2
>     make: *** [all] Error 2

There are 2 issues here:

1) Threading support unavaliable: it has been explicitly disabled with BOOST_DISABLE_THREADS

(Someone at Boost Central can't spell :-) I doubt that compiling Boost without threads was a good idea.

2) Because the situation above is unusual, this error:

    /home/hari/RDKit_2012_09_1/Code/GraphMol/QueryOps.h:313:5: error: ‘mutex’ in namespace ‘boost’ does not name a type

was not trapped before. Ideally QueryOps (or anything else) should not use boost mutex if threads have been disabled (this might be a pain to code up).

HTH,
Paul.
Re: [Rdkit-discuss] PAINS
Hi,

On Thu, Apr 25, 2013 at 4:44 PM, Markus Hartenfeller markus.hartenfel...@molecularhealth.com wrote:
> Ah OK, I see ;) This I don't know, sorry. I have implemented a de novo design tool based on RDKit that made heavy use of SMARTS matching accessed via Python. From my experience, the substructure matching will probably not be the performance bottleneck (depending on how sophisticated your scoring function is, of course).

I think Markus has it right: the substructure matching is unlikely to be the bottleneck. As long as you construct the query molecules via MolFromSmarts in advance (outside the loop), you should be fine.

If you do find that you're spending lots of time waiting for the substructure matcher, you can try moving the whole PAINS bit into C++. This wouldn't be overly complex, and it would make it easier to take advantage of the embarrassingly parallel nature of the problem (mentioned by Simon in another message).

> Something you might want to consider regarding the performance of SMARTS queries is the SMARTS themselves. This is a quote from the Daylight page http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html:
>
>     Efficiency Considerations
>
>     The Daylight 4.x SMARTS Toolkit provides a function, dt_smarts_opt(), which automatically optimizes a SMARTS by reordering, expanding, and/or consolidating atom and bond expressions. Programs which use this feature (e.g. the Merlin program) can be expected to be near optimal in terms of the time used to search typical organic structures. When this optimization method is not used, there are some things which can be done to facilitate efficient (fast) searching operations using SMARTS. It is important to recognize that SMARTS target strings are processed in strictly left-to-right order. For this reason, substantial gains in speed can be achieved by following these guidelines:
>
>     - Uncommon atoms or bond arrangements should be placed early in SMARTS targets.
>     - In an and-expression, the less common atom or bond specifications should be placed early.
>     - In an or-expression, the less common atom or bond specifications should be placed last.
>
> I understand that the SMARTS you want to use have already been designed, and of course what Daylight states refers to their implementation. But (a) I think it applies to the RDKit implementation as well (please correct me if I'm wrong here, Greg) and (b) in case performance is really critical, this could be another point to look at.

The form of the SMARTS definitely matters. Neither SmartsToMol nor the substructure matcher makes any attempt to optimize queries (aside from recognizing repeated recursive SMARTS expressions), so the advice above is quite good.

-greg
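On moving the PAINS matching to the C++ side: RDKit releases made after this thread include a FilterCatalog module, implemented in C++, that bundles a SMARTS encoding of the PAINS filters. Assuming an RDKit version that provides `rdkit.Chem.FilterCatalog`, a sketch of using it from Python could look like this; `pains_hits` is a hypothetical helper name and not part of RDKit.

```python
from rdkit import Chem
from rdkit.Chem.FilterCatalog import FilterCatalog, FilterCatalogParams

# Build the catalog once; the matching itself runs in C++.
params = FilterCatalogParams()
params.AddCatalog(FilterCatalogParams.FilterCatalogs.PAINS)
PAINS_CATALOG = FilterCatalog(params)


def pains_hits(smiles):
    """Return descriptions of the PAINS entries matched by a SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError("could not parse SMILES: %s" % smiles)
    return [entry.GetDescription() for entry in PAINS_CATALOG.GetMatches(mol)]


if __name__ == "__main__":
    for smi in ["CCC", "Nc1cc(C)cc(Br)c1"]:
        print(smi, pains_hits(smi))
```

As with hand-built query lists, the catalog is constructed a single time and reused for every molecule.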