Re: [Rdkit-discuss] RDKit workflow in KNIME
Stuart, The PAINS file is available in the RDKit GitHub repository. If that's too complicated to deal with at this early stage, try some of the workflows on myexperiment.org: http://www.myexperiment.org/workflows/1841.html (embedded file) or http://www.myexperiment.org/workflows/4748.html (just in a table). Simon ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] updated SMARTS filters for PAINS
KNIME workflows updated to the new KNIME 3.0.1, the RDKit repository version of the PAINS filters, and the latest RDKit nodes. http://www.myexperiment.org/workflows/1841.html multi-core version: http://www.myexperiment.org/workflows/2485.html Cheers, Simon
Re: [Rdkit-discuss] updated SMARTS filters for PAINS
I have the original Sybyl output from Jonathan. It's not in the most friendly format. All I did was run a few sed commands over it to extract the ID numbers, and also compile some frequency tables vs. PAINS query. I've sent a zip file to you directly.

Simon

On 26/08/2015 15:20, Greg Landrum wrote:
> Thanks for that. Do you have a version that says which of the molecules hit which PAINS? That would really help with the refinement.
> -greg

--
CSIRO Manufacturing Flagship, phone: +61 3 9545-
Bag 10, fax: +61 3 9545-2453
Clayton South VIC 3169, http://www.csiro.au/manufacturing
Australia, mailto:simon.saub...@csiro.au
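The sed step described above (pull out the ID numbers, then tabulate hits per PAINS query) could be sketched in Python roughly as follows. This is only an illustration: the real Sybyl output format is not shown in the thread, so the "filter name, then compound ID" line layout, the filter names paired with IDs, and the IDs themselves are all assumptions made up for the example.

```python
import re
from collections import Counter

# Assumed (hypothetical) line format: "<filter name> <compound ID>", one hit per line.
# The actual Sybyl output is not shown in the thread and very likely differs.
WEHI_ID = re.compile(r"WEHI-\d+")


def extract_ids(text):
    """Return the unique compound IDs found anywhere in the text, sorted."""
    return sorted(set(WEHI_ID.findall(text)))


def hits_per_filter(text):
    """Count how many hit lines name each filter (first column)."""
    counts = Counter()
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and WEHI_ID.fullmatch(parts[1]):
            counts[parts[0]] += 1
    return counts


# Made-up sample data, for illustration only.
sample = """\
filter_a WEHI-0000001
filter_a WEHI-0000002
filter_b WEHI-0000001
"""

ids = extract_ids(sample)
table = hits_per_filter(sample)
print(ids)
print(table.most_common())
```

A per-filter table like this is the kind of output that answers the "which molecules hit which PAINS" question quoted above, once the input parsing is adapted to the real file.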
Re: [Rdkit-discuss] updated SMARTS filters for PAINS
Attached the original list from Jonathan of the 861 SLN hits.

S.

On 26/08/2015 13:08, Greg Landrum wrote:
> On Wed, Aug 26, 2015 at 2:32 AM, Simon Saubern <simon.saub...@csiro.au> wrote:
>> Thanks for doing this Greg. Fixing those SMARTS queries always looked like it would be a real...pain. :-)
>> I've dropped your GitHub file into the KNIME workflow, and the RDKit version of the workflow (using nodes RDKit 2.5.0.201505221301) now hits 770 structures in the WEHI-10k test set.
>
> For what it's worth, I now get 888 matches across the WEHI-10K set when running my Python test script. I am not 100% sure that the KNIME nodes are doing (or can do) the mergeQueryHs step; that's something else for me to follow up on.
>
>> But that includes 19 false positives that weren't being caught by the SLN filters. One filter alone is responsible for 17 of those false positives: anil_di_alk_C(246)
>> old: c:1:c:c(:c:c:c:1-[#8]-[#6;X4])-[#7](-[#6;X4])-[$([#1]),$([#6;X4])]
>> new: c:1:c:c(:c:c:c:1-[#8]-[#6;X4])-[#7;!H0,$([#7]-[#6;X4])]-[#6;X4]
>> An example of one of the false positive structures is the aniline sulfonamide WEHI-18518. I've checked with Jonathan, and the intention of that query is that "... that the nitrogen has a single bond to a carbon that has four atoms bonded to it (i.e. sp3), and that the other atom singly bonded to the nitrogen atom is anything so long as it is either H or an sp3 carbon". So no to sulfonamides, and also some of the acetamide (sp2 C) showing up as hits.
>
> Thanks for pointing that out and providing the clarification about what is expected! I just committed a fix for this: https://github.com/rdkit/rdkit/commit/e2487ffe79c393a6b0e472882bfb6eb66a3bcb8b
>
> As an aside: If you could provide a text file that has the matches found for each pattern in the WEHI-10k test set when you use the SLN version of the PAINS, I would be very happy to use that to further refine these patterns and to incorporate those results into the tests.
-greg WEHI-0002757 WEHI-0003718 WEHI-0004345 WEHI-0005047 WEHI-0005752 WEHI-0006137 WEHI-0006195 WEHI-0006607 WEHI-0006892 WEHI-0007328 WEHI-0007435 WEHI-0007798 WEHI-0008187 WEHI-0008558 WEHI-0009314 WEHI-0011538 WEHI-0011957 WEHI-0012384 WEHI-0012615 WEHI-0012702 WEHI-0012773 WEHI-0012790 WEHI-0012829 WEHI-0012939 WEHI-0013053 WEHI-0013276 WEHI-0013384 WEHI-0013507 WEHI-0013892 WEHI-0013909 WEHI-0014006 WEHI-0014370 WEHI-0014546 WEHI-0014816 WEHI-0014836 WEHI-0014902 WEHI-0014937 WEHI-0015069 WEHI-0015806 WEHI-0015833 WEHI-0016142 WEHI-0016145 WEHI-0016287 WEHI-0016293 WEHI-0016316 WEHI-0016680 WEHI-0016735 WEHI-0016897 WEHI-0016957 WEHI-0016962 WEHI-0016985 WEHI-0017369 WEHI-0017518 WEHI-0017809 WEHI-0017964 WEHI-0018024 WEHI-0018269 WEHI-0018910 WEHI-0018941 WEHI-0018980 WEHI-0019026 WEHI-0019035 WEHI-0019132 WEHI-0019903 WEHI-0020161 WEHI-0020193 WEHI-0020284 WEHI-0020337 WEHI-0020458 WEHI-0020926 WEHI-0020933 WEHI-0020934 WEHI-0020935 WEHI-0020941 WEHI-0021184 WEHI-0023024 WEHI-0023287 WEHI-0023407 WEHI-0023681 WEHI-0023788 WEHI-0023867 WEHI-0023878 WEHI-0023997 WEHI-0024471 WEHI-0024472 WEHI-0024647 WEHI-0024825 WEHI-0024863 WEHI-0024880 WEHI-0024921 WEHI-0025079 WEHI-0025267 WEHI-0025330 WEHI-0025376 WEHI-0025383 WEHI-0025388 WEHI-0025503 WEHI-0025579 WEHI-0025580 WEHI-0025582 WEHI-0025928 WEHI-0026032 WEHI-0026074 WEHI-0026076 WEHI-0026387 WEHI-0026861 WEHI-0026867 WEHI-0027388 WEHI-0027950 WEHI-0028261 WEHI-0028555 WEHI-0029002 WEHI-0029119 WEHI-0029150 WEHI-0029798 WEHI-0030010 WEHI-0030096 WEHI-0030547 WEHI-0030565 WEHI-0030575 WEHI-0030680 WEHI-0030930 WEHI-0030934 WEHI-0030951 WEHI-0030982 WEHI-0031003 WEHI-0031038 WEHI-0031099 WEHI-0031466 WEHI-0031501 WEHI-0031558 WEHI-0031567 WEHI-0031580 WEHI-0031588 WEHI-0031724 WEHI-0031740 WEHI-0031760 WEHI-0031812 WEHI-0031877 WEHI-0031964 WEHI-0032008 WEHI-0032062 WEHI-0032098 WEHI-0032137 WEHI-0032203 WEHI-0032316 WEHI-0032441 WEHI-0032550 WEHI-0032578 WEHI-0032654 WEHI-0032721 WEHI-0032885 WEHI-0032911 
WEHI-0033083 WEHI-0033129 WEHI-0033323 WEHI-000 WEHI-005 WEHI-0033533 WEHI-0033701 WEHI-0033845 WEHI-0033898 WEHI-0033908 WEHI-0033945 WEHI-0034021 WEHI-0034271 WEHI-0034396 WEHI-0034445 WEHI-0034452 WEHI-0034461 WEHI-0034530 WEHI-0034703 WEHI-0034822 WEHI-0034838 WEHI-0034845 WEHI-0035236 WEHI-0035238 WEHI-0035255 WEHI-0035272 WEHI-0035277 WEHI-0035450 WEHI-0035595 WEHI-0035597 WEHI-0035630 WEHI-0035869 WEHI-0035912 WEHI-0036028 WEHI-0036184 WEHI-0036307 WEHI-0036313 WEHI-0036341 WEHI-0036510 WEHI-0036533 WEHI-0036558 WEHI-0036724 WEHI-0036737 WEHI-0036751 WEHI-0036982 WEHI-0037607 WEHI-0037998 WEHI-0038095 WEHI-0038687 WEHI-0038931 WEHI-0039118 WEHI-0039383 WEHI-0039450 WEHI-0039487 WEHI-0039519 WEHI-0039633 WEHI-004 WEHI-0040073 WEHI-0040109 WEHI-0040114 WEHI-0040305 WEHI-0040558 WEHI-0040725 WEHI-0040837 WEHI-0041008 WEHI-0041069 WEHI-0041267 WEHI-0041687
[Rdkit-discuss] updated SMARTS filters for PAINS
Thanks for doing this Greg. Fixing those SMARTS queries always looked like it would be a real...pain.

I've dropped your GitHub file into the KNIME workflow, and the RDKit version of the workflow (using nodes RDKit 2.5.0.201505221301) now hits 770 structures in the WEHI-10k test set. But that includes 19 false positives that weren't being caught by the SLN filters. One filter alone is responsible for 17 of those false positives: anil_di_alk_C(246)

old: c:1:c:c(:c:c:c:1-[#8]-[#6;X4])-[#7](-[#6;X4])-[$([#1]),$([#6;X4])]
new: c:1:c:c(:c:c:c:1-[#8]-[#6;X4])-[#7;!H0,$([#7]-[#6;X4])]-[#6;X4]

An example of one of the false positive structures is the aniline sulfonamide WEHI-18518. I've checked with Jonathan, and the intention of that query is that "... that the nitrogen has a single bond to a carbon that has four atoms bonded to it (i.e. sp3), and that the other atom singly bonded to the nitrogen atom is anything so long as it is either H or an sp3 carbon". So no to sulfonamides, and also some of the acetamide (sp2 C) showing up as hits.

Cheers, Simon
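The false-positive count above comes from comparing two hit lists for the same test set: the SLN reference hits and the hits from the SMARTS version. A minimal sketch of that comparison, using plain Python sets, is below. The function name `compare_hits` and the compound IDs are invented for illustration; the real hit lists are not reproduced here.

```python
def compare_hits(reference, candidate):
    """Split candidate hits into agreed, extra and missed relative to a reference set."""
    reference, candidate = set(reference), set(candidate)
    return {
        "agreed": candidate & reference,   # flagged by both filter sets
        "extra": candidate - reference,    # flagged only by the candidate (false positives)
        "missed": reference - candidate,   # flagged only by the reference
    }


# Made-up IDs, for illustration only.
sln_hits = {"WEHI-0000001", "WEHI-0000002", "WEHI-0000003"}
smarts_hits = {"WEHI-0000002", "WEHI-0000003", "WEHI-0000004"}

result = compare_hits(sln_hits, smarts_hits)
for label in ("agreed", "extra", "missed"):
    print(label, sorted(result[label]))
```

Running the same comparison once per filter, rather than on the pooled hit lists, is what would single out one pattern as the source of most of the extra hits.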
Re: [Rdkit-discuss] PAINS
Nicholas, this is an O(n^2) problem (many-to-many) and difficult to make efficient. It is, however, 'embarrassingly parallel' so you can take advantage of multiple cores. Have a look at how these 2 KNIME workflows implement the PAINS filters with the RDKit nodes in KNIME: http://www.myexperiment.org/workflows/1841.html http://www.myexperiment.org/workflows/2485.html Cheers, Simon
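To illustrate why the many-to-many screen parallelises so easily: each molecule can be checked against the full pattern list independently of every other molecule, so the work splits into one task per molecule. The sketch below is not how the KNIME workflows are built. It uses a plain substring test as a stand-in for a real substructure search, and the names `naive_match`, `screen_one` and `screen_all` as well as the sample data are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def naive_match(smiles, pattern):
    """Stand-in for a substructure search: substring test on the SMILES text only."""
    return pattern in smiles


def screen_one(smiles, patterns, match=naive_match):
    """Return the names of all patterns that hit one molecule."""
    return [name for name, pattern in patterns.items() if match(smiles, pattern)]


def screen_all(molecules, patterns, executor):
    """Screen every molecule against every pattern, one independent task per molecule."""
    ids = list(molecules)
    task = partial(screen_one, patterns=patterns)
    results = executor.map(task, [molecules[mol_id] for mol_id in ids])
    return dict(zip(ids, results))


# Made-up molecules and toy "patterns", for illustration only.
molecules = {"mol-1": "CCO", "mol-2": "c1ccccc1N", "mol-3": "CC(=O)N"}
patterns = {"has_N": "N", "has_carbonyl": "C(=O)"}

with ThreadPoolExecutor(max_workers=2) as pool:
    hits = screen_all(molecules, patterns, pool)
print(hits)
```

Threads are used here only to keep the demo portable. Because `screen_all` accepts any executor with a `map` method, a `concurrent.futures.ProcessPoolExecutor` could be passed instead to spread CPU-bound matching over several cores.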
Re: [Rdkit-discuss] KNIME - Treatment of H in 2.0.0.1061 nodes
So the 2.0.0.1088 nodes now generate 636 matches and only 2 false positives:

WEHI-0054407
SMILES: S=C1N(C(=C(C2=C1CN(C(C2)(C)C)C)C#N)N)C
SMARTS: [#6](-[#1])(-[#1])-[#7]([#6]:[#6])~[#6][#6]=,:[#6]-[#6]~[#6][#7]
Filter: dyes5A(27)

WEHI-0063070
SMILES: N1C(=NC=C(C1=O)C)NN=Cc2ccc(cc2)N(C)C
SMARTS: [#6](-[#1])(-[#1])-[#7](-[#6](-[#1])-[#1])-c:1:c(:c(:c(:c(:c:1-[#1])-[#1])-[#6](-[#1])=[#7]-[#7]-[$([#6](=[#8])-[#6](-[#1])(-[#1])-[#16]-[#6]:[#7]),$([#6](=[#8])-[#6](-[#1])(-[#1])-[!#1]:[!#1]:[#7]),$([#6](=[#8])-[#6]:[#6]-[#8]-[#1]),$([#6]:[#7]),$([#6](-[#1])(-[#1])-[#6](-[#1])-[#8]-[#1])])-[#1])-[#1]
Filter: hzone_anil_di_alk(35)

Cheers, Simon
[Rdkit-discuss] KNIME - Treatment of H in 2.0.0.1061 nodes
Hi Greg, The recent updates to the way explicit hydrogens are handled in the RDKit nodes for KNIME (http://goo.gl/DK0FS) have dramatically improved the number of correct matches that we observe when using the PAINS filters workflow (http://goo.gl/T9mT2). Against the reference set from WEHI, we're now seeing 652 matches (up from 329), but we also now get 231 false positives where we were getting none before. Attached is a tab-separated file containing the mismatches (regID, smiles, smarts, smartsID). The SMARTS strings come from Raj's blog: http://blog.rguha.net/?p=850. Let us know if you need additional info to diagnose what's going on. Cheers, Simon [Attachment: RDKIT2-231.txt]
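A file with the four columns described above (regID, smiles, smarts, smartsID) can be summarised per filter with a few lines of Python, which is one way to see which patterns account for most of the mismatches. This is a sketch under assumptions: the attachment itself is not reproduced here, so the absence of a header row is a guess, and the sample rows and the function name `mismatches_per_filter` are invented for illustration.

```python
import csv
import io
from collections import Counter

# Column order as described in the message; a header row is assumed to be absent.
FIELDS = ["regID", "smiles", "smarts", "smartsID"]


def mismatches_per_filter(handle):
    """Count mismatch rows per smartsID in a tab-separated file."""
    reader = csv.DictReader(
        handle,
        fieldnames=FIELDS,
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # SMILES/SMARTS text should never be treated as quoted
    )
    return Counter(row["smartsID"] for row in reader)


# Made-up rows, for illustration only.
sample = io.StringIO(
    "REG-1\tCCN\t[#6]-[#7]\tfilter_a\n"
    "REG-2\tCCCN\t[#6]-[#7]\tfilter_a\n"
    "REG-3\tCCO\t[#6]-[#8]\tfilter_b\n"
)

counts = mismatches_per_filter(sample)
for smarts_id, n in counts.most_common():
    print(smarts_id, n)
```

For a real file, the `io.StringIO` sample would be replaced by an opened file handle, for example `open(path, newline="")`.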