Re: [Rdkit-discuss] RDKit workflow in KNIME
Stuart,

The PAINS file is available in the RDKit GitHub repository. If that's too complicated to deal with at this early stage, try some of the workflows on myexperiment.org:

http://www.myexperiment.org/workflows/1841.html (embedded file)
http://www.myexperiment.org/workflows/4748.html (just in a table)

Simon

___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] updated SMARTS filters for PAINS
The KNIME workflows have been updated to the new KNIME 3.0.1, the RDKit repository version of the PAINS filters, and the latest RDKit nodes.

http://www.myexperiment.org/workflows/1841.html
Multi-core version: http://www.myexperiment.org/workflows/2485.html

--
Cheers,
Simon
Re: [Rdkit-discuss] updated SMARTS filters for PAINS
I have the original Sybyl output from Jonathan. It's not in the most friendly format. All I did was run a few sed commands over it to extract the ID numbers, and also compile some frequency tables vs. PAINS query. I've sent a zip file to you directly.

Simon

On 26/08/2015 15:20, Greg Landrum wrote:
> Thanks for that. Do you have a version that says which of the molecules hit which PAINS? That would really help with the refinement.
>
> -greg

--
CSIRO Manufacturing Flagship, phone: +61 3 9545-
Bag 10, fax: +61 3 9545-2453
Clayton South VIC 3169, http://www.csiro.au/manufacturing
Australia, mailto:simon.saub...@csiro.au
Re: [Rdkit-discuss] updated SMARTS filters for PAINS
Attached is the original list from Jonathan of the 861 SLN hits.

S.

On 26/08/2015 13:08, Greg Landrum wrote:
> On Wed, Aug 26, 2015 at 2:32 AM, Simon Saubern <mailto:simon.saub...@csiro.au> wrote:
>> Thanks for doing this Greg. Fixing those SMARTS queries always looked like it would be a real...pain. :-)
>>
>> I've dropped your GitHub file into the KNIME workflow, and the RDKit version of the workflow (using nodes RDKit 2.5.0.201505221301) now hits 770 structures in the WEHI-10k test set.
>
> For what it's worth, I now get 888 matches across the WEHI-10k set when running my Python test script. I am not 100% sure that the KNIME nodes are doing (or can do) the mergeQueryHs step; that's something else for me to follow up on.
>
>> But that includes 19 false positives that weren't being caught by the SLN filters. One filter alone is responsible for 17 of those false positives:
>>
>> anil_di_alk_C(246)
>> old: c:1:c:c(:c:c:c:1-[#8]-[#6;X4])-[#7](-[#6;X4])-[$([#1]),$([#6;X4])]
>> new: c:1:c:c(:c:c:c:1-[#8]-[#6;X4])-[#7;!H0,$([#7]-[#6;X4])]-[#6;X4]
>>
>> An example of one of the false positive structures is the aniline sulfonamide WEHI-18518. I've checked with Jonathan, and the intention of that query is "... that the nitrogen has a single bond to a carbon that has four atoms bonded to it (i.e. sp3), and that the other atom singly bonded to the nitrogen atom is anything so long as it is either H or an sp3 carbon". So sulfonamides should not match, and neither should some of the acetamides (sp2 C) that are showing up as hits.
>
> Thanks for pointing that out and providing the clarification about what is expected! I just committed a fix for this:
> https://github.com/rdkit/rdkit/commit/e2487ffe79c393a6b0e472882bfb6eb66a3bcb8b
>
> As an aside: If you could provide a text file that has the matches found for each pattern in the WEHI-10k test set when you use the SLN version of the PAINS, I would be very happy to use that to further refine these patterns and to incorporate those results into the tests.
>
> -greg
[Rdkit-discuss] updated SMARTS filters for PAINS
Thanks for doing this Greg. Fixing those SMARTS queries always looked like it would be a real...pain.

I've dropped your GitHub file into the KNIME workflow, and the RDKit version of the workflow (using nodes RDKit 2.5.0.201505221301) now hits 770 structures in the WEHI-10k test set. But that includes 19 false positives that weren't being caught by the SLN filters. One filter alone is responsible for 17 of those false positives:

anil_di_alk_C(246)
old: c:1:c:c(:c:c:c:1-[#8]-[#6;X4])-[#7](-[#6;X4])-[$([#1]),$([#6;X4])]
new: c:1:c:c(:c:c:c:1-[#8]-[#6;X4])-[#7;!H0,$([#7]-[#6;X4])]-[#6;X4]

An example of one of the false positive structures is the aniline sulfonamide WEHI-18518. I've checked with Jonathan, and the intention of that query is "... that the nitrogen has a single bond to a carbon that has four atoms bonded to it (i.e. sp3), and that the other atom singly bonded to the nitrogen atom is anything so long as it is either H or an sp3 carbon". So sulfonamides should not match, and neither should some of the acetamides (sp2 C) that are showing up as hits.

--
Cheers,
Simon
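To make the difference between the stated intent and the "new" pattern concrete, here is a small toy model in plain Python. It is not SMARTS matching and does not use RDKit: the substituent labels (`aryl`, `sp3_C`, `sulfonyl_S`, and so on) and both function names are made up for illustration. The one SMARTS fact it relies on is that in `[#7;!H0,$([#7]-[#6;X4])]-[#6;X4]` the recursive branch `$([#7]-[#6;X4])` is already satisfied by the same sp3 carbon that the trailing `-[#6;X4]` matches, so the third substituent on the nitrogen ends up unconstrained.

```python
# Toy model only: each substituent on the aniline nitrogen is a short label.
# The labels and function names are hypothetical, chosen for this sketch.
ARYL, SP3_C, H = "aryl", "sp3_C", "H"


def intended_rule(substituents):
    """Aryl + sp3 carbon + (H or another sp3 carbon), per the quoted intent."""
    rest = list(substituents)
    for required in (ARYL, SP3_C):
        if required not in rest:
            return False
        rest.remove(required)
    # Exactly one substituent is left; it must be H or an sp3 carbon.
    return len(rest) == 1 and rest[0] in (H, SP3_C)


def new_pattern_as_written(substituents):
    """Aryl + sp3 carbon; nothing is required of the third substituent,
    because the recursive branch is satisfied by that same sp3 carbon."""
    return ARYL in substituents and SP3_C in substituents


examples = {
    "N,N-dialkyl aniline": [ARYL, SP3_C, SP3_C],
    "N-alkyl aniline": [ARYL, SP3_C, H],
    "N-alkyl sulfonamide": [ARYL, SP3_C, "sulfonyl_S"],
    "N-alkyl acetamide": [ARYL, SP3_C, "sp2_C"],
}

for name, subs in examples.items():
    print(name, intended_rule(subs), new_pattern_as_written(subs))
```

In this model the two readings agree on the two anilines and disagree on the sulfonamide and the acetamide, which is the class of false positive described above.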
Re: [Rdkit-discuss] PAINS
Nicholas,

this is an O(n^2) problem (many-to-many) and difficult to make efficient. It is, however, 'embarrassingly parallel' so you can take advantage of multiple cores.

Have a look at how these two KNIME workflows implement the PAINS filters with the RDKit nodes in KNIME:

http://www.myexperiment.org/workflows/1841.html
http://www.myexperiment.org/workflows/2485.html

--
Cheers,
Simon
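Outside KNIME, the same idea (every molecule against every pattern, with independent chunks of molecules farmed out to workers) might look roughly like the sketch below. It is only an illustration and not how the linked workflows are built: `chunked`, `match_chunk`, `screen` and `substring_match` are hypothetical names, and the matcher is a plain substring test standing in for a real substructure match.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def match_chunk(chunk, patterns, matches):
    """Test every molecule in one chunk against every pattern."""
    return [
        (mol_id, pattern_id)
        for mol_id, mol in chunk
        for pattern_id, pattern in patterns
        if matches(mol, pattern)
    ]


def screen(molecules, patterns, matches, executor, chunk_size=1000):
    """Submit independent chunks to `executor` and collect (molecule, pattern) hits."""
    futures = [
        executor.submit(match_chunk, chunk, patterns, matches)
        for chunk in chunked(molecules, chunk_size)
    ]
    hits = []
    for future in futures:
        hits.extend(future.result())
    return hits


def substring_match(mol, pattern):
    # Placeholder only: plain substring search, NOT substructure matching.
    return pattern in mol


# Made-up inputs, just to exercise the functions.
molecules = [("m1", "CCO"), ("m2", "c1ccccc1N"), ("m3", "CCN")]
patterns = [("p_N", "N"), ("p_O", "O")]

with ThreadPoolExecutor(max_workers=2) as pool:
    hits = screen(molecules, patterns, substring_match, pool, chunk_size=2)
print(sorted(hits))
```

The total work is still molecules times patterns; chunking only spreads it out. The demo uses threads to keep it simple. For CPU-bound matching, `concurrent.futures.ProcessPoolExecutor` offers the same `submit` interface and uses several cores, provided the matcher and the data can be pickled.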
Re: [Rdkit-discuss] KNIME - Treatment of H in 2.0.0.1061 nodes
So the 2.0.0.1088 nodes now generate 636 matches and only 2 false positives:

WEHI-0054407
SMILES: S=C1N(C(=C(C2=C1CN(C(C2)(C)C)C)C#N)N)C
SMARTS: [#6](-[#1])(-[#1])-[#7]([#6]:[#6])~[#6][#6]=,:[#6]-[#6]~[#6][#7]
Filter: dyes5A(27)

WEHI-0063070
SMILES: N1C(=NC=C(C1=O)C)NN=Cc2ccc(cc2)N(C)C
SMARTS: [#6](-[#1])(-[#1])-[#7](-[#6](-[#1])-[#1])-c:1:c(:c(:c(:c(:c:1-[#1])-[#1])-[#6](-[#1])=[#7]-[#7]-[$([#6](=[#8])-[#6](-[#1])(-[#1])-[#16]-[#6]:[#7]),$([#6](=[#8])-[#6](-[#1])(-[#1])-[!#1]:[!#1]:[#7]),$([#6](=[#8])-[#6]:[#6]-[#8]-[#1]),$([#6]:[#7]),$([#6](-[#1])(-[#1])-[#6](-[#1])-[#8]-[#1])])-[#1])-[#1]
Filter: hzone_anil_di_alk(35)

--
Cheers,
Simon
[Rdkit-discuss] KNIME - Treatment of H in 2.0.0.1061 nodes
Hi Greg,

The recent updates to the way explicit hydrogens are handled in the RDKit nodes for KNIME http://goo.gl/DK0FS have dramatically improved the number of correct matches that we observe when using the PAINS filters workflow http://goo.gl/T9mT2 . Against the reference set from WEHI, we're now seeing 652 matches (up from 329), but we also now get 231 false positives where we were getting none before.

Attached is a tab-separated file containing the mismatches (regID, smiles, smarts, smartsID). The SMARTS strings come from Raj's blog: http://blog.rguha.net/?p=850.

Let us know if you need additional info to diagnose what's going on.

--
Cheers,
Simon

[Attachment: RDKIT2-231.txt]
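For anyone wanting to reproduce this kind of tally, a minimal sketch of the comparison is below. It assumes the hits come as tab-separated rows in the column order given above (regID, smiles, smarts, smartsID) and that the reference hits are available as (regID, smartsID) pairs. The function names and the sample rows are invented for illustration; they are not taken from the attached file.

```python
import csv
import io


def read_hits(tsv_text):
    """Parse tab-separated rows (regID, smiles, smarts, smartsID)
    into a set of (regID, smartsID) pairs."""
    reader = csv.reader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return {(row[0], row[3]) for row in reader if row}


def compare(observed, reference):
    """Split observed hits into agreed matches and false positives,
    and list reference hits that were not observed."""
    return {
        "matches": observed & reference,
        "false_positives": observed - reference,
        "missed": reference - observed,
    }


# Made-up rows and reference pairs, only to exercise the functions.
observed_tsv = (
    "ID-1\tCCO\t[#8]\tfilter_A\n"
    "ID-2\tCCN\t[#7]\tfilter_B\n"
)
reference = {("ID-1", "filter_A"), ("ID-3", "filter_A")}

result = compare(read_hits(observed_tsv), reference)
for key, pairs in result.items():
    print(key, len(pairs), sorted(pairs))
```

Keying on the (regID, smartsID) pair rather than on regID alone keeps a molecule that is hit by the wrong filter from being counted as a correct match.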