Re: [Rdkit-discuss] drawing code take 3
Dimitri,

Just for the record, you responded directly to a quote of mine. Now you say that your objections were to using numbers that appeared in a different quote by somebody else.

Personally, I think those numbers are indeed applicable. Nobody would think of doing this on a single CPU, so I'm not sure why you think somebody was suggesting that.

But either way, I move that we end this thread. The issues and possible solutions are out on the table, and all of us can now, as they say, "pay our money and take our choice."

Best,
-P.

On Dec 29, 2016 5:06 PM, "Dimitri Maziuk" wrote:
> My objection was to using numbers like
>
> > ... for 92877507 structures (current size PubChem Compound):
> > 1s per structure = 1074 days (~3 years)
> > 100 ms per structure = 107 days
> > 1ms per structure = 25 hours
>
> as if they actually mean something.
>
> I responded that *if* the requirement is to generate all 100M depictions, making the code faster on a single CPU core is rarely the cost-effective solution. That was a purely academic "if" because I don't believe that regenerating all the depictions at once on a regular basis is a realistic use case, either.

--
Check out the vibrant tech community on one of the world's most engaging tech sites, SlashDot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] drawing code take 3
On 12/29/2016 02:35 PM, Peter S. Shenkin wrote:
> Dimitri,
>
> You were the one who suggested that all the structural depictions be generated.
>
> I, in contrast, suggested that only the ones users need to look at need be generated. I further suggested that these would only constitute a small fraction of those in a large DB.

My objection was to using numbers like

> ... for 92877507 structures (current size PubChem Compound):
> 1s per structure = 1074 days (~3 years)
> 100 ms per structure = 107 days
> 1ms per structure = 25 hours

as if they actually mean something.

I responded that *if* the requirement is to generate all 100M depictions, making the code faster on a single CPU core is rarely the cost-effective solution. That was a purely academic "if" because I don't believe that regenerating all the depictions at once on a regular basis is a realistic use case, either.

--
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
Re: [Rdkit-discuss] drawing code take 3
Dimitri,

You were the one who suggested that all the structural depictions be generated.

I, in contrast, suggested that only the ones users need to look at need be generated. I further suggested that these would only constitute a small fraction of those in a large DB.

-P.

On Thu, Dec 29, 2016 at 2:49 PM, Dimitri Maziuk wrote:
> :)
>
> Now I have a mental picture of a phd student tied to a chair with his eyes taped open, forced to look at a billion depictions for 10ms each.
>
> Pictures are only useful if you have a human looking at them. Looking is only useful if you do it long enough for the brain to process it. The whole "what if we need a billion depictions all at once" implies that you have a billion users looking at them all at once. If you don't, then rapid response is a very interesting academic exercise but its practical usefulness might be somewhat questionable.
Re: [Rdkit-discuss] drawing code take 3
On 12/29/2016 12:43 PM, Peter S. Shenkin wrote:
> Of the billion structures, only a fraction will ever be visualized, so a memoization strategy sounds reasonable, which in turn implies that you want rapid response when an unstored structure has to be generated.

:)

Now I have a mental picture of a PhD student tied to a chair with his eyes taped open, forced to look at a billion depictions for 10ms each.

Pictures are only useful if you have a human looking at them. Looking is only useful if you do it long enough for the brain to process it. The whole "what if we need a billion depictions all at once" implies that you have a billion users looking at them all at once. If you don't, then rapid response is a very interesting academic exercise but its practical usefulness might be somewhat questionable.

--
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
Re: [Rdkit-discuss] drawing code take 3
On 12/29/2016 12:43 PM, Peter S. Shenkin wrote:
> Look, it all boils down to (CPU) time, and time is money.

It's very hard to say how much a single CPU core actually costs 'cause they don't make them anymore. Similarly, our small molecule SVGs average around 4K each; storing 10M of those will require about 40GB, and they don't make disks that small anymore either. A 64GB USB stick is twenty bucks.

I've no idea how much I actually cost our funding agency per hour, nor how many hours it would take me to even figure out if a piece of code of any kind of complexity can be optimized. But I can guarantee you that a) it's much more than $20, and b) hiring a competent programmer will cost you more than buying a "better computer" and is not guaranteed to result in any appreciable speed-up.

--
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
Re: [Rdkit-discuss] drawing code take 3
Look, it all boils down to (CPU) time, and time is money.

Generating a billion depictions on the cloud will cost you the use of the machines. Increasing the depiction speed by a factor of 10 decreases the cost by a factor of 10, to a pretty good approximation.

Storage is also money, so it doesn't always make sense to store all N structures up front, if N is large. In some contexts, it makes more sense to generate the 2d reps as needed, rather than store them all in advance. One size doesn't fit all.

An intermediate strategy would be to generate the depictions on the fly and memoize them for some time or up to some maximum storage limit. Of the billion structures, only a fraction will ever be visualized, so a memoization strategy sounds reasonable, which in turn implies that you want rapid response when an unstored structure has to be generated.

-P.

On Thu, Dec 29, 2016 at 12:04 PM, Dimitri Maziuk wrote:
> On 2016-12-29 07:19, John M wrote:
>
> > For why you need sub-second depiction consider these times for 92877507 structures (current size PubChem Compound):
> >
> > 1s per structure = 1074 days (~3 years)
> > 100 ms per structure = 107 days
> > 1ms per structure = 25 hours
>
> The Dilbert answer is buy a better computer. The serious answer is if you run millions of jobs sequentially on a single core, your problem is not how long a single job takes: no matter how fast you can make it, it will only scale linearly. There will be 1B compounds in PubChem two years from now and your painstakingly crafted 1ms/structure code will still take 3 years, the only difference is you get garbage depictions.
>
> Condor can be persuaded to fire up 92877507 EC2 VMs and run all of those in parallel -- provided you're willing to pay Amazon for it of course. If you can code the algorithm into GPGPU/SIMD parallel flow, you can probably push it into an FPGA and then get that baked into ASICs in China -- they'll give you a discount if you order more than ten thousand. That gets you a $20 USB dongle that will run them at umpteen K/second. And so on.
>
> If you don't want quality depictions because bad ones will work just fine for your needs, that's a perfectly good argument. If you don't want them because generating 10M sequentially on a single core will take a long time, that's a BS argument.
>
> Dima
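The memoization strategy described in this message can be sketched roughly as follows. This is an illustration only, not code from the thread: `render_depiction` is a hypothetical stand-in for whatever layout and drawing call is actually used, and the cache size is an arbitrary assumption. Python's `functools.lru_cache` covers the "up to some maximum storage limit" variant; expiring entries after some time would need extra bookkeeping that is not shown here.

```python
from functools import lru_cache


def render_depiction(smiles: str) -> str:
    """Hypothetical stand-in for an expensive 2D layout and drawing call."""
    return f"<svg><!-- depiction of {smiles} --></svg>"


# maxsize is an assumed bound on the number of stored depictions; when the
# cache is full, the least recently used entry is discarded.
@lru_cache(maxsize=100_000)
def cached_depiction(smiles: str) -> str:
    """Generate a depiction on first request, then serve it from memory."""
    return render_depiction(smiles)


cached_depiction("CCO")  # first request: generated on the fly
cached_depiction("CCO")  # repeat request: served from the cache
print(cached_depiction.cache_info())
```

With a cache like this, only the first request for a structure pays the layout cost, which is why the speed of generating an unstored structure still matters.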
Re: [Rdkit-discuss] drawing code take 3
On 2016-12-29 07:19, John M wrote:
> For why you need sub-second depiction consider these times for 92877507 structures (current size PubChem Compound):
>
> 1s per structure = 1074 days (~3 years)
> 100 ms per structure = 107 days
> 1ms per structure = 25 hours

The Dilbert answer is buy a better computer. The serious answer is if you run millions of jobs sequentially on a single core, your problem is not how long a single job takes: no matter how fast you can make it, it will only scale linearly. There will be 1B compounds in PubChem two years from now and your painstakingly crafted 1ms/structure code will still take 3 years, the only difference is you get garbage depictions.

Condor can be persuaded to fire up 92877507 EC2 VMs and run all of those in parallel -- provided you're willing to pay Amazon for it of course. If you can code the algorithm into GPGPU/SIMD parallel flow, you can probably push it into an FPGA and then get that baked into ASICs in China -- they'll give you a discount if you order more than ten thousand. That gets you a $20 USB dongle that will run them at umpteen K/second. And so on.

If you don't want quality depictions because bad ones will work just fine for your needs, that's a perfectly good argument. If you don't want them because generating 10M sequentially on a single core will take a long time, that's a BS argument.

Dima
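To put rough numbers on the parallelism argument, here is a small sketch. It assumes the jobs are fully independent and split evenly, with no scheduling or I/O overhead, which real clusters will not achieve; the function name `parallel_hours` is made up for illustration.

```python
import math


def parallel_hours(n_structures: int, seconds_per_structure: float, n_workers: int) -> float:
    """Ideal wall-clock hours when independent jobs are split evenly over n_workers."""
    jobs_per_worker = math.ceil(n_structures / n_workers)
    return jobs_per_worker * seconds_per_structure / 3600


N = 92_877_507  # PubChem Compound size quoted above

for workers in (1, 100, 1000):
    hours = parallel_hours(N, 1.0, workers)
    print(f"1 s per structure on {workers:>4} workers: {hours:10.1f} hours")
```

Under these idealized assumptions, 1 s per structure on 1000 workers finishes in about the same wall-clock time as 1 ms per structure on a single core.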
Re: [Rdkit-discuss] drawing code take 3
As a thought, it might make sense to consider a distinction between publication-quality images and "pretty good" images. The latter require speed and clarity, whereas a number of additional niceties (I hate to use the word "elegance") would be highly desirable for the former, even at the expense of speed.

For example, for publication-quality images, one might try to adhere more closely to the IUPAC recommendations for 2D depictions.

-P.

On Thu, Dec 29, 2016 at 8:53 AM, Brian Kelley wrote:
> Perhaps we could train a ML algorithm to know which algorithm to use when :)
>
> Cheers,
> Brian
Re: [Rdkit-discuss] drawing code take 3
Perhaps we could train a ML algorithm to know which algorithm to use when :)

Cheers,
Brian

On Thu, Dec 29, 2016 at 8:19 AM, John M wrote:
> Hi Peter,
>
> I uploaded the benchmark set here: https://github.com/johnmay/layout-benchmark and tested it on their web service a few weeks ago. IIRC it did seem quite slow, maybe fine for ahead of time generation but not usable for on demand depiction. It does produce very nice depictions but I think the right way to go is described by Alex Clark (2006 I think?) and used by MOE. Essentially use optimisation for certain parts/classes of structure but not everything.
>
> Unfortunately no comparison to MOE/ChemDraw in the paper.
>
> For why you need sub-second depiction consider these times for 92877507 structures (current size PubChem Compound):
>
> 1s per structure = 1074 days (~3 years)
> 100 ms per structure = 107 days
> 1ms per structure = 25 hours
>
> John
Re: [Rdkit-discuss] drawing code take 3
Hi Peter,

I uploaded the benchmark set here: https://github.com/johnmay/layout-benchmark and tested it on their web service a few weeks ago. IIRC it did seem quite slow, maybe fine for ahead of time generation but not usable for on demand depiction. It does produce very nice depictions but I think the right way to go is described by Alex Clark (2006 I think?) and used by MOE. Essentially use optimisation for certain parts/classes of structure but not everything.

Unfortunately no comparison to MOE/ChemDraw in the paper.

For why you need sub-second depiction consider these times for 92877507 structures (current size PubChem Compound):

1s per structure = 1074 days (~3 years)
100 ms per structure = 107 days
1ms per structure = 25 hours

John

On 15 December 2016 at 23:12, Peter S. Shenkin wrote:
> Yes, of course, storing the images is an alternative.
>
> -P.
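The timing figures in this message follow from straightforward multiplication. A minimal sketch of that arithmetic, assuming strictly sequential processing on one core; the function name `sequential_days` is made up for illustration:

```python
N_STRUCTURES = 92_877_507  # PubChem Compound size quoted in the message
SECONDS_PER_DAY = 86_400


def sequential_days(n_structures: int, seconds_per_structure: float) -> float:
    """Days needed to depict n_structures one after another on a single core."""
    return n_structures * seconds_per_structure / SECONDS_PER_DAY


for label, seconds in [("1 s", 1.0), ("100 ms", 0.1), ("1 ms", 0.001)]:
    days = sequential_days(N_STRUCTURES, seconds)
    print(f"{label:>6} per structure: {days:9.2f} days ({days * 24:9.1f} hours)")
```

Computed this way the values come out at roughly 1075 days, 107.5 days, and 25.8 hours, so the figures quoted in the message appear to have been rounded down.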
Re: [Rdkit-discuss] drawing code take 3
Yes, of course, storing the images is an alternative.

-P.

On Thu, Dec 15, 2016 at 5:46 PM, Dimitri Maziuk wrote:
> Maybe not, but depending how the browser lays out the grid, it may take 5 seconds anyway.
>
> My recommendation for that use case would be to pre-generate the images and store the URLs in that database. Which is what we do here.
>
> ;)
Re: [Rdkit-discuss] drawing code take 3
On 12/15/2016 04:23 PM, Peter S. Shenkin wrote:
> Obviously, it doesn't matter if you're rendering just a few structures, but in a scenario where you might be downloading a hundred SMILES from a DB and displaying them on a grid in a browser, computing the 2D depictions on the fly, waiting 5 sec for a page refresh wouldn't be great.

Maybe not, but depending on how the browser lays out the grid, it may take 5 seconds anyway.

My recommendation for that use case would be to pre-generate the images and store the URLs in that database. Which is what we do here.

;)

--
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
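One way to picture the "pre-generate the images and store the URLs" recommendation is the sketch below. It is not BMRB's actual setup: the table layout, the `image_urls` helper, and the example.org URLs are all assumptions made for illustration, using an in-memory SQLite database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
conn.execute(
    "CREATE TABLE depiction (smiles TEXT PRIMARY KEY, image_url TEXT NOT NULL)"
)

# Pre-generation step: images are rendered ahead of time and only their URLs
# are stored. The URLs below are hypothetical placeholders.
conn.executemany(
    "INSERT INTO depiction (smiles, image_url) VALUES (?, ?)",
    [
        ("CCO", "https://example.org/img/ethanol.svg"),
        ("c1ccccc1", "https://example.org/img/benzene.svg"),
    ],
)


def image_urls(smiles_list):
    """Return {smiles: url} for every requested structure with a stored image."""
    if not smiles_list:
        return {}
    placeholders = ",".join("?" for _ in smiles_list)
    rows = conn.execute(
        f"SELECT smiles, image_url FROM depiction WHERE smiles IN ({placeholders})",
        list(smiles_list),
    )
    return dict(rows)


print(image_urls(["CCO", "c1ccccc1"]))
```

The page that displays the grid then only needs one lookup per request and leaves image loading to the browser; structures without a stored image are simply absent from the result.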
Re: [Rdkit-discuss] drawing code take 3
Well, Figure 10 shows that a molecule with about 25 heavy atoms takes about 50 ms to optimize. In John Mayfield's UGM talk, it looks like CDK is taking an average of 1 ms for "easy" structures and 56 ms for the hard ones, some of which are depicted and have far more than 25 heavy atoms.

We don't know the details of the two data sets, so a head-to-head comparison is tough, but intuitively, 20 structures/sec sounds slow. Having said that, it's reasonable to pay a price in speed for additional quality and robustness.

Obviously, it doesn't matter if you're rendering just a few structures, but in a scenario where you might be downloading a hundred SMILES from a DB and displaying them on a grid in a browser, computing the 2D depictions on the fly, waiting 5 sec for a page refresh wouldn't be great.

-P.

On Thu, Dec 15, 2016 at 4:22 PM, Dimitri Maziuk wrote:
> On 12/15/2016 02:53 PM, Peter S. Shenkin wrote:
> > Looks good, but maybe too slow for production use... (?)
>
> I wonder what kind of production use would require sub-second wall clock time for this.
Re: [Rdkit-discuss] drawing code take 3
On 12/15/2016 02:53 PM, Peter S. Shenkin wrote:
> Looks good, but maybe too slow for production use... (?)

I wonder what kind of production use would require sub-second wall clock time for this.

--
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
Re: [Rdkit-discuss] drawing code take 3
Looks good, but maybe too slow for production use... (?)

-P.

On Thu, Dec 15, 2016 at 3:38 PM, Chris Swain wrote:
> At first glance this looks an interesting approach.
>
> Simulation-Based Algorithm for Two-Dimensional Chemical Structure Diagram Generation of Complex Molecules and Ligand–Protein Interactions
> DOI: http://dx.doi.org/10.1021/acs.jcim.6b00391
Re: [Rdkit-discuss] drawing code take 3
At first glance this looks an interesting approach.

Simulation-Based Algorithm for Two-Dimensional Chemical Structure Diagram Generation of Complex Molecules and Ligand–Protein Interactions
DOI: http://dx.doi.org/10.1021/acs.jcim.6b00391

> On 27 Sep 2016, at 05:38, rdkit-discuss-requ...@lists.sourceforge.net wrote:
>
> 2D drawing code is tough. The 90/10 rule applies: the last 10% of correctness takes 90% of the effort.
>
> [...]
>
> I think for the present purposes what we need is something correct, robust and legible, and of course the example shown does not exhibit that.
>
> [...]
>
> -P.
Re: [Rdkit-discuss] drawing code take 3
On 2016-09-26 18:19, Peter S. Shenkin wrote:
> 2D drawing code is tough. The 90/10 rule applies: the last 10% of correctness takes 90% of the effort.
>
> [...]
>
> I think for the present purposes what we need is something correct, robust and legible, and of course the example shown does not exhibit that. (But I don't know what the starting SMILES is, so I don't know whether the 7-bonded C is due to a bad SMILES, in which case all bets are off.)

That was actually a "kudos to RDKit" post.

I have an application where I need a drawing with all Hs and all atom labels, and a molecule description in mmCIF(-ish) format. I use RDKit for the latter because of OpenBabel's stereochemistry "model", and OpenBabel for the drawings because 90% of the time it generates better layouts.

The comment is that RDKit's layout algorithm appears to be more stable: for this molecule OB generated a "better" picture from the original SDF downloaded from PubChem, and that complete mess when we re-ordered the atoms. RDKit generated the same picture in both cases, only one is a mirror image of the other.

Dima
Re: [Rdkit-discuss] drawing code take 3
2D drawing code is tough. The 90/10 rule applies: the last 10% of correctness takes 90% of the effort.

I like Dmitri Agrafiotis's method, but IIRC it's patented; also, though it's good for rough work, it doesn't produce "beautiful" structural diagrams.

Some of the 2D drawing methods that do produce "pretty" pictures have a large number of templates built in that match the most common (and even somewhat uncommon) motifs, and they fall down when they hit something they can't get a close enough match for. And then, the IUPAC has a whole list of "desirable" features in 2D diagrams (as in, "Don't show it this way, but rather show it that way."). So even if you produce what might appear to be an acceptable drawing, it might not match the IUPAC list of desirables.

I think for the present purposes what we need is something correct, robust and legible, and of course the example shown does not exhibit that. (But I don't know what the starting SMILES is, so I don't know whether the 7-bonded C is due to a bad SMILES, in which case all bets are off.)

In addition, I think some discussion earlier indicated that the RDKit 2D structures look much worse when H's are included.

I actually wrote a code one time (while at Schrödinger) to give a "badness" score to 2D structures. When our 2D depiction development was in progress, we created 2D SD files for many thousands of structures. I could put these through the program and sort with the worst on top. That allowed the most severe problems to be identified more quickly than, say, looking at thousands of 2D diagrams. The program looked at three things: the number of bonds that crossed, the number of atoms that were too close together, and large disparity of bond lengths within the same molecule. (The checking code didn't deal with labels.)

Writing the checker was a fun project, but I'm glad I didn't have to write the 2D depiction code. As Mark Twain said, "Improving oneself is good. Improving others is better – and easier."

-P.
On Mon, Sep 26, 2016 at 5:54 PM, Dimitri Maziuk wrote:
> On 09/26/2016 04:42 PM, Peter S. Shenkin wrote:
> > Also, the C attached to H44 has an extra H (its own or someone else's?)
> > superimposed upon it.
>
> I wonder if 2D drawing code should really work the same way as the 3D
> conformer generation: generate a bunch of candidate layouts and pick the
> one(s) with least clashes/overlaps.
>
> --
> Dimitri Maziuk
> Programmer/sysadmin
> BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
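For readers who want something concrete, here is a minimal sketch of the kind of "badness" score described in the message above. It is not the original checker, whose details are not given. It assumes a 2D layout is a list of (x, y) atom coordinates plus a list of bonds as index pairs. The function names, the `close_frac` threshold, the way the three counts are weighted into one total, and the ratio used for bond-length disparity are all illustrative assumptions.

```python
import math
from itertools import combinations


def _orient(p, q, r):
    """Sign (+1, 0, -1) of the 2D cross product (q - p) x (r - p)."""
    v = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (v > 0) - (v < 0)


def bonds_cross(coords, b1, b2):
    """True if two bonds that share no atom properly intersect."""
    if set(b1) & set(b2):
        return False  # bonds meeting at a common atom do not count
    a, b = coords[b1[0]], coords[b1[1]]
    c, d = coords[b2[0]], coords[b2[1]]
    return (_orient(a, b, c) * _orient(a, b, d) < 0
            and _orient(c, d, a) * _orient(c, d, b) < 0)


def badness(coords, bonds, close_frac=0.5, weights=(1.0, 1.0, 1.0)):
    """Score one layout; larger means worse. All thresholds are assumptions."""
    lengths = [math.dist(coords[i], coords[j]) for i, j in bonds]
    mean_len = sum(lengths) / len(lengths) if lengths else 1.0

    # 1. Number of bond pairs that cross.
    crossings = sum(bonds_cross(coords, b1, b2)
                    for b1, b2 in combinations(bonds, 2))
    # 2. Number of atom pairs closer than a fraction of the mean bond length.
    close_atoms = sum(math.dist(p, q) < close_frac * mean_len
                      for p, q in combinations(coords, 2))
    # 3. Disparity of bond lengths: longest/shortest ratio minus one.
    if not lengths:
        disparity = 0.0
    elif min(lengths) == 0.0:
        disparity = math.inf
    else:
        disparity = max(lengths) / min(lengths) - 1.0

    w_cross, w_close, w_len = weights
    total = w_cross * crossings + w_close * close_atoms + w_len * disparity
    return {"crossings": crossings, "close_atoms": close_atoms,
            "length_disparity": disparity, "total": total}


def rank_worst_first(structures):
    """structures maps a name to (coords, bonds); returns names, worst first."""
    return sorted(structures,
                  key=lambda name: badness(*structures[name])["total"],
                  reverse=True)


# A four-membered ring drawn cleanly and drawn as a "bow tie".
ring_bonds = [(0, 1), (1, 2), (2, 3), (3, 0)]
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
bowtie = [(0, 0), (1, 1), (1, 0), (0, 1)]
print(rank_worst_first({"square": (square, ring_bonds),
                        "bowtie": (bowtie, ring_bonds)}))
```

Sorting with the worst on top, as in `rank_worst_first`, is what lets the most severe problems surface without paging through thousands of diagrams. Like the checker described above, this sketch ignores atom labels.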
Re: [Rdkit-discuss] drawing code take 3
On 09/26/2016 04:42 PM, Peter S. Shenkin wrote:
> Also, the C attached to H44 has an extra H (its own or someone else's?)
> superimposed upon it.

I wonder if 2D drawing code should really work the same way as the 3D conformer generation: generate a bunch of candidate layouts and pick the one(s) with least clashes/overlaps.

--
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
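The "generate candidates, keep the least-clashing one" idea can be sketched in a few lines. This is only a toy under stated assumptions, not how RDKit or OpenBabel lay out molecules. The candidate generator below is hypothetical: it draws a simple unbranched chain whose bonds turn by plus or minus 60 degrees at random. The `min_dist` clash threshold and all function names are likewise made up for illustration.

```python
import math
import random
from itertools import combinations


def random_chain_layout(n_atoms, rng, bond_length=1.0):
    """Toy candidate generator: a chain whose bonds turn +/-60 degrees at random."""
    coords = [(0.0, 0.0)]
    angle = 0.0
    for _ in range(n_atoms - 1):
        angle += rng.choice((-1, 1)) * math.pi / 3
        x, y = coords[-1]
        coords.append((x + bond_length * math.cos(angle),
                       y + bond_length * math.sin(angle)))
    return coords


def count_clashes(coords, min_dist=0.5):
    """Number of atom pairs drawn closer together than min_dist."""
    return sum(math.dist(p, q) < min_dist for p, q in combinations(coords, 2))


def pick_best_layout(candidates, score=count_clashes):
    """Return the candidate with the lowest score (first one wins ties)."""
    return min(candidates, key=score)


# Generate a bunch of candidates from a seeded RNG and keep the cleanest one.
rng = random.Random(0)
candidates = [random_chain_layout(8, rng) for _ in range(20)]
best = pick_best_layout(candidates)
print(count_clashes(best), "clashes in the selected layout")
```

Because the scoring function is passed in as a parameter, a fuller score (for example one that also counts crossing bonds) could be swapped in without touching the selection step.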
Re: [Rdkit-discuss] drawing code take 3
Also, the C attached to H44 has an extra H (its own or someone else's?) superimposed upon it.

-P.

On Mon, Sep 26, 2016 at 5:38 PM, Dimitri Maziuk wrote:
>
> On the plus side, when drawing PubChem CID 5057 from a 3D SDF before and
> after our canonicalization, RDKit draws a mirror image, but otherwise
> the same 2D structure. OB's "after" version is attached: enjoy the
> 7-bond carbon in the ring.
>
> ;)
> --
> Dimitri Maziuk
> Programmer/sysadmin
> BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
[Rdkit-discuss] drawing code take 3
On the plus side, when drawing PubChem CID 5057 from a 3D SDF before and after our canonicalization, RDKit draws a mirror image, but otherwise the same 2D structure. OB's "after" version is attached: enjoy the 7-bond carbon in the ring.

;)
--
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu