Re: [Rdkit-discuss] drawing code take 3

2016-12-29 Thread Peter S. Shenkin
Dimitri,

Just for the record, you responded directly to a quote of mine.

Now you say that your objections were to using numbers that appeared in a
different quote by somebody else.

Personally, I think those numbers are indeed applicable. Nobody would think
of doing this on a single CPU, so I'm not sure why you think somebody was
suggesting that.

But either way, I move that we end this thread. The issues and possible
solutions are out on the table, and all of us can now, as they say, "pay
our money and take our choice."

Best,
-P.

On Dec 29, 2016 5:06 PM, "Dimitri Maziuk"  wrote:

> On 12/29/2016 02:35 PM, Peter S. Shenkin wrote:
> > Dimitri,
> >
> > You were the one who suggested that all the structural depictions be
> > generated.
> >
> > I, in contrast, suggested that only the ones users need to look at need be
> > generated. I further suggested that these would only constitute a small
> > fraction of those in a large DB.
>
> My objection was to using numbers like
>
> > ... for 92877507
> > structures (current size PubChem Compound):
> > 1s per structure = 1074 days (~3 years)
> > 100 ms per structure = 107 days
> > 1ms per structure = 25 hours
>
> as if they actually mean something.
>
> I responded that *if* the requirement is to generate all 100M
> depictions, making the code faster on a single CPU core is rarely the
> cost-effective solution. That was a purely academic "if" because I don't
> believe that regenerating all the depictions at once on a regular basis
> is a realistic use case, either.
>
> --
> Dimitri Maziuk
> Programmer/sysadmin
> BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
--
Check out the vibrant tech community on one of the world's most
engaging tech sites, SlashDot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] drawing code take 3

2016-12-29 Thread Dimitri Maziuk
On 12/29/2016 02:35 PM, Peter S. Shenkin wrote:
> Dimitri,
> 
> You were the one who suggested that all the structural depictions be
> generated.
> 
> I, in contrast, suggested that only the ones users need to look at need be
> generated. I further suggested that these would only constitute a small
> fraction of those in a large DB.

My objection was to using numbers like

> ... for 92877507
> structures (current size PubChem Compound):
> 1s per structure = 1074 days (~3 years)
> 100 ms per structure = 107 days
> 1ms per structure = 25 hours

as if they actually mean something.

I responded that *if* the requirement is to generate all 100M
depictions, making the code faster on a single CPU core is rarely the
cost-effective solution. That was a purely academic "if" because I don't
believe that regenerating all the depictions at once on a regular basis
is a realistic use case, either.

-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu





Re: [Rdkit-discuss] drawing code take 3

2016-12-29 Thread Peter S. Shenkin
Dimitri,

You were the one who suggested that all the structural depictions be
generated.

I, in contrast, suggested that only the ones users need to look at need be
generated. I further suggested that these would only constitute a small
fraction of those in a large DB.

-P.

On Thu, Dec 29, 2016 at 2:49 PM, Dimitri Maziuk 
wrote:

> On 12/29/2016 12:43 PM, Peter S. Shenkin wrote:
>
> > Of the
> > billion structures, only a fraction will ever be visualized, so a
> > memoization strategy sounds reasonable, which in turn implies that you want
> > rapid response when an unstored structure has to be generated.
>
> :)
>
> Now I have a mental picture of a phd student tied to a chair with his
> eyes taped open, forced to look at a billion depictions for 10ms each.
>
> Pictures are only useful if you have a human looking at them. Looking is
> only useful if you do it long enough for the brain to process it. The
> whole "what if we need a billion depictions all at once" implies that
> you have a billion users looking at them all at once. If you don't, then
> rapid response is a very interesting academic exercise but its practical
> usefulness might be somewhat questionable.
>
> --
> Dimitri Maziuk
> Programmer/sysadmin
> BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu


Re: [Rdkit-discuss] drawing code take 3

2016-12-29 Thread Dimitri Maziuk
On 12/29/2016 12:43 PM, Peter S. Shenkin wrote:

> Of the
> billion structures, only a fraction will ever be visualized, so a
> memoization strategy sounds reasonable, which in turn implies that you want
> rapid response when an unstored structure has to be generated.

:)

Now I have a mental picture of a phd student tied to a chair with his
eyes taped open, forced to look at a billion depictions for 10ms each.

Pictures are only useful if you have a human looking at them. Looking is
only useful if you do it long enough for the brain to process it. The
whole "what if we need a billion depictions all at once" implies that
you have a billion users looking at them all at once. If you don't, then
rapid response is a very interesting academic exercise but its practical
usefulness might be somewhat questionable.

-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu





Re: [Rdkit-discuss] drawing code take 3

2016-12-29 Thread Dimitri Maziuk
On 12/29/2016 12:43 PM, Peter S. Shenkin wrote:
> Look, it all boils down to (CPU) time, and time is money.

It's very hard to say how much a single cpu core actually costs 'cause
they don't make them anymore. Similarly, our small molecule SVGs average
around 4K, so storing 10M of those will require about 40GB, and they
don't make disks that small anymore either. A 64GB USB stick is twenty bucks.
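
The storage estimate above is simple enough to check in a few lines. This is only a sketch: it takes "4K" as 4,000 bytes and "GB" as 10^9 bytes, and it ignores filesystem overhead such as block size, all of which are assumptions rather than figures from the message.

```python
AVG_SVG_BYTES = 4_000       # "around 4K" per depiction (assumed to mean 4,000 bytes)
N_DEPICTIONS = 10_000_000   # "10M of those"


def storage_gb(n_files: int, avg_bytes: int) -> float:
    """Raw payload size in decimal gigabytes, ignoring filesystem overhead."""
    return n_files * avg_bytes / 1e9


print(f"{storage_gb(N_DEPICTIONS, AVG_SVG_BYTES):.0f} GB for 10M depictions")
```

The same function can be called with any other collection size to see how the estimate scales.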

I've no idea how much I actually cost our funding agency per hour, nor
how many hours it would take me to even figure out if a piece of code of
any kind of complexity can be optimized. But I can guarantee you that a)
it's much more than $20, and b) hiring a competent programmer will cost
you more than buying a "better computer" and is not guaranteed to result
in any appreciable speed-up.

-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu





Re: [Rdkit-discuss] drawing code take 3

2016-12-29 Thread Peter S. Shenkin
Look, it all boils down to (CPU) time, and time is money. Generating a
billion depictions on the cloud will cost you the use of the machines.
Increasing the depiction speed by a factor of 10 decreases the cost by a
factor of 10, to a pretty good approximation. Storage is also money, so it
doesn't always make sense to store all N structures up front, if N is
large. In some contexts, it makes more sense to generate the 2d reps as
needed, rather than store them all in advance. One size doesn't fit all.
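
The cost argument in the paragraph above can be put into a rough model. This is a sketch under stated assumptions: perfect parallel scaling with no start-up or I/O overhead, and a purely hypothetical price of 0.05 USD per core-hour. In this model the bill depends on total core-hours (so per-structure speed), while the number of workers only changes the wall-clock time.

```python
def batch_cost_and_wall_clock(n_structures, seconds_per_structure,
                              workers, usd_per_core_hour):
    """Return (cost in USD, wall-clock hours) for an idealised batch run."""
    core_hours = n_structures * seconds_per_structure / 3600.0
    cost = core_hours * usd_per_core_hour      # independent of worker count
    wall_clock_hours = core_hours / workers    # assumes perfect scaling
    return cost, wall_clock_hours


PRICE = 0.05  # hypothetical USD per core-hour, for illustration only

for seconds in (0.010, 0.001):
    cost, hours = batch_cost_and_wall_clock(1_000_000_000, seconds, 1000, PRICE)
    print(f"{seconds * 1000:4.0f} ms/structure: ${cost:8.2f}, {hours:6.2f} h on 1000 workers")
```

Real cloud pricing, instance granularity and scheduling overhead would all shift these numbers; the sketch only shows the proportionality.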

An intermediate strategy would be to generate the depictions on the fly and
memoize them for some time or up to some maximum storage limit. Of the
billion structures, only a fraction will ever be visualized, so a
memoization strategy sounds reasonable, which in turn implies that you want
rapid response when an unstored structure has to be generated.

-P.
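
A minimal sketch of the memoization strategy described above, using a size-bounded LRU cache from the Python standard library. `render_depiction` is a hypothetical stand-in for whatever layout and drawing call is actually used, and the `maxsize` value is arbitrary. The "for some time" (expiry) variant is not shown.

```python
from functools import lru_cache


def render_depiction(smiles: str) -> str:
    """Hypothetical stand-in for an expensive 2D layout + drawing call."""
    return f"<svg><!-- depiction of {smiles} --></svg>"


@lru_cache(maxsize=10_000)  # "up to some maximum storage limit" (arbitrary value)
def get_depiction(smiles: str) -> str:
    """Generate on first request, then serve repeats from the cache."""
    return render_depiction(smiles)


get_depiction("CCO")        # cache miss: rendered
get_depiction("c1ccccc1")   # cache miss: rendered
get_depiction("CCO")        # cache hit: no rendering
print(get_depiction.cache_info())
```

With this design only cache misses pay the rendering cost, which is why the speed of generating an unstored structure still matters.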

On Thu, Dec 29, 2016 at 12:04 PM, Dimitri Maziuk 
wrote:

> On 2016-12-29 07:19, John M wrote:
>
> > For why you need sub-second depiction consider these times for 92877507
> > structures (current size PubChem Compound):
> >
> > 1s per structure = 1074 days (~3 years)
> > 100 ms per structure = 107 days
> > 1ms per structure = 25 hours
>
> The Dilbert answer is buy a better computer. The serious answer is if
> you run millions of jobs sequentially on a single core, your problem is
> not how long a single job takes: no matter how fast you can make it, it
> will only scale linearly. There will be 1B compounds in PubChem two
> years from now and your painstakingly crafted 1ms/structure code will
> still take 3 years, the only difference is you get garbage depictions.
>
> Condor can be persuaded to fire up 92877507 EC2 VMs and run all of those in
> parallel -- provided you're willing to pay Amazon for it of course. If
> you can code the algorithm into GPGPU/SIMD parallel flow, you can
> probably push it into an FPGA and then get that baked into ASICs in
> China -- they'll give you a discount if you order more than ten thousand.
> That gets you a $20 USB dongle that will run them at umpteen K/second.
> And so on.
>
> If you don't want quality depictions because bad ones will work just
> fine for your needs, that's a perfectly good argument. If you don't want
> them because generating 10M sequentially on a single core will take a
> long time, that's a BS argument.
>
> Dima


Re: [Rdkit-discuss] drawing code take 3

2016-12-29 Thread Dimitri Maziuk
On 2016-12-29 07:19, John M wrote:

> For why you need sub-second depiction consider these times for 92877507
> structures (current size PubChem Compound):
>
> 1s per structure = 1074 days (~3 years)
> 100 ms per structure = 107 days
> 1ms per structure = 25 hours

The Dilbert answer is buy a better computer. The serious answer is if 
you run millions of jobs sequentially on a single core, your problem is 
not how long a single job takes: no matter how fast you can make it, it 
will only scale linearly. There will be 1B compounds in PubChem two 
years from now and your painstakingly crafted 1ms/structure code will 
still take 3 years, the only difference is you get garbage depictions.

Condor can be persuaded to fire up 92877507 EC2 VMs and run all of those in
parallel -- provided you're willing to pay Amazon for it of course. If
you can code the algorithm into GPGPU/SIMD parallel flow, you can
probably push it into an FPGA and then get that baked into ASICs in
China -- they'll give you a discount if you order more than ten thousand.
That gets you a $20 USB dongle that will run them at umpteen K/second.
And so on.

If you don't want quality depictions because bad ones will work just
fine for your needs, that's a perfectly good argument. If you don't want
them because generating 10M sequentially on a single core will take a
long time, that's a BS argument.

Dima
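
As a small-scale illustration of the "don't run it sequentially on one core" point, the work can be split into independent batches and handed to a process pool. This is a sketch, not a Condor or EC2 setup: `depict_batch` is a hypothetical stand-in for the real per-structure work, and the batch size and worker count are arbitrary.

```python
from concurrent.futures import ProcessPoolExecutor
from typing import Iterator, List, Sequence


def split_into_batches(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def depict_batch(batch: List[str]) -> List[str]:
    """Hypothetical stand-in for depicting every structure in one batch."""
    return [f"<svg><!-- {smiles} --></svg>" for smiles in batch]


def depict_all(smiles: Sequence[str], workers: int, batch_size: int) -> List[str]:
    """Depict all structures using a pool of worker processes."""
    results: List[str] = []
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order, one list per batch.
        for rendered in pool.map(depict_batch, split_into_batches(smiles, batch_size)):
            results.extend(rendered)
    return results
```

In a script, `depict_all(smiles_list, workers=8, batch_size=1000)` would be called under an `if __name__ == "__main__":` guard. Because the batches are independent, the same split carries over to a cluster scheduler.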




Re: [Rdkit-discuss] drawing code take 3

2016-12-29 Thread Peter S. Shenkin
As a thought, it might make sense to consider a distinction between
publication-quality images and "pretty good" images. The latter require
speed and clarity, whereas a number of additional niceties (I hate to use
the word "elegance") would be highly desirable for the former, even at the
expense of speed. For example, for publication-quality images, one might
try to adhere more closely to the IUPAC recommendations for 2D depictions.

-P.



On Thu, Dec 29, 2016 at 8:53 AM, Brian Kelley  wrote:

> Perhaps we could train a ML algorithm to know which algorithm to use when
> :)
>
> Cheers,
>  Brian
>
> On Thu, Dec 29, 2016 at 8:19 AM, John M 
> wrote:
>
>> Hi Peter,
>>
>> I uploaded the benchmark set here:
>> https://github.com/johnmay/layout-benchmark and have tested on their web
>> service a few weeks ago. IIRC it did seem quite slow, maybe fine for
>> ahead of time generation
>> but not usable for on demand depiction. It does produce very nice
>> depictions but I think the right way to go is described by Alex Clark (2006
>> I think?) and used by MOE. Essentially use optimisation for certain
>> parts/classes of structure but not everything.
>>
>> Unfortunately no comparison to MOE/ChemDraw in the paper.
>>
>> For why you need sub-second depiction consider these times for 92877507
>> structures (current size PubChem Compound):
>>
>> 1s per structure = 1074 days (~3 years)
>> 100 ms per structure = 107 days
>> 1ms per structure = 25 hours
>>
>> John
>>
>> On 15 December 2016 at 23:12, Peter S. Shenkin  wrote:
>>
>>> Yes, of course, storing the images is an alternative.
>>>
>>> -P.
>>>
>>> On Thu, Dec 15, 2016 at 5:46 PM, Dimitri Maziuk 
>>> wrote:
>>>
>>>> On 12/15/2016 04:23 PM, Peter S. Shenkin wrote:
>>>>
>>>> > Obviously, it doesn't matter if you're rendering just a few structures,
>>>> > but in a scenario where you might be downloading a hundred SMILES from a
>>>> > DB and displaying them on a grid in a browser, computing the 2D depictions
>>>> > on the fly, waiting 5 sec for a page refresh wouldn't be great.
>>>>
>>>> Maybe not, but depending how the browser lays out the grid, it may take
>>>> 5 seconds anyway.
>>>>
>>>> My recommendation for that use case would be to pre-generate the images
>>>> and store the URLs in that database. Which is what we do here.
>>>>
>>>> ;)
>>>> --
>>>> Dimitri Maziuk
>>>> Programmer/sysadmin
>>>> BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu



Re: [Rdkit-discuss] drawing code take 3

2016-12-29 Thread Brian Kelley
Perhaps we could train a ML algorithm to know which algorithm to use when :)

Cheers,
 Brian

On Thu, Dec 29, 2016 at 8:19 AM, John M  wrote:

> Hi Peter,
>
> I uploaded the benchmark set here:
> https://github.com/johnmay/layout-benchmark and have tested on their web
> service a few weeks ago. IIRC it did seem quite slow, maybe fine for
> ahead of time generation
> but not usable for on demand depiction. It does produce very nice
> depictions but I think the right way to go is described by Alex Clark (2006
> I think?) and used by MOE. Essentially use optimisation for certain
> parts/classes of structure but not everything.
>
> Unfortunately no comparison to MOE/ChemDraw in the paper.
>
> For why you need sub-second depiction consider these times for 92877507
> structures (current size PubChem Compound):
>
> 1s per structure = 1074 days (~3 years)
> 100 ms per structure = 107 days
> 1ms per structure = 25 hours
>
> John
>
> On 15 December 2016 at 23:12, Peter S. Shenkin  wrote:
>
>> Yes, of course, storing the images is an alternative.
>>
>> -P.
>>
>> On Thu, Dec 15, 2016 at 5:46 PM, Dimitri Maziuk 
>> wrote:
>>
>>> On 12/15/2016 04:23 PM, Peter S. Shenkin wrote:
>>>
>>> > Obviously, it doesn't matter if you're rendering just a few structures,
>>> > but in a scenario where you might be downloading a hundred SMILES from a
>>> > DB and displaying them on a grid in a browser, computing the 2D depictions
>>> > on the fly, waiting 5 sec for a page refresh wouldn't be great.
>>>
>>> Maybe not, but depending how the browser lays out the grid, it may take
>>> 5 seconds anyway.
>>>
>>> My recommendation for that use case would be to pre-generate the images
>>> and store the URLs in that database. Which is what we do here.
>>>
>>> ;)
>>> --
>>> Dimitri Maziuk
>>> Programmer/sysadmin
>>> BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu


Re: [Rdkit-discuss] drawing code take 3

2016-12-29 Thread John M
Hi Peter,

I uploaded the benchmark set here:
https://github.com/johnmay/layout-benchmark and tested it on their web
service a few weeks ago. IIRC it did seem quite slow, maybe fine for
ahead-of-time generation but not usable for on-demand depiction. It does
produce very nice depictions but I think the right way to go is described by
Alex Clark (2006 I think?) and used by MOE. Essentially use optimisation for
certain parts/classes of structure but not everything.

Unfortunately no comparison to MOE/ChemDraw in the paper.

For why you need sub-second depiction consider these times for 92877507
structures (current size PubChem Compound):

1s per structure = 1074 days (~3 years)
100 ms per structure = 107 days
1ms per structure = 25 hours

John
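
The figures above follow from multiplying the structure count by the per-structure time. A short sketch that reproduces them for a single sequential process (about 1075 days, 107.5 days and 25.8 hours before rounding):

```python
from datetime import timedelta

N_STRUCTURES = 92_877_507  # PubChem Compound size quoted above


def sequential_time(n_structures: int, ms_per_structure: float) -> timedelta:
    """Total time to depict n_structures one after another."""
    return timedelta(milliseconds=n_structures * ms_per_structure)


for ms in (1000, 100, 1):
    seconds = sequential_time(N_STRUCTURES, ms).total_seconds()
    print(f"{ms:>5} ms/structure -> {seconds / 86400:8.1f} days "
          f"({seconds / 3600:9.1f} hours)")
```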

On 15 December 2016 at 23:12, Peter S. Shenkin  wrote:

> Yes, of course, storing the images is an alternative.
>
> -P.
>
> On Thu, Dec 15, 2016 at 5:46 PM, Dimitri Maziuk 
> wrote:
>
>> On 12/15/2016 04:23 PM, Peter S. Shenkin wrote:
>>
>> > Obviously, it doesn't matter if you're rendering just a few structures,
>> > but in a scenario where you might be downloading a hundred SMILES from a
>> > DB and displaying them on a grid in a browser, computing the 2D depictions
>> > on the fly, waiting 5 sec for a page refresh wouldn't be great.
>>
>> Maybe not, but depending how the browser lays out the grid, it may take
>> 5 seconds anyway.
>>
>> My recommendation for that use case would be to pre-generate the images
>> and store the URLs in that database. Which is what we do here.
>>
>> ;)
>> --
>> Dimitri Maziuk
>> Programmer/sysadmin
>> BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu


Re: [Rdkit-discuss] drawing code take 3

2016-12-15 Thread Peter S. Shenkin
Yes, of course, storing the images is an alternative.

-P.

On Thu, Dec 15, 2016 at 5:46 PM, Dimitri Maziuk 
wrote:

> On 12/15/2016 04:23 PM, Peter S. Shenkin wrote:
>
> > Obviously, it doesn't matter if you're rendering just a few structures, but
> > in a scenario where you might be downloading a hundred SMILES from a DB and
> > displaying them on a grid in a browser, computing the 2D depictions on the
> > fly, waiting 5 sec for a page refresh wouldn't be great.
>
> Maybe not, but depending how the browser lays out the grid, it may take
> 5 seconds anyway.
>
> My recommendation for that use case would be to pre-generate the images
> and store the URLs in that database. Which is what we do here.
>
> ;)
> --
> Dimitri Maziuk
> Programmer/sysadmin
> BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu


Re: [Rdkit-discuss] drawing code take 3

2016-12-15 Thread Dimitri Maziuk
On 12/15/2016 04:23 PM, Peter S. Shenkin wrote:

> Obviously, it doesn't matter if you're rendering just a few structures, but
> in a scenario where you might be downloading a hundred SMILES from a DB and
> displaying them on a grid in a browser, computing the 2D depictions on the
> fly, waiting 5 sec for a page refresh wouldn't be great.

Maybe not, but depending how the browser lays out the grid, it may take
5 seconds anyway.

My recommendation for that use case would be to pre-generate the images
and store the URLs in that database. Which is what we do here.

;)
-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
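
The pre-generate-and-store-URLs recommendation above could look roughly like the following. This is a sketch only: the table name, columns and `example.org` URLs are hypothetical, and it uses an in-memory SQLite database rather than whatever database a real site runs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a real deployment would use a persistent DB
conn.execute(
    "CREATE TABLE compound ("
    " id INTEGER PRIMARY KEY,"
    " smiles TEXT NOT NULL,"
    " image_url TEXT)"  # NULL until the depiction has been generated
)


def register_depiction(conn, compound_id, smiles, image_url):
    """Record where the pre-generated image for a compound lives."""
    conn.execute(
        "INSERT OR REPLACE INTO compound (id, smiles, image_url) VALUES (?, ?, ?)",
        (compound_id, smiles, image_url),
    )


def image_urls_for_page(conn, compound_ids):
    """Look up stored image URLs for one page of results; no rendering happens here."""
    placeholders = ",".join("?" for _ in compound_ids)
    rows = conn.execute(
        f"SELECT id, image_url FROM compound WHERE id IN ({placeholders}) ORDER BY id",
        list(compound_ids),
    )
    return dict(rows.fetchall())


register_depiction(conn, 1, "CCO", "https://example.org/depictions/1.svg")
register_depiction(conn, 2, "c1ccccc1", "https://example.org/depictions/2.svg")
print(image_urls_for_page(conn, [1, 2]))
```

The page then only emits image tags pointing at the stored URLs, so its response time does not depend on layout speed.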





Re: [Rdkit-discuss] drawing code take 3

2016-12-15 Thread Peter S. Shenkin
Well, Figure 10 shows that a molecule with about 25 heavy atoms takes about
50 ms to optimize.

In John Mayfield's UGM talk, it looks like CDK is taking an average of 1 ms
for "easy" structures and 56 ms for the hard ones, some of which are
depicted and have far more than 25 heavy atoms.

We don't know the details of the two data sets, so a head-to-head
comparison is tough, but intuitively, 20 structures/sec sounds slow.

Having said that, it's reasonable to pay a price in speed for additional
quality and robustness.

Obviously, it doesn't matter if you're rendering just a few structures, but
in a scenario where you might be downloading a hundred SMILES from a DB and
displaying them on a grid in a browser, computing the 2D depictions on the
fly, waiting 5 sec for a page refresh wouldn't be great.

-P.

On Thu, Dec 15, 2016 at 4:22 PM, Dimitri Maziuk 
wrote:

> On 12/15/2016 02:53 PM, Peter S. Shenkin wrote:
> > Looks good, but maybe too slow for production use... (?)
>
> I wonder what kind of production use would require sub-second wall clock
> time for this.
>
> --
> Dimitri Maziuk
> Programmer/sysadmin
> BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu


Re: [Rdkit-discuss] drawing code take 3

2016-12-15 Thread Dimitri Maziuk
On 12/15/2016 02:53 PM, Peter S. Shenkin wrote:
> Looks good, but maybe too slow for production use... (?)

I wonder what kind of production use would require sub-second wall clock
time for this.

-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu





Re: [Rdkit-discuss] drawing code take 3

2016-12-15 Thread Peter S. Shenkin
Looks good, but maybe too slow for production use... (?)

-P.

On Thu, Dec 15, 2016 at 3:38 PM, Chris Swain  wrote:

> At first glance this looks an interesting approach.
>
> Simulation-Based Algorithm for Two-Dimensional Chemical Structure Diagram
> Generation of Complex Molecules and Ligand–Protein Interactions
> DOI: http://dx.doi.org/10.1021/acs.jcim.6b00391
>
> On 27 Sep 2016, at 05:38, rdkit-discuss-requ...@lists.sourceforge.net
> wrote:
>
> 2D drawing code is tough. The 90/10 rule applies: the last 10% of
> correctness takes 90% of the effort.
>
> I like Dmitri Agrafiotis's method, but IIRC it's patented; also, though
> it's good for rough work, it doesn't produce "beautiful" structural
> diagrams.
>
> Some of the 2D drawing methods that do produce "pretty" pictures have a
> large number of templates built in that match the most common (and even
> somewhat uncommon) motifs, and they fall down when they hit something they
> can't get a close enough match for. And then, the IUPAC has a whole list of
> "desirable" features in 2D diagrams (as in, "Don't show it this way, but
> rather show it that way."). So even if you produce what might appear to be
> an acceptable drawing, it might not match the IUPAC list of desirables.
>
> I think for the present purposes what we need is something correct, robust
> and legible, and of course the example shown does not exhibit that. (But I
> don't know what the starting SMILES is, so I don't know whether the
> 7-bonded C is due to a bad SMILES, in which case all bets are off.)
>
> In addition, I think some discussion earlier indicated that the RDKit 2D
> structures look much worse when H's are included.
>
> I actually wrote a code one time (while at Schrödinger) to give a "badness"
> score to 2D structures. When our 2D depiction development was in progress,
> we created 2D SD files for many thousands of structures. I could put these
> through the program and sort with the worst on top. That allowed the most
> severe problems to be identified more quickly than, say, looking at
> thousands of 2D diagrams. The program looked at three things: Number of
> bonds that crossed, Number of atoms that were too close together, and Large
> disparity of bond lengths within the same molecule. (The checking code
> didn't deal with labels.)
>
> Writing the checker was a fun project, but I'm glad I didn't have to write
> the 2D depiction code. As Mark Twain said, "Improving oneself is good.
> Improving others is better -- and easier."
>
> -P.


Re: [Rdkit-discuss] drawing code take 3

2016-12-15 Thread Chris Swain
At first glance this looks an interesting approach.

Simulation-Based Algorithm for Two-Dimensional Chemical Structure Diagram 
Generation of Complex Molecules and Ligand–Protein Interactions
DOI: http://dx.doi.org/10.1021/acs.jcim.6b00391

> On 27 Sep 2016, at 05:38, rdkit-discuss-requ...@lists.sourceforge.net wrote:
> 
> 2D drawing code is tough. The 90/10 rule applies: the last 10% of
> correctness takes 90% of the effort.
> 
> I like Dmitri Agrafiotis's method, but IIRC it's patented; also, though
> it's good for rough work, it doesn't produce "beautiful" structural
> diagrams.
> 
> Some of the 2D drawing methods that do produce "pretty" pictures have a
> large number of templates built in that match the most common (and even
> somewhat uncommon) motifs, and they fall down when they hit something they
> can't get a close enough match for. And then, the IUPAC has a whole list of
> "desirable" features in 2D diagrams (as in, "Don't show it this way, but
> rather show it that way."). So even if you produce what might appear to be
> an acceptable drawing, it might not match the IUPAC list of desirables.
> 
> I think for the present purposes what we need is something correct, robust
> and legible, and of course the example shown does not exhibit that. (But I
> don't know what the starting SMILES is, so I don't know whether the
> 7-bonded C is due to a bad SMILES, in which case all bets are off.)
> 
> In addition, I think some discussion earlier indicated that the RDKit 2D
> structures look much worse when H's are included.
> 
> I actually wrote a code one time (while at Schrödinger) to give a "badness"
> score to 2D structures. When our 2D depiction development was in progress,
> we created 2D SD files for many thousands of structures. I could put these
> through the program and sort with the worst on top. That allowed the most
> severe problems to be identified more quickly than, say, looking at
> thousands of 2D diagrams. The program looked at three things: Number of
> bonds that crossed, Number of atoms that were too close together, and Large
> disparity of bond lengths within the same molecule. (The checking code
> didn't deal with labels.)
> 
> Writing the checker was a fun project, but I'm glad I didn't have to write
> the 2D depiction code. As Mark Twain said, "Improving oneself is good.
> Improving others is better -- and easier."
> 
> -P.
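
The three checks described in the quoted message (crossing bonds, atoms too close together, bond-length disparity) can be sketched as a simple score over 2D coordinates. This is not the original program: the distance threshold, the weights and the way the three terms are combined are all assumptions made for illustration, and labels are ignored here as well.

```python
from itertools import combinations
from math import hypot


def _orient(p, q, r):
    """Sign of the cross product (q - p) x (r - p): 1, 0 or -1."""
    v = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (v > 0) - (v < 0)


def _segments_cross(a, b, c, d):
    """True if segments ab and cd properly intersect (touching is ignored)."""
    return (_orient(a, b, c) * _orient(a, b, d) < 0
            and _orient(c, d, a) * _orient(c, d, b) < 0)


def badness(coords, bonds, min_dist=0.5, w_cross=1.0, w_close=1.0, w_len=1.0):
    """Higher is worse. coords: list of (x, y); bonds: list of (i, j) index pairs."""
    # 1. Pairs of bonds that cross, skipping bonds that share an atom.
    crossings = sum(
        1 for (i, j), (k, l) in combinations(bonds, 2)
        if len({i, j, k, l}) == 4
        and _segments_cross(coords[i], coords[j], coords[k], coords[l])
    )
    # 2. Pairs of atoms closer together than min_dist.
    too_close = sum(
        1 for p, q in combinations(coords, 2)
        if hypot(p[0] - q[0], p[1] - q[1]) < min_dist
    )
    # 3. Bond-length disparity: longest / shortest - 1 (0 when all are equal).
    lengths = [hypot(coords[i][0] - coords[j][0], coords[i][1] - coords[j][1])
               for i, j in bonds]
    disparity = max(lengths) / min(lengths) - 1.0 if lengths and min(lengths) > 0 else 0.0
    return w_cross * crossings + w_close * too_close + w_len * disparity


square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
print(badness(square, [(0, 1), (1, 2), (2, 3), (3, 0)]))  # clean four-membered ring
print(badness(square, [(0, 2), (1, 3)]))                  # two crossing diagonals
```

Sorting a set of layouts with `sorted(layouts, key=..., reverse=True)` on this score would put the worst depictions on top, as described in the message.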



Re: [Rdkit-discuss] drawing code take 3

2016-09-27 Thread Dimitri Maziuk
On 2016-09-26 18:19, Peter S. Shenkin wrote:

> 2D drawing code is tough. The 90/10 rule applies: the last 10% of
> I think for the present purposes what we need is something correct,
> robust and legible, and of course the example shown does not exhibit
> that. (But I don't know what the starting SMILES is, so I don't know
> whether the 7-bonded C is due to a bad SMILES, in which case all bets
> are off.)

That was actually a "kudos to RDKit" post.

I have an application where I need a drawing with all Hs and all atom 
labels, and molecule description in mmCIF(-ish) format. I use RDKit for 
the latter because of OpenBabel's stereochemistry "model", and OpenBabel 
for the drawings because 90% of the time it generates better layouts.

The comment is that RDKit's layout algorithm appears to be more stable:
for this molecule OB generated a "better" picture from the original SDF
downloaded from PubChem, and that complete mess when we re-ordered the
atoms. RDKit generated the same picture in both cases, except that one is
a mirror image of the other.

Dima




Re: [Rdkit-discuss] drawing code take 3

2016-09-26 Thread Peter S. Shenkin
2D drawing code is tough. The 90/10 rule applies: the last 10% of
correctness takes 90% of the effort.

I like Dimitris Agrafiotis's method, but IIRC it's patented; also, though
it's good for rough work, it doesn't produce "beautiful" structural
diagrams.

Some of the 2D drawing methods that do produce "pretty" pictures have a
large number of templates built in that match the most common (and even
somewhat uncommon) motifs, and they fall down when they hit something they
can't get a close enough match for. And then, the IUPAC has a whole list of
"desirable" features in 2D diagrams (as in, "Don't show it this way, but
rather show it that way."). So even if you produce what might appear to be
an acceptable drawing, it might not match the IUPAC list of desirables.

I think for the present purposes what we need is something correct, robust
and legible, and of course the example shown does not exhibit that. (But I
don't know what the starting SMILES is, so I don't know whether the
7-bonded C is due to a bad SMILES, in which case all bets are off.)

In addition, I think some discussion earlier indicated that the RDKit 2D
structures look much worse when H's are included.

I actually wrote a code one time (while at Schrödinger) to give a "badness"
score to 2D structures. When our 2D depiction development was in progress,
we created 2D SD files for many thousands of structures. I could put these
through the program and sort with the worst on top. That allowed the most
severe problems to be identified more quickly than, say, looking at
thousands of 2D diagrams. The program looked at three things: the number of
bonds that crossed, the number of atoms that were too close together, and any
large disparity of bond lengths within the same molecule. (The checking code
didn't deal with labels.)
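
For anyone who wants to try the idea, here is a minimal sketch of such a
checker in plain Python. It is not the program described above; it just
scores the same three things. The input format (a list of (x, y) atom
positions plus a list of bonded index pairs), the min_dist threshold, the
weights, and the function names badness and worst_first are all assumptions
made for illustration.

```python
from itertools import combinations
from math import hypot


def _dist(p, q):
    return hypot(p[0] - q[0], p[1] - q[1])


def _orient(p, q, r):
    # Sign of the cross product (q - p) x (r - p): which side of pq r lies on.
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])


def _cross(a, b, c, d):
    # Proper intersection only: touching or collinear segments do not count.
    return (_orient(a, b, c) * _orient(a, b, d) < 0
            and _orient(c, d, a) * _orient(c, d, b) < 0)


def badness(coords, bonds, min_dist=0.5, weights=(10.0, 5.0, 1.0)):
    """Score one 2D layout; higher is worse.

    coords: list of (x, y) per atom; bonds: list of (i, j) atom-index pairs.
    min_dist and weights are illustrative guesses, not values from the thread.
    """
    crossings = sum(
        1
        for (i, j), (k, l) in combinations(bonds, 2)
        if len({i, j, k, l}) == 4  # bonds sharing an atom cannot cross
        and _cross(coords[i], coords[j], coords[k], coords[l])
    )
    bonded = {frozenset(b) for b in bonds}
    # Assumption: only non-bonded pairs count as "too close".
    close = sum(
        1
        for i, j in combinations(range(len(coords)), 2)
        if frozenset((i, j)) not in bonded
        and _dist(coords[i], coords[j]) < min_dist
    )
    lengths = [_dist(coords[i], coords[j]) for i, j in bonds]
    if not lengths:
        disparity = 0.0
    elif min(lengths) == 0:
        disparity = float("inf")
    else:
        disparity = max(lengths) / min(lengths) - 1.0
    w_cross, w_close, w_len = weights
    total = w_cross * crossings + w_close * close + w_len * disparity
    return {"crossings": crossings, "close_atoms": close,
            "length_disparity": disparity, "total": total}


def worst_first(structures):
    """Sort (coords, bonds) pairs so the worst-scoring layout comes first."""
    return sorted(structures, key=lambda s: badness(*s)["total"], reverse=True)
```

Feeding worst_first a list of (coords, bonds) pairs read from 2D SD files
would give the "worst on top" ordering. Like the original checker, this
sketch ignores labels.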

Writing the checker was a fun project, but I'm glad I didn't have to write
the 2D depiction code. As Mark Twain said, "Improving oneself is good.
Improving others is better – and easier."

-P.

On Mon, Sep 26, 2016 at 5:54 PM, Dimitri Maziuk 
wrote:

> On 09/26/2016 04:42 PM, Peter S. Shenkin wrote:
> > Also, the C attached to H44 has an extra H (its own or someone else's?)
> > superimposed upon it.
>
> I wonder if 2D drawing code should really work the same way as the 3D
> conformer generation: generate a bunch of candidate layouts and pick the
> one(s) with least clashes/overlaps.


Re: [Rdkit-discuss] drawing code take 3

2016-09-26 Thread Dimitri Maziuk
On 09/26/2016 04:42 PM, Peter S. Shenkin wrote:
> Also, the C attached to H44 has an extra H (its own or someone else's?)
> superimposed upon it.

I wonder if 2D drawing code should really work the same way as the 3D
conformer generation: generate a bunch of candidate layouts and pick the
one(s) with least clashes/overlaps.
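
A rough sketch of that sample-and-select idea, in plain Python. This is a
toy and does not use RDKit's or OpenBabel's layout code: the candidate
generator only handles connected, acyclic graphs (unit-length bonds at
random multiples of 30 degrees), and the names random_tree_layout,
count_clashes and best_layout, the 0.5 clash threshold and the default of
50 candidates are all assumptions for illustration.

```python
import math
import random
from itertools import combinations


def random_tree_layout(n_atoms, bonds, rng):
    """Toy generator: lay out a connected, acyclic graph with unit bonds.

    Each atom is placed one unit from its parent at a random multiple of
    30 degrees. Rings are not handled.
    """
    neighbours = {i: [] for i in range(n_atoms)}
    for i, j in bonds:
        neighbours[i].append(j)
        neighbours[j].append(i)
    coords = {0: (0.0, 0.0)}
    stack = [0]
    while stack:
        atom = stack.pop()
        x, y = coords[atom]
        for nbr in neighbours[atom]:
            if nbr not in coords:
                angle = math.radians(rng.choice(range(0, 360, 30)))
                coords[nbr] = (x + math.cos(angle), y + math.sin(angle))
                stack.append(nbr)
    return [coords[i] for i in range(n_atoms)]


def count_clashes(coords, min_dist=0.5):
    """Number of atom pairs closer than min_dist (overlapping atoms)."""
    return sum(
        1
        for p, q in combinations(coords, 2)
        if math.hypot(p[0] - q[0], p[1] - q[1]) < min_dist
    )


def best_layout(n_atoms, bonds, n_candidates=50, seed=0):
    """Generate n_candidates random layouts and keep the one with fewest clashes."""
    rng = random.Random(seed)
    candidates = [random_tree_layout(n_atoms, bonds, rng)
                  for _ in range(n_candidates)]
    return min(candidates, key=count_clashes)
```

A real implementation would swap the toy generator for the layout engine's
own sampling and would probably score bond crossings as well as overlaps,
but the select-the-least-bad step would look much the same.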

-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu





Re: [Rdkit-discuss] drawing code take 3

2016-09-26 Thread Peter S. Shenkin
Also, the C attached to H44 has an extra H (its own or someone else's?)
superimposed upon it.

-P.

On Mon, Sep 26, 2016 at 5:38 PM, Dimitri Maziuk 
wrote:

>
> On the plus side, when drawing PubChem CID 5057 from a 3D SDF before and
> after our canonicalization, RDKit draws a mirror image, but otherwise
> the same 2D structure. OB's "after" version is attached: enjoy the
> 7-bond carbon in the ring.
>
> ;)


[Rdkit-discuss] drawing code take 3

2016-09-26 Thread Dimitri Maziuk

On the plus side, when drawing PubChem CID 5057 from a 3D SDF before and
after our canonicalization, RDKit draws a mirror image, but otherwise
the same 2D structure. OB's "after" version is attached: enjoy the
7-bond carbon in the ring.

;)
-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu

