There is no need to fill the channel prior to starting the workers. First of
all, 100 repetitions of SHA256 per worker take ~0.7 sec on my system. I didn't
benchmark the generator thread, but considering that even your timing gives
~0.054 sec per string, it will most definitely remain fast enough to provide
all workers with data. But even with this in mind, I re-ran the test with only
100-character strings being generated, i.e. with the generator line reduced to:
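    $c.send( (1..100).map( { (' '..'Z').pick } ).join );   # was 1..1024

Here is what I've got: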
Benchmark:
Timing 1 iterations of workers1, workers10, workers15, workers2, workers3, workers5...
workers1: 22.473 wallclock secs (22.609 usr 0.231 sys 22.840 cpu) @ 0.044/s (n=1)
(warning: too few iterations for a reliable count)
workers10: 6.154 wallclock secs (44.087 usr 11.149 sys 55.236 cpu) @ 0.162/s (n=1)
(warning: too few iterations for a reliable count)
workers15: 6.165 wallclock secs (50.206 usr 9.540 sys 59.745 cpu) @ 0.162/s (n=1)
(warning: too few iterations for a reliable count)
workers2: 14.102 wallclock secs (26.524 usr 0.618 sys 27.142 cpu) @ 0.071/s (n=1)
(warning: too few iterations for a reliable count)
workers3: 10.553 wallclock secs (27.808 usr 1.404 sys 29.213 cpu) @ 0.095/s (n=1)
(warning: too few iterations for a reliable count)
workers5: 7.650 wallclock secs (31.099 usr 3.803 sys 34.902 cpu) @ 0.131/s (n=1)
(warning: too few iterations for a reliable count)
O-----------O----------O----------O-----------O----------O-----------O----------O----------O
|           | s/iter   | workers3 | workers15 | workers5 | workers10 | workers2 | workers1 |
O===========O==========O==========O===========O==========O===========O==========O==========O
| workers3  | 10553022 | --       | -42%      | -28%     | -42%      | 34%      | 113%     |
| workers15 | 6165235  | 71%      | --        | 24%      | -0%       | 129%     | 265%     |
| workers5  | 7650413  | 38%      | -19%      | --       | -20%      | 84%      | 194%     |
| workers10 | 6154300  | 71%      | 0%        | 24%      | --        | 129%     | 265%     |
| workers2  | 14101512 | -25%     | -56%      | -46%     | -56%      | --       | 59%      |
| workers1  | 22473185 | -53%     | -73%      | -66%     | -73%      | -37%     | --       |
--------------------------------------------------------------------------------------------
What's more important is the observation of CPU consumption by the moar
process. Depending on the number of workers, I was getting anywhere from 100%
load for a single worker up to 1000% for the whole bunch of 15. This
corresponds perfectly to the 6 cores / 2 threads per core of my CPU.
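
For what it's worth, that hardware limit can be queried at run time and used
to size the pool; a minimal sketch, assuming a reasonably recent Rakudo:

    # Logical CPUs the OS reports (cores x threads per core)
    my $max-useful = $*KERNEL.cpu-cores;   # 12 here: 6 cores x 2 threads
    say $max-useful;

For CPU-bound work like this, a pool larger than that cannot add throughput.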
> On Dec 7, 2018, at 02:06, yary <[email protected]> wrote:
>
> That was a bit vague. I meant that I suspect the workers are being
> starved, since you have many consumers and only a single thread
> generating the 1k strings. I would prime the channel so it starts out
> full, or do some other restructuring to ensure all threads are kept
> busy.
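>
> E.g. a minimal sketch of the priming idea (untested, reusing your
> worker sub and a $workers count): generate everything up front, fill
> the channel, close it, and only then start the consumers:
>
>     my $c = Channel.new;
>     $c.send( (1..1024).map( { (' '..'Z').pick } ).join ) for 1..50;
>     $c.close;   # consumers drain the queued items, then stop
>     my @w = (1..$workers).map: {
>         start { react { whenever $c -> $str { worker($str) } } }
>     };
>     await @w;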
>
> -y
>
> On Thu, Dec 6, 2018 at 10:56 PM yary <[email protected]> wrote:
>>
>> Not sure if your test is measuring what you expect: the setup of
>> generating 50 x 1k strings is taking 2.7 sec on my laptop, and that's
>> reducing the apparent effect of parallelism.
>>
>> $ perl6
>> To exit type 'exit' or '^D'
>>> my $c = Channel.new;
>> Channel.new
>>> { for 1..50 {$c.send((1..1024).map( { (' '..'Z').pick } ).join);}; say now - ENTER now; }
>> 2.7289092
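>> (that's 2.73 s / 50 ≈ 0.0546 s per string)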
>>
>> I'd move the setup outside the "cmpthese", try again, and re-think
>> the new results.
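>>
>> E.g. something along these lines (a sketch, not tested; run() would
>> need to accept the pre-built strings):
>>
>>     my @strings = (1..50).map: {
>>         (1..1024).map( { (' '..'Z').pick } ).join
>>     };
>>     # ... then inside run(): $c.send($_) for @strings; LEAVE $c.close;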
>>
>>
>>
>> On 12/6/18, Vadim Belman <[email protected]> wrote:
>>> Hi everybody!
>>>
>>> I have recently played a bit with somewhat intense computations and tried to
>>> parallelize them among a couple of threaded workers. The results were
>>> somewhat... eh... discouraging. To sum up my findings I wrote a simple demo
>>> benchmark:
>>>
>>> use Digest::SHA;
>>> use Bench;
>>>
>>> sub worker ( Str:D $str ) {
>>>     my $digest = $str;
>>>
>>>     for 1..100 {
>>>         $digest = sha256 $digest;
>>>     }
>>> }
>>>
>>> sub run ( Int $workers ) {
>>>     my $c = Channel.new;
>>>
>>>     my @w;
>>>     @w.push: start {
>>>         for 1..50 {
>>>             $c.send(
>>>                 (1..1024).map( { (' '..'Z').pick } ).join
>>>             );
>>>         }
>>>         LEAVE $c.close;
>>>     }
>>>
>>>     for 1..$workers {
>>>         @w.push: start {
>>>             react {
>>>                 whenever $c -> $str {
>>>                     worker( $str );
>>>                 }
>>>             }
>>>         }
>>>     }
>>>
>>>     await @w;
>>> }
>>>
>>> my $b = Bench.new;
>>> $b.cmpthese(
>>>     1,
>>>     {
>>>         workers1  => sub { run( 1 ) },
>>>         workers5  => sub { run( 5 ) },
>>>         workers10 => sub { run( 10 ) },
>>>         workers15 => sub { run( 15 ) },
>>>     }
>>> );
>>>
>>> I tried this code with a macOS installation of Rakudo and with Linux in a
>>> VM box. Here are the macOS results (6 CPU cores):
>>>
>>> Timing 1 iterations of workers1, workers10, workers15, workers5...
>>> workers1: 27.176 wallclock secs (28.858 usr 0.348 sys 29.206 cpu) @ 0.037/s (n=1)
>>> (warning: too few iterations for a reliable count)
>>> workers10: 7.504 wallclock secs (56.903 usr 10.127 sys 67.030 cpu) @ 0.133/s (n=1)
>>> (warning: too few iterations for a reliable count)
>>> workers15: 7.938 wallclock secs (63.357 usr 9.483 sys 72.840 cpu) @ 0.126/s (n=1)
>>> (warning: too few iterations for a reliable count)
>>> workers5: 9.452 wallclock secs (40.185 usr 4.807 sys 44.992 cpu) @ 0.106/s (n=1)
>>> (warning: too few iterations for a reliable count)
>>> O-----------O----------O----------O-----------O-----------O----------O
>>> | | s/iter | workers1 | workers10 | workers15 | workers5 |
>>> O===========O==========O==========O===========O===========O==========O
>>> | workers1 | 27176370 | -- | -72% | -71% | -65% |
>>> | workers10 | 7503726 | 262% | -- | 6% | 26% |
>>> | workers15 | 7938428 | 242% | -5% | -- | 19% |
>>> | workers5 | 9452421 | 188% | -21% | -16% | -- |
>>> ----------------------------------------------------------------------
>>>
>>> And Linux (4 virtual cores):
>>>
>>> Timing 1 iterations of workers1, workers10, workers15, workers5...
>>> workers1: 27.240 wallclock secs (29.143 usr 0.129 sys 29.272 cpu) @ 0.037/s (n=1)
>>> (warning: too few iterations for a reliable count)
>>> workers10: 10.339 wallclock secs (37.964 usr 0.611 sys 38.575 cpu) @ 0.097/s (n=1)
>>> (warning: too few iterations for a reliable count)
>>> workers15: 10.221 wallclock secs (35.452 usr 1.432 sys 36.883 cpu) @ 0.098/s (n=1)
>>> (warning: too few iterations for a reliable count)
>>> workers5: 10.663 wallclock secs (36.983 usr 0.848 sys 37.831 cpu) @ 0.094/s (n=1)
>>> (warning: too few iterations for a reliable count)
>>> O-----------O----------O----------O----------O-----------O-----------O
>>> | | s/iter | workers5 | workers1 | workers15 | workers10 |
>>> O===========O==========O==========O==========O===========O===========O
>>> | workers5 | 10663102 | -- | 155% | -4% | -3% |
>>> | workers1 | 27240221 | -61% | -- | -62% | -62% |
>>> | workers15 | 10220862 | 4% | 167% | -- | 1% |
>>> | workers10 | 10338829 | 3% | 163% | -1% | -- |
>>> ----------------------------------------------------------------------
>>>
>>> Am I missing something here? Am I doing something wrong? Because it just
>>> doesn't fit into my mind...
>>>
>>> As a side note: by playing with 1-2-3 workers I see that each new thread
>>> gradually improves the total run time until a plateau is reached. The
>>> plateau is seemingly defined by the number of cores or, more precisely, by
>>> the number of supported threads. Proving this hypothesis would require more
>>> time than I have on my hands right now. And I'm not even sure such a proof
>>> makes sense.
>>>
>>> Best regards,
>>> Vadim Belman
>>>
>>>
>>
>>
>> --
>> -y
>
Best regards,
Vadim Belman