Hi everybody!
I have recently played a bit with somewhat intense computations and tried to
parallelize them among a couple of threaded workers. The results were
somewhat... eh... discouraging. To sum up my findings I wrote a simple demo
benchmark:
use Digest::SHA;
use Bench;

# CPU-bound unit of work: hash the input 100 times in a row.
sub worker ( Str:D $str ) {
    my $digest = $str;
    for 1..100 {
        $digest = sha256 $digest;
    }
}

sub run ( Int $workers ) {
    my $c = Channel.new;
    my @w;

    # Producer: send 50 random 1024-character strings, then close the channel.
    @w.push: start {
        for 1..50 {
            $c.send(
                (1..1024).map( { (' '..'Z').pick } ).join
            );
        }
        LEAVE $c.close;
    }

    # Consumers: $workers threads, each pulling strings from the same channel.
    for 1..$workers {
        @w.push: start {
            react {
                whenever $c -> $str {
                    worker( $str );
                }
            }
        }
    }

    await @w;
}

my $b = Bench.new;
$b.cmpthese(
    1,    # a single iteration per variant
    {
        workers1  => sub { run( 1 ) },
        workers5  => sub { run( 5 ) },
        workers10 => sub { run( 10 ) },
        workers15 => sub { run( 15 ) },
    }
);
I tried this code with a macOS installation of Rakudo and with Linux in a VM.
Here are the macOS results (6 CPU cores):
Timing 1 iterations of workers1, workers10, workers15, workers5...
workers1: 27.176 wallclock secs (28.858 usr 0.348 sys 29.206 cpu) @ 0.037/s (n=1)
(warning: too few iterations for a reliable count)
workers10: 7.504 wallclock secs (56.903 usr 10.127 sys 67.030 cpu) @ 0.133/s (n=1)
(warning: too few iterations for a reliable count)
workers15: 7.938 wallclock secs (63.357 usr 9.483 sys 72.840 cpu) @ 0.126/s (n=1)
(warning: too few iterations for a reliable count)
workers5: 9.452 wallclock secs (40.185 usr 4.807 sys 44.992 cpu) @ 0.106/s (n=1)
(warning: too few iterations for a reliable count)
O-----------O----------O----------O-----------O-----------O----------O
|           | s/iter   | workers1 | workers10 | workers15 | workers5 |
O===========O==========O==========O===========O===========O==========O
| workers1  | 27176370 | --       | -72%      | -71%      | -65%     |
| workers10 | 7503726  | 262%     | --        | 6%        | 26%      |
| workers15 | 7938428  | 242%     | -5%       | --        | 19%      |
| workers5  | 9452421  | 188%     | -21%      | -16%      | --       |
----------------------------------------------------------------------
And Linux (4 virtual cores):
Timing 1 iterations of workers1, workers10, workers15, workers5...
workers1: 27.240 wallclock secs (29.143 usr 0.129 sys 29.272 cpu) @ 0.037/s (n=1)
(warning: too few iterations for a reliable count)
workers10: 10.339 wallclock secs (37.964 usr 0.611 sys 38.575 cpu) @ 0.097/s (n=1)
(warning: too few iterations for a reliable count)
workers15: 10.221 wallclock secs (35.452 usr 1.432 sys 36.883 cpu) @ 0.098/s (n=1)
(warning: too few iterations for a reliable count)
workers5: 10.663 wallclock secs (36.983 usr 0.848 sys 37.831 cpu) @ 0.094/s (n=1)
(warning: too few iterations for a reliable count)
O-----------O----------O----------O----------O-----------O-----------O
|           | s/iter   | workers5 | workers1 | workers15 | workers10 |
O===========O==========O==========O==========O===========O===========O
| workers5  | 10663102 | --       | 155%     | -4%       | -3%       |
| workers1  | 27240221 | -61%     | --       | -62%      | -62%      |
| workers15 | 10220862 | 4%       | 167%     | --        | 1%        |
| workers10 | 10338829 | 3%       | 163%     | -1%       | --        |
----------------------------------------------------------------------
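To put the wallclock figures above into perspective, here is a minimal sketch that turns them into speedup and parallel efficiency. The helper names (speedup, efficiency, report) are made up for this illustration, and dividing by min(workers, cores) is just one assumed way to define efficiency. With these numbers the best macOS run is roughly 3.6x faster than the single-worker run on 6 cores, and the best Linux run roughly 2.7x on 4 cores.

```raku
# Wallclock seconds per worker count, copied from the benchmark output above.
my @macos = 1 => 27.176, 5 => 9.452,  10 => 7.504,  15 => 7.938;
my @linux = 1 => 27.240, 5 => 10.663, 10 => 10.339, 15 => 10.221;

# Speedup relative to the single-worker run.
sub speedup(Real $base, Real $t) { $base / $t }

# Assumed definition: speedup divided by the number of cores that could be busy.
sub efficiency(Real $s, Int $workers, Int $cores) {
    $s / min($workers, $cores)
}

# Print one line per run; the first entry of @runs is taken as the baseline.
sub report(Str $label, @runs, Int $cores) {
    say "$label ($cores cores)";
    my $base = @runs[0].value;
    for @runs -> $run {
        my $s = speedup($base, $run.value);
        printf "  %2d workers: speedup %.2f, efficiency %.2f\n",
            $run.key, $s, efficiency($s, $run.key, $cores);
    }
}

report 'macOS', @macos, 6;
report 'Linux', @linux, 4;
```

Note that the CPU column tells the other half of the story: on macOS the total CPU time grows from about 29 s with one worker to about 67 s with ten, so a good part of the extra cores goes into overhead rather than hashing.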
Am I missing something here? Am I doing something wrong? Because it just
doesn't make sense to me...
As a side note: playing with 1, 2, and 3 workers, I see that each new thread
gradually adds to the total run time until a plateau is reached. The plateau
seems to be defined by the number of cores or, more precisely, by the number
of supported threads. Proving this hypothesis would require more time than I
have on my hands right now, and I'm not even sure such a proof would make sense.
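For anyone who wants to check the plateau hypothesis, a rough sketch of the experiment could look like the code below. It is not the benchmark above: burn is a hypothetical pure-Raku stand-in for worker so that the script runs without Digest::SHA, the job count and loop length are arbitrary, and timed-run is a name invented here. It steps the worker count from 1 to a little past the core count reported by $*KERNEL.cpu-cores and prints the wallclock time for each.

```raku
# Hypothetical stand-in for worker(): burn CPU using integer arithmetic only.
sub burn(Int $seed) {
    my $x = $seed;
    $x = ($x * 1103515245 + 12345) % 2147483648 for ^100_000;
    $x
}

# Run $jobs units of work on $workers threads fed by one Channel
# and return the elapsed wallclock time.
sub timed-run(Int $workers, Int $jobs = 48) {
    my $c = Channel.new;
    $c.send($_) for ^$jobs;
    $c.close;                     # queue is complete, consumers stop when it drains
    my $t0 = now;
    await (^$workers).map: {
        start { burn($_) for $c.list }
    };
    now - $t0
}

# Step past the core count to see whether the times level off there.
my $cores = $*KERNEL.cpu-cores;
for 1 .. $cores + 2 -> $n {
    printf "%2d workers: %.3f s\n", $n, timed-run($n);
}
```

If the hypothesis holds, the printed times should stop improving at around $cores workers; a single pass like this is noisy, so several repetitions per worker count would be needed before reading much into it.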
Best regards,
Vadim Belman