Performance of parallel computing.

Vadim Belman Thu, 06 Dec 2018 17:39:57 -0800

Hi everybody!

I have recently played a bit with somewhat intense computations and tried to 
parallelize them among a couple of threaded workers. The results were 
somewhat... eh... discouraging. To sum up my findings I wrote a simple demo 
benchmark:


     use Digest::SHA;
     use Bench;

     sub worker ( Str:D $str ) {
         my $digest = $str;

         for 1..100 {
             $digest = sha256 $digest;
         }
     }

     sub run ( Int $workers ) {
         my $c = Channel.new;

         my @w;
         @w.push: start {
             for 1..50 {
                 $c.send(
                     (1..1024).map( { (' '..'Z').pick } ).join
                 );
             }
             LEAVE $c.close;
         }

         for 1..$workers {
             @w.push: start {
                 react {
                     whenever $c -> $str {
                         worker( $str );
                     }
                 }
             }
         }

         await @w;
     }

     my $b = Bench.new;
     $b.cmpthese(
         1,
         {
             workers1 => sub { run( 1 ) },
             workers5 => sub { run( 5 ) },
             workers10 => sub { run( 10 ) },
             workers15 => sub { run( 15 ) },
         }
     );

I tried this code with a macOS installation of Rakudo and with Linux in a VM. 
Here are the macOS results (6 CPU cores):

Timing 1 iterations of workers1, workers10, workers15, workers5...
  workers1: 27.176 wallclock secs (28.858 usr 0.348 sys 29.206 cpu) @ 0.037/s (n=1)
                (warning: too few iterations for a reliable count)
 workers10: 7.504 wallclock secs (56.903 usr 10.127 sys 67.030 cpu) @ 0.133/s (n=1)
                (warning: too few iterations for a reliable count)
 workers15: 7.938 wallclock secs (63.357 usr 9.483 sys 72.840 cpu) @ 0.126/s (n=1)
                (warning: too few iterations for a reliable count)
  workers5: 9.452 wallclock secs (40.185 usr 4.807 sys 44.992 cpu) @ 0.106/s (n=1)
                (warning: too few iterations for a reliable count)
O-----------O----------O----------O-----------O-----------O----------O
|           | s/iter   | workers1 | workers10 | workers15 | workers5 |
O===========O==========O==========O===========O===========O==========O
| workers1  | 27176370 | --       | -72%      | -71%      | -65%     |
| workers10 | 7503726  | 262%     | --        | 6%        | 26%      |
| workers15 | 7938428  | 242%     | -5%       | --        | 19%      |
| workers5  | 9452421  | 188%     | -21%      | -16%      | --       |
----------------------------------------------------------------------

And Linux (4 virtual cores):

Timing 1 iterations of workers1, workers10, workers15, workers5...
  workers1: 27.240 wallclock secs (29.143 usr 0.129 sys 29.272 cpu) @ 0.037/s (n=1)
                (warning: too few iterations for a reliable count)
 workers10: 10.339 wallclock secs (37.964 usr 0.611 sys 38.575 cpu) @ 0.097/s (n=1)
                (warning: too few iterations for a reliable count)
 workers15: 10.221 wallclock secs (35.452 usr 1.432 sys 36.883 cpu) @ 0.098/s (n=1)
                (warning: too few iterations for a reliable count)
  workers5: 10.663 wallclock secs (36.983 usr 0.848 sys 37.831 cpu) @ 0.094/s (n=1)
                (warning: too few iterations for a reliable count)
O-----------O----------O----------O----------O-----------O-----------O
|           | s/iter   | workers5 | workers1 | workers15 | workers10 |
O===========O==========O==========O==========O===========O===========O
| workers5  | 10663102 | --       | 155%     | -4%       | -3%       |
| workers1  | 27240221 | -61%     | --       | -62%      | -62%      |
| workers15 | 10220862 | 4%       | 167%     | --        | 1%        |
| workers10 | 10338829 | 3%       | 163%     | -1%       | --        |
----------------------------------------------------------------------

Am I missing something here? Am I doing something wrong? Because I just cannot 
make sense of these numbers...
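
To put numbers on that, here is a small sketch that derives the speedup and the 
growth in CPU time from the macOS figures above. The sub name `ratios` and the 
hash layout are made up for illustration only; the values are copied from the 
Bench output.

```raku
# Wallclock and CPU seconds copied from the macOS Bench output above.
my %wallclock = workers1  => 27.176, workers5  => 9.452,
                workers10 => 7.504,  workers15 => 7.938;
my %cpu       = workers1  => 29.206, workers5  => 44.992,
                workers10 => 67.030, workers15 => 72.840;

# Illustrative helper (not part of the benchmark): how much faster a run is
# than the single-worker run, and how much more CPU time it consumed.
sub ratios ( Str:D $name ) {
    %( speedup    => %wallclock<workers1> / %wallclock{$name},
       cpu-growth => %cpu{$name} / %cpu<workers1> )
}

for <workers1 workers5 workers10 workers15> -> $name {
    my %r = ratios( $name );
    printf "%-10s speedup %.2fx, CPU time %.2fx\n",
        $name, %r<speedup>, %r<cpu-growth>;
}
```

On these figures the best case (10 workers) is roughly 3.6 times faster than a 
single worker on 6 cores, while the total CPU time more than doubles.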

As a side note: by playing with 1-2-3 workers I see that each new thread 
gradually adds to the total run time until a plateau is reached. The plateau 
is seemingly defined by the number of cores or, more correctly, by the number 
of supported threads. Proving this hypothesis would require more time than I 
have on my hands right now. And I am not even sure such a proof would make 
sense.
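
If anybody wants to check the plateau idea, a sweep over worker counts could 
look roughly like the sketch below. It is meant to be self-contained, so 
`busy-work` and `timed-run` are made-up names, and the arithmetic loop is only 
an assumed stand-in for the SHA loop, not the benchmark above. The job count 
and loop size are arbitrary.

```raku
# Hypothetical stand-in for the SHA loop: pure CPU work, no shared state.
sub busy-work ( Int:D $n ) {
    my $acc = 0;
    $acc = ($acc + $_ * $_) % 1_000_003 for 1..$n;
    $acc
}

# Run $jobs identical jobs through a channel with $workers consumers and
# return the elapsed wallclock time.
sub timed-run ( Int:D $workers, Int:D $jobs = 24 ) {
    my $c = Channel.new;
    $c.send( 50_000 ) for 1..$jobs;   # queue everything up front
    $c.close;

    my $start = now;
    await (1..$workers).map: {
        start {
            react {
                whenever $c -> $n { busy-work( $n ) }
            }
        }
    };
    now - $start
}

# Sweep the worker count; the timings will differ from machine to machine.
for 1..8 -> $workers {
    say "$workers worker(s): { timed-run( $workers ).fmt('%.3f') } s";
}
```

If the hypothesis holds, the wallclock time should stop improving somewhere 
around the number of cores.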

Best regards,
Vadim Belman
