You're damn right here. First of all, I must admit that I misinterpreted the
benchmark results (guilty). Still, I think I now know what's really happening
here. To make things clear I ran the benchmark for every number of workers
from 1 to 9 (a rough sketch of such a driver follows the comparison table
below). Here is the cleaned-up output:

     Timing 1 iterations of worker1, worker2, worker3, worker4, worker5, worker6, worker7, worker8, worker9...
        worker1: 22.125 wallclock secs (22.296 usr 0.248 sys 22.544 cpu) @ 0.045/s (n=1)
        worker2: 12.554 wallclock secs (24.221 usr 0.715 sys 24.936 cpu) @ 0.080/s (n=1)
        worker3: 9.330 wallclock secs (25.708 usr 1.316 sys 27.024 cpu) @ 0.107/s (n=1)
        worker4: 8.221 wallclock secs (28.151 usr 2.676 sys 30.827 cpu) @ 0.122/s (n=1)
        worker5: 7.131 wallclock secs (30.395 usr 3.658 sys 34.053 cpu) @ 0.140/s (n=1)
        worker6: 7.180 wallclock secs (34.496 usr 4.479 sys 38.975 cpu) @ 0.139/s (n=1)
        worker7: 7.050 wallclock secs (38.267 usr 5.453 sys 43.720 cpu) @ 0.142/s (n=1)
        worker8: 6.668 wallclock secs (41.607 usr 5.586 sys 47.194 cpu) @ 0.150/s (n=1)
        worker9: 7.220 wallclock secs (46.762 usr 11.647 sys 58.409 cpu) @ 0.139/s (n=1)
     O---------O----------O---------O
     |         | µs/iter  | worker1 |
     O=========O==========O=========O
     | worker1 | 22125229 | --      |
     | worker2 | 12554094 | 76%     |
     | worker3 | 9329865  | 137%    |
     | worker4 | 8221486  | 169%    |
     | worker5 | 7130758  | 210%    |
     | worker6 | 7180343  | 208%    |
     | worker7 | 7049935  | 214%    |
     | worker8 | 6667794  | 232%    |
     | worker9 | 7219864  | 206%    |
     O---------O----------O---------O
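
For the curious, here is a minimal sketch of the kind of driver that produces
this shape of run. It is not the actual benchmark code, and cpu-burn is just a
hypothetical stand-in for the real workload:

     # Minimal sketch, not the actual benchmark: time N parallel
     # CPU-bound workers started as promises.
     sub cpu-burn {
         my $x = 0e0;
         $x += sqrt($_) for ^5_000_000;   # pure CPU work, no I/O
         $x;
     }

     for 1..9 -> $n {
         my $start = now;
         my @workers = (^$n).map: { start { cpu-burn() } };
         await @workers;
         say "workers=$n wallclock={ (now - $start).fmt('%.3f') }s";
     }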

The plateau is there, but it is reached even before we run out of available
cores: 5 workers already eat all of the CPU power. Yet the speedup achieved
is much less than I'd expected... But then I realized that there is another
player on the field: thermal throttling. And that actually makes any further
measurements on my notebook useless.
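
A quick way to see the throttling itself (reusing the hypothetical cpu-burn
stand-in from the sketch above): repeat the single-worker run back to back
and watch the wallclock drift upward as the CPU heats up. On macOS something
like "pmset -g thermlog" should also report when the CPU speed limit drops.

     # Hedged sketch: back-to-back single-worker runs. A steady upward
     # drift in wallclock as the CPU heats up points at throttling;
     # cache effects would rather show up as run-to-run noise.
     for 1..5 -> $run {
         my $t0 = now;
         cpu-burn();
         say "run $run: { (now - $t0).fmt('%.3f') }s";
     }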

This also answers Parrot's suggestion about possible cache involvement:
that's not it, for sure. Especially if we take into account that the numbers
were roughly the same on every benchmark run.

> On Dec 7, 2018, at 12:04, yary <[email protected]> wrote:
> 
> OK... going back to the hypothesis in the OP
> 
>> The plateau is seemingly defined by the number of cores or, more correctly, 
>> by the number of supported threads.
> 
> This suggests that the benchmark is CPU-bound, which is supported by
> your more recent observation "100% load for a single one"
> 
> Also, you mentioned running MacOS with two threads per core, which
> implies Intel's hyperthreading. Depending on the workload, CPU-bound
> processes sharing a hyperthreaded core see a speedup of 0-30%, as
> opposed to running on separate cores which can give a speedup of 100%.
> (Back when I searched for large primes, HT gave a 25% speed boost.) So
> with 6 cores, 2 HT per core, I would expect a max parallel boost of 6
> * (1x +0.30x) = 7.8x - and your test is only giving half that.
> 
> -y
> 

Best regards,
Vadim Belman
