You're damn right here. First of all, I must admit that I misinterpreted the
benchmark results (guilty). Still, I think I now know what's really happening
here. To make things clear I re-ran the benchmark for every worker count
from 1 to 9. Here is the cleaned-up output:
Timing 1 iterations of worker1, worker2, worker3, worker4, worker5,
worker6, worker7, worker8, worker9...
worker1: 22.125 wallclock secs (22.296 usr 0.248 sys 22.544 cpu) @
0.045/s (n=1)
worker2: 12.554 wallclock secs (24.221 usr 0.715 sys 24.936 cpu) @
0.080/s (n=1)
worker3: 9.330 wallclock secs (25.708 usr 1.316 sys 27.024 cpu) @
0.107/s (n=1)
worker4: 8.221 wallclock secs (28.151 usr 2.676 sys 30.827 cpu) @
0.122/s (n=1)
worker5: 7.131 wallclock secs (30.395 usr 3.658 sys 34.053 cpu) @
0.140/s (n=1)
worker6: 7.180 wallclock secs (34.496 usr 4.479 sys 38.975 cpu) @
0.139/s (n=1)
worker7: 7.050 wallclock secs (38.267 usr 5.453 sys 43.720 cpu) @
0.142/s (n=1)
worker8: 6.668 wallclock secs (41.607 usr 5.586 sys 47.194 cpu) @
0.150/s (n=1)
worker9: 7.220 wallclock secs (46.762 usr 11.647 sys 58.409 cpu) @
0.139/s (n=1)
O---------O----------O---------O
|         | µs/iter  | worker1 |
O=========O==========O=========O
| worker1 | 22125229 | -- |
| worker2 | 12554094 | 76% |
| worker3 | 9329865 | 137% |
| worker4 | 8221486 | 169% |
| worker5 | 7130758 | 210% |
| worker6 | 7180343 | 208% |
| worker7 | 7049935 | 214% |
| worker8 | 6667794 | 232% |
| worker9 | 7219864 | 206% |
O---------O----------O---------O
The plateau is there, but it's reached even before we run out of available
cores: 5 workers already take all of the CPU power. Yet the speedup achieved is
much less than I'd expected... But then I realized that there is another player
on the field: thermal throttling. And that actually makes any further
measurements on my notebook useless.
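One generic way to see throttling kick in (a sketch, not tied to any OS-specific tool) is to time the same CPU-bound loop repeatedly on an otherwise idle machine and watch the per-run time creep up as the package heats:

```python
import time

def spin(n=5_000_000):
    # Pure-CPU busy loop, long enough to generate sustained load.
    s = 0
    for i in range(n):
        s += i * i
    return s

# On a machine that thermal-throttles, later runs take visibly longer
# than the first ones even though the work is identical.
for run in range(5):
    t0 = time.perf_counter()
    spin()
    print(f"run {run}: {time.perf_counter() - t0:.3f}s")
```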
This also answers Parrot's suggestion about possible cache involvement: that's
not it, for sure. Especially if we take into account that the numbers were
roughly the same on every benchmark run.
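For reference, the per-worker-count speedup relative to a single worker can be computed straight from the wallclock column above (a quick sketch, numbers copied from the runs):

```python
# Wallclock seconds per worker count, taken from the benchmark output above.
wallclock = {
    1: 22.125, 2: 12.554, 3: 9.330, 4: 8.221, 5: 7.131,
    6: 7.180, 7: 7.050, 8: 6.668, 9: 7.220,
}

# Speedup of N workers relative to the single-worker baseline.
speedup = {n: wallclock[1] / t for n, t in wallclock.items()}

for n, s in sorted(speedup.items()):
    print(f"{n} workers: {s:.2f}x")
```

The curve flattens out around 3x from 5 workers on, well short of the roughly 7.8x ceiling discussed below, which is what points at throttling rather than core exhaustion.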
> On Dec 7, 2018, at 12:04, yary <[email protected]> wrote:
>
> OK... going back to the hypothesis in the OP
>
>> The plateau is seemingly defined by the number of cores or, more correctly,
>> by the number of supported threads.
>
> This suggests that the benchmark is CPU-bound, which is supported by
> your more recent observation "100% load for a single one"
>
> Also, you mentioned running MacOS with two threads per core, which
> implies Intel's hyperthreading. Depending on the workload, CPU-bound
> processes sharing a hyperthreaded core see a speedup of 0-30%, as
> opposed to running on separate cores which can give a speedup of 100%.
> (Back when I searched for large primes, HT gave a 25% speed boost.) So
> with 6 cores, 2 HT per core, I would expect a max parallel boost of 6
> * (1x +0.30x) = 7.8x - and your test is only giving half that.
>
> -y
>
Best regards,
Vadim Belman