On Monday, 19 February 2018 at 14:57:22 UTC, SrMordred wrote:
On Monday, 19 February 2018 at 05:54:53 UTC, Dmitry Olshansky wrote:
The operation is trivial and dataset is rather small. In such
cases SIMD with eg array ops is the way to go:
result[] = values[] * values2[];
Yes, absolutely right :)
I made a simple example to understand why the threads were not
scaling the way I thought they would.
Yeah, the world is an ugly place where trivial math sometimes
doesn't parallelize the way you'd expect.
I suggest you:
- run with different number of threads from 1 to n
- vary sizes from 100k to 10m
For your numbers: 400ms / 64 is ~6ms; if we divide by # of cores
it's 6/7 ~ 0.86ms, which is a good deal smaller than a CPU timeslice.
In essence, a single core runs fast b/c it doesn't wait for all the
others to complete via join, easily burning its quota in one go.
In MT I bet some of the overhead comes from not all threads finishing
(and starting) at once, so the join blocks in the kernel.
You could run your MT code with strace to see if it hits the
futex call or some such; if it does, that's where you are getting
delays. (That's assuming you are on Linux.)
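For reference, an invocation along these lines (assuming Linux and a hypothetical binary name ./bench) would show the futex traffic:

```shell
# Follow all threads (-f) and log only futex calls:
strace -f -e trace=futex ./bench

# Or get an aggregate count/time summary per syscall instead of a full log:
strace -f -c -e trace=futex ./bench
```

Lots of futex calls clustered around the join point would confirm that threads are sleeping in the kernel waiting on each other.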
The std.parallelism version is a bit faster b/c I think it caches the
created thread pool, so you don't start threads anew on each run.
I imagine that, if one core's work is done in 200ms, a 4-core run
will be done in 50ms plus some overhead, since they are
working on separate blocks of memory, without need of sync, and
without false sharing, etc. (at least I think I don't have this
problem here)
If you had a long queue of small tasks like that and you don't
wait to join all threads until absolutely required, you get near
perfect scalability (unless hitting other bottlenecks like RAM).