On Monday, 19 February 2018 at 14:57:22 UTC, SrMordred wrote:
On Monday, 19 February 2018 at 05:54:53 UTC, Dmitry Olshansky wrote:
The operation is trivial and the dataset is rather small. In such cases SIMD, e.g. via array ops, is the way to go:
result[] = values[] * values2[];
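
For anyone reading along, here is a minimal self-contained sketch of that array-op form. The helper name `multiply` and the sample numbers are made up for illustration; only the `result[] = values[] * values2[];` line comes from the post.

```d
/// Element-wise product of two equally sized arrays (hypothetical helper).
float[] multiply(const float[] values, const float[] values2)
{
    assert(values.length == values2.length);

    // Array ops write into existing storage and do not allocate,
    // so the destination must already have the right length.
    auto result = new float[values.length];

    // The array op from the post; the compiler may vectorize it.
    result[] = values[] * values2[];
    return result;
}

void main()
{
    // Made-up sample data, not the arrays from the original benchmark.
    auto r = multiply([1.0f, 2.0f, 3.0f, 4.0f], [10.0f, 20.0f, 30.0f, 40.0f]);
    assert(r == [10.0f, 40.0f, 90.0f, 160.0f]);
}
```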


Yes, absolutely right :)

I made a simple example to understand why the threads are not scaling the way I thought they would.

Yeah, the world is an ugly place where trivial math sometimes doesn’t work.

I suggest you:
- run with different numbers of threads, from 1 to n
- vary the sizes from 100k to 10M
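
A rough sketch of such a sweep, assuming the workload is the element-wise multiply discussed earlier in the thread. The names `multiplyChunked` and `timeRun`, the repeat count, and the fill values are all hypothetical; the original benchmark code is not shown here, so treat this as a starting point rather than a reproduction of it.

```d
import core.time : Duration;
import std.algorithm.comparison : min;
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.parallelism : TaskPool, totalCPUs;
import std.range : iota;
import std.stdio : writefln;

/// Splits the multiply into `chunks` contiguous pieces and runs them on `pool`.
void multiplyChunked(TaskPool pool, float[] result,
                     const float[] a, const float[] b, size_t chunks)
{
    immutable chunkSize = (result.length + chunks - 1) / chunks;
    // Work unit size 1: each chunk is handed out as its own work item.
    foreach (c; pool.parallel(iota(chunks), 1))
    {
        immutable lo = min(c * chunkSize, result.length);
        immutable hi = min(lo + chunkSize, result.length);
        result[lo .. hi] = a[lo .. hi] * b[lo .. hi];
    }
}

/// Times `repeats` runs over `n` elements using `threads` threads in total.
Duration timeRun(size_t threads, size_t n, size_t repeats)
{
    auto a = new float[n];
    auto b = new float[n];
    auto result = new float[n];
    a[] = 1.5f; // arbitrary fill values
    b[] = 2.0f;

    // The calling thread also takes part in a parallel foreach,
    // so the pool gets threads - 1 workers.
    auto pool = new TaskPool(threads - 1);
    scope (exit) pool.finish(true);

    auto sw = StopWatch(AutoStart.yes);
    foreach (_; 0 .. repeats)
        multiplyChunked(pool, result, a, b, threads);
    return sw.peek;
}

void main()
{
    enum repeats = 20; // hypothetical repeat count
    foreach (n; [100_000, 1_000_000, 10_000_000])
        foreach (threads; 1 .. totalCPUs + 1)
            writefln("n=%s threads=%s time=%s", n, threads,
                     timeRun(threads, n, repeats));
}
```

The pool is created outside the timed region on purpose, so thread start-up cost does not get mixed into the per-run numbers.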

For your numbers: 400ms / 64 is ~6ms; if we divide by the number of cores it’s 6/7 ~ 0.86ms, which is a good deal smaller than a CPU timeslice.

In essence, a single core runs fast because it doesn’t wait for all the others to complete via join, and it easily burns through its quota in one go. In the MT case I bet some of the overhead comes from not all threads finishing (and starting) at once, so the join blocks in the kernel.

You could run your MT code under strace to see if it hits the futex call or some such; if it does, that’s where you are getting delays (assuming you are on Linux).

The std.parallelism version is a bit faster because, I think, it caches the created thread pool, so you don’t start threads anew on each run.

I imagined that if one core's work is done in 200ms, the same work on 4 cores would be done in 50ms plus some overhead, since they are working on separate blocks of memory, without need of sync and without false sharing, etc. (at least I think I don't have this problem here).

If you had a long queue of small tasks like that and you didn’t wait to join all threads until absolutely required, you would get near-perfect scalability (unless hitting other bottlenecks like RAM).
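
To make that concrete, here is a small sketch, assuming the small tasks are independent multiplies like the one in this thread. The function name `multiplyAll`, the job count, and the job size are invented for illustration. The whole queue goes through a single parallel foreach on the shared `taskPool`, so the caller blocks only once, when the queue has drained, instead of joining after every small job.

```d
import std.parallelism : taskPool;

/// Runs every job in the queue; job i computes results[i] = lhs[i] * rhs[i].
void multiplyAll(float[][] results, const float[][] lhs, const float[][] rhs)
{
    // Work unit size 1: workers pull the next job as soon as they are free.
    // The only wait for all workers happens when this loop ends.
    foreach (i, result; taskPool.parallel(results, 1))
        result[] = lhs[i][] * rhs[i][];
}

void main()
{
    enum jobs = 1_000;     // hypothetical queue length
    enum jobSize = 10_000; // hypothetical elements per job

    auto lhs = new float[][](jobs, jobSize);
    auto rhs = new float[][](jobs, jobSize);
    auto results = new float[][](jobs, jobSize);

    foreach (i; 0 .. jobs)
    {
        lhs[i][] = 2.0f;
        rhs[i][] = cast(float) i;
    }

    multiplyAll(results, lhs, rhs);
    assert(results[3][0] == 6.0f);
}
```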


