On Monday, 19 February 2018 at 05:49:54 UTC, Nicholas Wilson wrote:
As SIZE=1024*1024 (i.e. not much, possibly well within L2 cache for 32bit) it may be that dealing with the concurrency overhead adds a significant amount of overhead.

That 'concurrency overhead' is what I'm not getting.
Since the array is big, splitting it across 6 or 7 cores should not thrash L1, because the chunks are very far apart from each other, right? Or is L2 cache thrashing also a problem in this case?
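
To put rough numbers on that reasoning, here is a minimal sketch (Python, just for the arithmetic) of how far apart the per-core chunks would sit. The element size (4 bytes), cache-line size (64 bytes), core count (6), and a line-aligned array start are all assumptions, not things measured on my machine, and `chunk_bounds` / `shared_lines` are hypothetical helper names.

```python
SIZE = 1024 * 1024   # element count, from the quoted post
ELEM_BYTES = 4       # assumption: 32-bit elements
CACHE_LINE = 64      # assumption: typical x86-64 cache-line size in bytes
CORES = 6            # assumption: one contiguous chunk per core


def chunk_bounds(size, parts):
    """Split range(size) into `parts` contiguous, near-equal [start, end) chunks."""
    base, extra = divmod(size, parts)
    bounds, start = [], 0
    for i in range(parts):
        end = start + base + (1 if i < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds


def shared_lines(bounds, elem_bytes=ELEM_BYTES, line=CACHE_LINE):
    """Count chunk boundaries that fall inside a cache line.

    Assumes the array starts on a cache-line boundary. Each such boundary
    is one line that two neighbouring cores could both write to.
    """
    return sum(1 for (_, end) in bounds[:-1] if (end * elem_bytes) % line != 0)


bounds = chunk_bounds(SIZE, CORES)
chunk_kib = (bounds[0][1] - bounds[0][0]) * ELEM_BYTES / 1024

print(f"chunk size: about {chunk_kib:.0f} KiB per core")
print(f"cache lines shared between neighbouring chunks: {shared_lines(bounds)}")
```

Under these assumptions, only the few lines at the chunk boundaries can be touched by two cores, so false sharing alone seems unlikely to account for the numbers below.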

_base : 150 ms, 728 μs, and 5 hnsecs
_parallel : 120 ms, 78 μs, and 5 hnsecs
_concurrency : 134 ms, 787 μs, and 4 hnsecs
_thread : 129 ms, 476 μs, and 2 hnsecs


Yes, on my PC I was using -release.

Yet that is 150 ms on 1 core versus 120-134 ms on X cores.
Shouldn't it be way faster? I'm trying to understand where the overhead is, and whether it is possible to get rid of it (perfect thread scaling).
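
To make "way faster" concrete, here is a small sketch (again Python, only for the arithmetic) that turns the timings above into speedup and per-core efficiency, and inverts Amdahl's law to see what parallel fraction those numbers would imply. The core count of 6 is an assumption, and Amdahl's model is only a rough fit if the loop is actually limited by memory bandwidth rather than by serial code.

```python
def to_ms(ms, us, hnsecs):
    """Convert an 'X ms, Y μs, and Z hnsecs' reading to milliseconds (1 hnsec = 100 ns)."""
    return ms + us / 1_000 + hnsecs / 10_000


# Timings copied from the measurements above.
timings = {
    "_base": to_ms(150, 728, 5),
    "_parallel": to_ms(120, 78, 5),
    "_concurrency": to_ms(134, 787, 4),
    "_thread": to_ms(129, 476, 2),
}

ASSUMED_CORES = 6  # assumption: the actual core count is not stated


def speedup(name):
    """Speedup of variant `name` relative to the single-core baseline."""
    return timings["_base"] / timings[name]


def efficiency(name, cores=ASSUMED_CORES):
    """Fraction of perfect linear scaling achieved on `cores` cores."""
    return speedup(name) / cores


def amdahl_parallel_fraction(s, n):
    """Invert Amdahl's law S = 1 / ((1 - p) + p / n) to get p from S and n."""
    return (1 - 1 / s) / (1 - 1 / n)


for name in ("_parallel", "_concurrency", "_thread"):
    s = speedup(name)
    p = amdahl_parallel_fraction(s, ASSUMED_CORES)
    print(f"{name}: speedup {s:.2f}x, efficiency {efficiency(name):.0%}, "
          f"implied parallel fraction {p:.0%}")
```

Perfect scaling on 6 cores would be a 6x speedup, while the best variant here comes out at only about a quarter faster than the baseline.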
