On Monday, 19 February 2018 at 05:49:54 UTC, Nicholas Wilson
wrote:
As SIZE=1024*1024 (i.e. not much, possibly well within L2 cache
for 32bit) it may be that dealing with the concurrency overhead
adds a significant amount of overhead.
That 'concurrency overhead' is what I'm not getting.
Since the array is big, dividing it among 6 or 7 cores should not
thrash L1, since the chunks are very far from each other, right?
Or is L2 cache thrashing also a problem in this case?
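To be concrete about what I mean by dividing the array, here is a minimal sketch. It assumes `int` elements and a trivial per-element update; both are placeholders, as is the helper name `chunkLength`, and none of this is the benchmark itself. Each worker gets one contiguous chunk, so two cores should only ever touch neighbouring memory at a chunk boundary.

```d
import std.algorithm : all;
import std.parallelism : parallel, totalCPUs;
import std.range : chunks;

enum SIZE = 1024 * 1024; // same element count as in the quoted post

// Hypothetical helper: length of one contiguous chunk per worker, rounded up.
size_t chunkLength(size_t n, size_t workers)
{
    return (n + workers - 1) / workers;
}

void main()
{
    auto data = new int[](SIZE); // element type is an assumption

    immutable len = chunkLength(data.length, totalCPUs);

    // A work unit size of 1 hands each task exactly one chunk.
    foreach (chunk; parallel(data.chunks(len), 1))
    {
        foreach (ref x; chunk)
            x += 1; // placeholder work
    }

    assert(data.all!(x => x == 1));
}
```

With chunks this large, any sharing of a cache line between cores is limited to the few elements at each boundary.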
_base : 150 ms, 728 μs, and 5 hnsecs
_parallel : 120 ms, 78 μs, and 5 hnsecs
_concurrency : 134 ms, 787 μs, and 4 hnsecs
_thread : 129 ms, 476 μs, and 2 hnsecs
Yes, on my PC I was using -release.
Yet: 150 ms on 1 core versus 120-134 ms on X cores.
Shouldn't it be way faster? I'm trying to understand where the
overhead is, and whether it is possible to get rid of it (perfect
thread scaling).
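For reference, the speedup implied by the timings above can be worked out with a small sketch like the following. The function name `speedup` is just for illustration; the durations are copied from the results listed earlier.

```d
import core.time : Duration, hnsecs, msecs, usecs;
import std.stdio : writefln;

// Ratio of the single-threaded time to another measured time.
double speedup(Duration base, Duration other)
{
    return cast(double) base.total!"hnsecs" / other.total!"hnsecs";
}

void main()
{
    // Timings copied from the measurements in this post.
    immutable base        = 150.msecs + 728.usecs + 5.hnsecs;
    immutable parallel_   = 120.msecs +  78.usecs + 5.hnsecs;
    immutable concurrency = 134.msecs + 787.usecs + 4.hnsecs;
    immutable thread_     = 129.msecs + 476.usecs + 2.hnsecs;

    writefln("%-13s %.2fx", "_parallel", speedup(base, parallel_));
    writefln("%-13s %.2fx", "_concurrency", speedup(base, concurrency));
    writefln("%-13s %.2fx", "_thread", speedup(base, thread_));
}
```

All three ratios land between roughly 1.1x and 1.3x, whereas perfect scaling on N cores would give a ratio close to N.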