I've parallelized some code with @threads, but instead of a speedup by a factor of NCPUs (8 in my case), I'm seeing a bit under a factor of 2. I suspect the bottleneck isn't computation but memory access. However, while the code runs I see CPU usage at 100% on all 8 cores; would that still be the case if memory access were the bottleneck? If yes, then memory access is likely the culprit, but is there some way to confirm this? If no, how do I figure out what *is* the culprit?
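One thing I've tried to probe this (my own sketch, not something I'm sure is the right diagnostic): check whether the hot expression allocates, since per-iteration allocation and the resulting GC pressure are a common reason @threads scales poorly. The snippet below mirrors the inner trace expression from my code further down, with `one_iter` and the small dimensions being placeholders I made up for the test:

```julia
using LinearAlgebra

# Mimic one inner-loop iteration of the code below. Each [:,:,l] slice
# copies a 3x3 matrix, and each * allocates a temporary, so allocations
# per iteration should be well above zero.
function one_iter(inv_cl, d_cl, l)
    tr(inv_cl[:,:,l] * d_cl[1][:,:,l] * inv_cl[:,:,l] * d_cl[2][:,:,l])
end

inv_cl = ones(3, 3, 10)
d_cl = Dict(i => ones(3, 3, 10) for i in 1:2)
one_iter(inv_cl, d_cl, 1)                          # warm up (compile first)
bytes = @allocated one_iter(inv_cl, d_cl, 1)
println("bytes allocated per iteration: ", bytes)
```

If that number is large, GC contention between threads could explain the poor scaling even with all cores at 100%.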
Here's a stripped-down version of my code:
using LinearAlgebra   # `trace` was renamed `tr` in Julia 0.7+

function test(nl, np)
    inv_cl = ones(3, 3, nl)
    d_cl = Dict(i => ones(3, 3, nl) for i in 1:np)
    fish = zeros(np, np)
    ijs = [(i, j) for i in 1:np, j in 1:np]
    Threads.@threads for ij in ijs
        i, j = ij
        for l in 1:nl
            fish[i, j] += (2l + 1) / 2 *
                tr(inv_cl[:,:,l] * d_cl[i][:,:,l] * inv_cl[:,:,l] * d_cl[j][:,:,l])
        end
    end
    return fish
end
# with the @threads
@timeit test(3000,40)
1 loops, best of 3: 3.17 s per loop
# now remove the @threads from above
@timeit test(3000,40)
1 loops, best of 3: 4.42 s per loop
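For reference, here's a variant I've been experimenting with (a sketch under my own assumptions, not necessarily the right fix): `@views` avoids copying each 3×3 slice, and accumulating into a thread-local variable instead of writing `fish[i,j]` repeatedly should cut down on both allocation and shared-array traffic.

```julia
using LinearAlgebra

# Allocation-lighter variant of test(): @views makes each [:,:,l] slice a
# view instead of a copy; only the matrix products still allocate.
function test_views(nl, np)
    inv_cl = ones(3, 3, nl)
    d_cl = Dict(i => ones(3, 3, nl) for i in 1:np)
    fish = zeros(np, np)
    ijs = vec([(i, j) for i in 1:np, j in 1:np])
    Threads.@threads for ij in ijs
        i, j = ij
        acc = 0.0                       # thread-local accumulator
        @views for l in 1:nl
            acc += (2l + 1) / 2 *
                tr(inv_cl[:,:,l] * d_cl[i][:,:,l] * inv_cl[:,:,l] * d_cl[j][:,:,l])
        end
        fish[i, j] = acc
    end
    return fish
end

fish = test_views(10, 4)
```

With all-ones inputs each trace is 81, so every entry of `fish` comes out to 81 * sum((2l+1)/2 for l in 1:nl); whether this actually scales better on 8 cores is what I'd want to measure.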
Thanks.