You're copying a lot of data between processes. Check out SharedArrays. But I still fear that if each "job" is tiny, you may not get as much benefit without further restructuring.
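For concreteness, something like this (a minimal, untested sketch in 0.3 syntax; the loop body is a stand-in for the real gravity update, not a drop-in replacement): a SharedArray lives in memory that every local worker can see, so @parallel writes results in place instead of serializing the array to each worker and discarding the copies.

    addprocs(3)                        # assuming you start from a single process

    nbod = 7
    a = SharedArray(Float64, 3, nbod)  # one block of memory visible to all local workers

    @sync @parallel for i = 1:nbod
        a[1,i] = 1.0*i                 # each worker fills its own columns in place
        a[2,i] = 2.0*i
        a[3,i] = 3.0*i
    end

    println(a[1,:])                    # the master sees the workers' writes

Two caveats: SharedArrays only work among processes on the same machine, and the parallel iterations must not write to the same elements, which your inner j-loop currently does.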
I trust that your "real" workload will take more than 1ms. Otherwise, it's very unlikely that your experiments in parallel programming will end up saving you time :-). --Tim On Wednesday, June 17, 2015 06:37:28 AM Daniel Carrera wrote: > Hi everyone, > > My adventures with parallel programming with Julia continue. Here is a > different issue from other threads: My parallel function is 8300x slower > than my serial function even though I am running on 4 processes on a > multi-core machine. > > julia> nprocs() > 4 > > I have Julia 0.3.8. Here is my program in its entirety (not very long). > > function main() > > nbig::Int16 = 7 > nbod::Int16 = nbig > bod = Float64[ > 0 1 2 3 4 5 6 # x position > 0 0 0 0 0 0 0 # y position > 0 0 0 0 0 0 0 # z position > 0 0 0 0 0 0 0 # x velocity > 0 0 0 0 0 0 0 # y velocity > 0 0 0 0 0 0 0 # z velocity > 1 1 1 1 1 1 1 # Mass > ] > > a = zeros(3,nbod) > > @time for k = 1:1000 > gravity_1!(bod, nbig, nbod, a) > end > println(a[1,:]) > > @time for k = 1:1000 > gravity_2!(bod, nbig, nbod, a) > end > println(a[1,:]) > end > > function gravity_1!(bod, nbig, nbod, a) > > for i = 1:nbod > a[1,i] = 0.0 > a[2,i] = 0.0 > a[3,i] = 0.0 > end > > @inbounds for i = 1:nbig > for j = (i + 1):nbod > > dx = bod[1,j] - bod[1,i] > dy = bod[2,j] - bod[2,i] > dz = bod[3,j] - bod[3,i] > > s_1 = 1.0 / sqrt(dx*dx+dy*dy+dz*dz) > s_3 = s_1 * s_1 * s_1 > > tmp1 = s_3 * bod[7,i] > tmp2 = s_3 * bod[7,j] > > a[1,j] = a[1,j] - tmp1*dx > a[2,j] = a[2,j] - tmp1*dy > a[3,j] = a[3,j] - tmp1*dz > > a[1,i] = a[1,i] + tmp2*dx > a[2,i] = a[2,i] + tmp2*dy > a[3,i] = a[3,i] + tmp2*dz > end > end > return a > end > > function gravity_2!(bod, nbig, nbod, a) > > for i = 1:nbod > a[1,i] = 0.0 > a[2,i] = 0.0 > a[3,i] = 0.0 > end > > @inbounds @sync @parallel for i = 1:nbig > for j = (i + 1):nbod > > dx = bod[1,j] - bod[1,i] > dy = bod[2,j] - bod[2,i] > dz = bod[3,j] - bod[3,i] > > s_1 = 1.0 / sqrt(dx*dx+dy*dy+dz*dz) > s_3 = s_1 * s_1 * s_1 > > tmp1 = s_3 * bod[7,i] > tmp2 = s_3 * bod[7,j] > > a[1,j] = a[1,j] - tmp1*dx > a[2,j] = a[2,j] - tmp1*dy > a[3,j] = a[3,j] - tmp1*dz > > a[1,i] = a[1,i] + tmp2*dx > a[2,i] = a[2,i] + tmp2*dy > a[3,i] = a[3,i] + tmp2*dz > end > end > return a > end > > > > So this is a straight forward N-body gravity calculation. Yes, I realize > that gravity_2!() is wrong, but that's fine. Right now I'm just talking > about the CPU time. When I run this on my computer I get: > > julia> main() > elapsed time: 0.000475294 seconds (0 bytes allocated) > [1.4913888888888889 0.4636111111111111 0.1736111111111111 > -5.551115123125783e-17 -0.17361111111111116 -0.4636111111111112 > -1.4913888888888889] > elapsed time: 3.953546654 seconds (126156320 bytes allocated, 13.49% gc > time) > [0.0 0.0 0.0 0.0 0.0 0.0 0.0] > > > So, the serial version takes 0.000475 seconds and the parallel takes 3.95 > seconds. Furthermore, the parallel version is calling the garbage > collector. I suspect that the problem has something to do with the memory > access. Maybe the parallel code is wasting a lot of time copying variables > in memory. But whatever the reason, this is bad. The documentation says > that @parallel is supposed to be fast, even for very small loops, but > that's not what I'm seeing. A non-buggy implementation will be even slower. > > Have I missed something? Is there an obvious error in how I'm using the > parallel constructs? > > I would appreciate any guidance you may offer. > > Cheers, > Daniel.
