Can you arrange the problem so that you send each CPU a few seconds of work? The overhead would become negligible.
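For example, since the timesteps within a single run are sequential, the natural coarse-grained unit is a whole simulation: if you are scanning initial conditions anyway, hand each worker an entire run. A rough, untested sketch -- run_sim and the mass values below are placeholders for your own integrator and parameter scan:

    @everywhere function run_sim(mass)
        # Stand-in for a full serial integration; in your case this
        # would be the loop that calls gravity_1! ~10 billion times.
        s = 0.0
        for k = 1:10^8
            s += 1.0 / sqrt(mass + k)
        end
        return s
    end

    # One whole run per job: each worker gets seconds (or hours) of
    # work, so the scheduling/serialization overhead is negligible.
    results = pmap(run_sim, [0.5, 1.0, 2.0, 4.0])

Each pmap job then costs one remote call no matter how many timesteps it contains.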
-- mb

On Wed, Jun 17, 2015 at 4:38 PM, Daniel Carrera <[email protected]> wrote:
> Sadly, this function is pretty close to the real workload. I do n-body
> simulations of planetary systems. In this problem, the value of "n" is
> small, but the simulation runs for a very long time. So this nested for
> loop will be called maybe 10 billion times in the course of one
> simulation, and that's where all the CPU time goes.
>
> I already have simulation software that works well enough for this. I
> just wanted to experiment with Julia to see if this could be made
> parallel. An irritating problem with all the codes that solve planetary
> systems is that they are all serial -- this problem is apparently hard
> to parallelize.
>
> Cheers,
> Daniel.
>
> On 17 June 2015 at 17:02, Tim Holy <[email protected]> wrote:
>> You're copying a lot of data between processes. Check out SharedArrays.
>> But I still fear that if each "job" is tiny, you may not get as much
>> benefit without further restructuring.
>>
>> I trust that your "real" workload will take more than 1ms. Otherwise,
>> it's very unlikely that your experiments in parallel programming will
>> end up saving you time :-).
>>
>> --Tim
>>
>> On Wednesday, June 17, 2015 06:37:28 AM Daniel Carrera wrote:
>> > Hi everyone,
>> >
>> > My adventures with parallel programming in Julia continue. Here is a
>> > different issue from other threads: my parallel function is 8300x
>> > slower than my serial function, even though I am running on 4
>> > processes on a multi-core machine.
>> >
>> > julia> nprocs()
>> > 4
>> >
>> > I have Julia 0.3.8. Here is my program in its entirety (not very
>> > long).
>> >
>> > function main()
>> >     nbig::Int16 = 7
>> >     nbod::Int16 = nbig
>> >     bod = Float64[
>> >         0 1 2 3 4 5 6   # x position
>> >         0 0 0 0 0 0 0   # y position
>> >         0 0 0 0 0 0 0   # z position
>> >         0 0 0 0 0 0 0   # x velocity
>> >         0 0 0 0 0 0 0   # y velocity
>> >         0 0 0 0 0 0 0   # z velocity
>> >         1 1 1 1 1 1 1   # Mass
>> >     ]
>> >
>> >     a = zeros(3, nbod)
>> >
>> >     @time for k = 1:1000
>> >         gravity_1!(bod, nbig, nbod, a)
>> >     end
>> >     println(a[1,:])
>> >
>> >     @time for k = 1:1000
>> >         gravity_2!(bod, nbig, nbod, a)
>> >     end
>> >     println(a[1,:])
>> > end
>> >
>> > function gravity_1!(bod, nbig, nbod, a)
>> >     for i = 1:nbod
>> >         a[1,i] = 0.0
>> >         a[2,i] = 0.0
>> >         a[3,i] = 0.0
>> >     end
>> >
>> >     @inbounds for i = 1:nbig
>> >         for j = (i + 1):nbod
>> >             dx = bod[1,j] - bod[1,i]
>> >             dy = bod[2,j] - bod[2,i]
>> >             dz = bod[3,j] - bod[3,i]
>> >
>> >             s_1 = 1.0 / sqrt(dx*dx + dy*dy + dz*dz)
>> >             s_3 = s_1 * s_1 * s_1
>> >
>> >             tmp1 = s_3 * bod[7,i]
>> >             tmp2 = s_3 * bod[7,j]
>> >
>> >             a[1,j] = a[1,j] - tmp1*dx
>> >             a[2,j] = a[2,j] - tmp1*dy
>> >             a[3,j] = a[3,j] - tmp1*dz
>> >
>> >             a[1,i] = a[1,i] + tmp2*dx
>> >             a[2,i] = a[2,i] + tmp2*dy
>> >             a[3,i] = a[3,i] + tmp2*dz
>> >         end
>> >     end
>> >     return a
>> > end
>> >
>> > function gravity_2!(bod, nbig, nbod, a)
>> >     for i = 1:nbod
>> >         a[1,i] = 0.0
>> >         a[2,i] = 0.0
>> >         a[3,i] = 0.0
>> >     end
>> >
>> >     @inbounds @sync @parallel for i = 1:nbig
>> >         for j = (i + 1):nbod
>> >             dx = bod[1,j] - bod[1,i]
>> >             dy = bod[2,j] - bod[2,i]
>> >             dz = bod[3,j] - bod[3,i]
>> >
>> >             s_1 = 1.0 / sqrt(dx*dx + dy*dy + dz*dz)
>> >             s_3 = s_1 * s_1 * s_1
>> >
>> >             tmp1 = s_3 * bod[7,i]
>> >             tmp2 = s_3 * bod[7,j]
>> >
>> >             a[1,j] = a[1,j] - tmp1*dx
>> >             a[2,j] = a[2,j] - tmp1*dy
>> >             a[3,j] = a[3,j] - tmp1*dz
>> >
>> >             a[1,i] = a[1,i] + tmp2*dx
>> >             a[2,i] = a[2,i] + tmp2*dy
>> >             a[3,i] = a[3,i] + tmp2*dz
>> >         end
>> >     end
>> >     return a
>> > end
>> >
>> > So this is a straightforward N-body gravity calculation. Yes, I
>> > realize that gravity_2!() is wrong, but that's fine. Right now I'm
>> > just talking about the CPU time. When I run this on my computer I
>> > get:
>> >
>> > julia> main()
>> > elapsed time: 0.000475294 seconds (0 bytes allocated)
>> > [1.4913888888888889 0.4636111111111111 0.1736111111111111
>> > -5.551115123125783e-17 -0.17361111111111116 -0.4636111111111112
>> > -1.4913888888888889]
>> > elapsed time: 3.953546654 seconds (126156320 bytes allocated, 13.49%
>> > gc time)
>> > [0.0 0.0 0.0 0.0 0.0 0.0 0.0]
>> >
>> > So the serial version takes 0.000475 seconds and the parallel version
>> > takes 3.95 seconds. Furthermore, the parallel version is calling the
>> > garbage collector. I suspect that the problem has something to do
>> > with memory access: maybe the parallel code is wasting a lot of time
>> > copying variables in memory. But whatever the reason, this is bad.
>> > The documentation says that @parallel is supposed to be fast, even
>> > for very small loops, but that's not what I'm seeing. A non-buggy
>> > implementation would be even slower.
>> >
>> > Have I missed something? Is there an obvious error in how I'm using
>> > the parallel constructs?
>> >
>> > I would appreciate any guidance you may offer.
>> >
>> > Cheers,
>> > Daniel.
>
> --
> When an engineer says that something can't be done, it's a code phrase
> that means it's not fun to do.
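P.S. On Tim's SharedArrays suggestion: the minimal pattern, roughly (an untested sketch, Julia 0.3 syntax), is

    # One shared buffer; workers map the same memory instead of
    # receiving serialized copies.
    a = SharedArray(Float64, 3, 7)
    @sync @parallel for i = 1:7
        a[1,i] = i    # each worker writes only its own column
    end

That removes the copying, and it also explains your [0.0 ...] result: with a plain Array, each worker mutates its own copy of `a` and nothing comes back. But note that with a shared `a` your pair loop would race, since iteration i also writes into column j, and with n = 7 each force pass is still far too small a job to pay for the remote calls.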
