Can you arrange the problem so that you send each CPU a few seconds of work? The overhead would become negligible.
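For example, since the timesteps within a single run are sequential, the natural coarse-grained unit is a whole simulation: if you are scanning initial conditions anyway, hand each worker an entire run. A rough, untested sketch -- run_sim and the mass values below are placeholders for your own integrator and parameter scan:

    @everywhere function run_sim(mass)
        # Stand-in for a full serial integration; in your case this
        # would be the loop that calls gravity_1! ~10 billion times.
        s = 0.0
        for k = 1:10^8
            s += 1.0 / sqrt(mass + k)
        end
        return s
    end

    # One whole run per job: each worker gets seconds (or hours) of
    # work, so the scheduling/serialization overhead is negligible.
    results = pmap(run_sim, [0.5, 1.0, 2.0, 4.0])

Each pmap job then costs one remote call no matter how many timesteps it contains.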
-- mb

On Wed, Jun 17, 2015 at 4:38 PM, Daniel Carrera <[email protected]> wrote:
> Sadly, this function is pretty close to the real workload. I do n-body
> simulations of planetary systems. In this problem, the value of "n" is
> small, but the simulation runs for a very long time. So this nested for
> loop will be called maybe 10 billion times in the course of one
> simulation, and that's where all the CPU time goes.
>
> I already have simulation software that works well enough for this. I
> just wanted to experiment with Julia to see if this could be made
> parallel. An irritating problem with all the codes that solve planetary
> systems is that they are all serial -- this problem is apparently hard
> to parallelize.
>
> Cheers,
> Daniel.
>
> On 17 June 2015 at 17:02, Tim Holy <[email protected]> wrote:
>> You're copying a lot of data between processes. Check out SharedArrays.
>> But I still fear that if each "job" is tiny, you may not get as much
>> benefit without further restructuring.
>>
>> I trust that your "real" workload will take more than 1ms. Otherwise,
>> it's very unlikely that your experiments in parallel programming will
>> end up saving you time :-).
>>
>> --Tim
>>
>> On Wednesday, June 17, 2015 06:37:28 AM Daniel Carrera wrote:
>> > Hi everyone,
>> >
>> > My adventures with parallel programming in Julia continue. Here is a
>> > different issue from other threads: my parallel function is 8300x
>> > slower than my serial function, even though I am running on 4
>> > processes on a multi-core machine.
>> >
>> > julia> nprocs()
>> > 4
>> >
>> > I have Julia 0.3.8. Here is my program in its entirety (not very
>> > long).
>> >
>> > function main()
>> >     nbig::Int16 = 7
>> >     nbod::Int16 = nbig
>> >     bod = Float64[
>> >         0 1 2 3 4 5 6   # x position
>> >         0 0 0 0 0 0 0   # y position
>> >         0 0 0 0 0 0 0   # z position
>> >         0 0 0 0 0 0 0   # x velocity
>> >         0 0 0 0 0 0 0   # y velocity
>> >         0 0 0 0 0 0 0   # z velocity
>> >         1 1 1 1 1 1 1   # Mass
>> >     ]
>> >
>> >     a = zeros(3, nbod)
>> >
>> >     @time for k = 1:1000
>> >         gravity_1!(bod, nbig, nbod, a)
>> >     end
>> >     println(a[1,:])
>> >
>> >     @time for k = 1:1000
>> >         gravity_2!(bod, nbig, nbod, a)
>> >     end
>> >     println(a[1,:])
>> > end
>> >
>> > function gravity_1!(bod, nbig, nbod, a)
>> >     for i = 1:nbod
>> >         a[1,i] = 0.0
>> >         a[2,i] = 0.0
>> >         a[3,i] = 0.0
>> >     end
>> >
>> >     @inbounds for i = 1:nbig
>> >         for j = (i + 1):nbod
>> >             dx = bod[1,j] - bod[1,i]
>> >             dy = bod[2,j] - bod[2,i]
>> >             dz = bod[3,j] - bod[3,i]
>> >
>> >             s_1 = 1.0 / sqrt(dx*dx + dy*dy + dz*dz)
>> >             s_3 = s_1 * s_1 * s_1
>> >
>> >             tmp1 = s_3 * bod[7,i]
>> >             tmp2 = s_3 * bod[7,j]
>> >
>> >             a[1,j] = a[1,j] - tmp1*dx
>> >             a[2,j] = a[2,j] - tmp1*dy
>> >             a[3,j] = a[3,j] - tmp1*dz
>> >
>> >             a[1,i] = a[1,i] + tmp2*dx
>> >             a[2,i] = a[2,i] + tmp2*dy
>> >             a[3,i] = a[3,i] + tmp2*dz
>> >         end
>> >     end
>> >     return a
>> > end
>> >
>> > function gravity_2!(bod, nbig, nbod, a)
>> >     for i = 1:nbod
>> >         a[1,i] = 0.0
>> >         a[2,i] = 0.0
>> >         a[3,i] = 0.0
>> >     end
>> >
>> >     @inbounds @sync @parallel for i = 1:nbig
>> >         for j = (i + 1):nbod
>> >             dx = bod[1,j] - bod[1,i]
>> >             dy = bod[2,j] - bod[2,i]
>> >             dz = bod[3,j] - bod[3,i]
>> >
>> >             s_1 = 1.0 / sqrt(dx*dx + dy*dy + dz*dz)
>> >             s_3 = s_1 * s_1 * s_1
>> >
>> >             tmp1 = s_3 * bod[7,i]
>> >             tmp2 = s_3 * bod[7,j]
>> >
>> >             a[1,j] = a[1,j] - tmp1*dx
>> >             a[2,j] = a[2,j] - tmp1*dy
>> >             a[3,j] = a[3,j] - tmp1*dz
>> >
>> >             a[1,i] = a[1,i] + tmp2*dx
>> >             a[2,i] = a[2,i] + tmp2*dy
>> >             a[3,i] = a[3,i] + tmp2*dz
>> >         end
>> >     end
>> >     return a
>> > end
>> >
>> > So this is a straightforward N-body gravity calculation. Yes, I
>> > realize that gravity_2!() is wrong, but that's fine. Right now I'm
>> > just talking about the CPU time. When I run this on my computer I
>> > get:
>> >
>> > julia> main()
>> > elapsed time: 0.000475294 seconds (0 bytes allocated)
>> > [1.4913888888888889 0.4636111111111111 0.1736111111111111
>> > -5.551115123125783e-17 -0.17361111111111116 -0.4636111111111112
>> > -1.4913888888888889]
>> > elapsed time: 3.953546654 seconds (126156320 bytes allocated, 13.49%
>> > gc time)
>> > [0.0 0.0 0.0 0.0 0.0 0.0 0.0]
>> >
>> > So the serial version takes 0.000475 seconds and the parallel version
>> > takes 3.95 seconds. Furthermore, the parallel version is calling the
>> > garbage collector. I suspect that the problem has something to do
>> > with memory access: maybe the parallel code is wasting a lot of time
>> > copying variables in memory. But whatever the reason, this is bad.
>> > The documentation says that @parallel is supposed to be fast, even
>> > for very small loops, but that's not what I'm seeing. A non-buggy
>> > implementation would be even slower.
>> >
>> > Have I missed something? Is there an obvious error in how I'm using
>> > the parallel constructs?
>> >
>> > I would appreciate any guidance you may offer.
>> >
>> > Cheers,
>> > Daniel.
>
> --
> When an engineer says that something can't be done, it's a code phrase
> that means it's not fun to do.
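P.S. On Tim's SharedArrays suggestion: the minimal pattern, roughly (an untested sketch, Julia 0.3 syntax), is

    # One shared buffer; workers map the same memory instead of
    # receiving serialized copies.
    a = SharedArray(Float64, 3, 7)
    @sync @parallel for i = 1:7
        a[1,i] = i    # each worker writes only its own column
    end

That removes the copying, and it also explains your [0.0 ...] result: with a plain Array, each worker mutates its own copy of `a` and nothing comes back. But note that with a shared `a` your pair loop would race, since iteration i also writes into column j, and with n = 7 each force pass is still far too small a job to pay for the remote calls.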
