Hi,

I've just seen a very strange (to me) performance difference for
exactly the same code on slightly different input, with no explicit
branches.

The code is available here[1]. The most relevant part is the following
function (all other parts of the code are for initialization and
benchmarking). This is a simplified version of my simulation that
computes the next column of the array from the previous one.

The strange part is that the performance of this function can differ
by 10x depending on the value of the scaling factor (`eΓ`, the only use
of which is marked in the code below), even though I don't see any
branches that depend on that value in the relevant code (unless the
CPU is 10x less efficient for certain input values).

function propagate(P, ψ0, ψs, eΓ)
    @inbounds for i in 1:P.nele
        ψs[1, i, 1] = ψ0[1, i]
        ψs[2, i, 1] = ψ0[2, i]
    end
    T12 = im * sin(P.Ω)
    T11 = cos(P.Ω)
    @inbounds for i in 2:(P.nstep + 1)
        for j in 1:P.nele
            ψ_e = ψs[1, j, i - 1]
            ψ_g = ψs[2, j, i - 1] * eΓ # <---- Scaling factor
            ψs[2, j, i] = T11 * ψ_e + T12 * ψ_g
            ψs[1, j, i] = T11 * ψ_g + T12 * ψ_e
        end
    end
    ψs
end

The output of the full script is included below, and it can be clearly
seen that for scaling factors 0.6-0.8 the function takes roughly
200-330 milliseconds, versus 16-24 milliseconds for the other values.
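
For anyone who wants to reproduce the timings without opening the
link, here is a minimal sketch of such a timing loop. It is written in
current Julia syntax, and the `Params` type, the array sizes, and the
initial values are all assumptions made for illustration; the actual
script at [1] may differ.

```julia
# Hypothetical parameter container; the field names follow the way
# `P` is used inside `propagate`.
struct Params
    nele::Int
    nstep::Int
    Ω::Float64
end

# Same function as in the post above.
function propagate(P, ψ0, ψs, eΓ)
    @inbounds for i in 1:P.nele
        ψs[1, i, 1] = ψ0[1, i]
        ψs[2, i, 1] = ψ0[2, i]
    end
    T12 = im * sin(P.Ω)
    T11 = cos(P.Ω)
    @inbounds for i in 2:(P.nstep + 1)
        for j in 1:P.nele
            ψ_e = ψs[1, j, i - 1]
            ψ_g = ψs[2, j, i - 1] * eΓ
            ψs[2, j, i] = T11 * ψ_e + T12 * ψ_g
            ψs[1, j, i] = T11 * ψ_g + T12 * ψ_e
        end
    end
    ψs
end

# Time one call per scaling factor; returns (eΓ, seconds) pairs.
function bench(P; factors = 0.0:0.1:1.0)
    ψ0 = fill(complex(1.0, 0.0), 2, P.nele)        # assumed initial state
    ψs = zeros(ComplexF64, 2, P.nele, P.nstep + 1)
    propagate(P, ψ0, ψs, 0.5)                      # warm-up, so compilation is not timed
    results = Tuple{Float64,Float64}[]
    for eΓ in factors
        t = @elapsed propagate(P, ψ0, ψs, eΓ)
        push!(results, (eΓ, t))
    end
    results
end

results = bench(Params(64, 2000, 0.1))             # sizes chosen arbitrarily
for (eΓ, t) in results
    println(eΓ, "  ", round(t * 1e3, digits = 3), " ms")
end
```

Since the slowdown also seems to depend on the initial value, the
assumed `ψ0` above may or may not trigger it.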

The assembly[2] and LLVM IR[3] of this function are also in the same
repo. I see the same behavior on both Julia 0.3 and 0.4, with LLVM 3.3
and LLVM 3.6, on two different x86_64 machines (my laptop and a Linode
VPS). The only platform I've tried that doesn't show similar behavior
is Julia 0.4 running on qemu-arm, although even there the performance
for different values differs by ~30%, which is bigger than the noise.

This also seems to depend on the initial value.

Has anyone seen similar problems before?

Outputs:

325.821 milliseconds (25383 allocations: 1159 KB)
307.826 milliseconds (4 allocations: 144 bytes)
0.0
 19.227 milliseconds (2 allocations: 48 bytes)
0.1
 17.291 milliseconds (2 allocations: 48 bytes)
0.2
 17.404 milliseconds (2 allocations: 48 bytes)
0.3
 19.231 milliseconds (2 allocations: 48 bytes)
0.4
 20.278 milliseconds (2 allocations: 48 bytes)
0.5
 23.692 milliseconds (2 allocations: 48 bytes)
0.6
328.107 milliseconds (2 allocations: 48 bytes)
0.7
312.425 milliseconds (2 allocations: 48 bytes)
0.8
201.494 milliseconds (2 allocations: 48 bytes)
0.9
 16.314 milliseconds (2 allocations: 48 bytes)
1.0
 16.264 milliseconds (2 allocations: 48 bytes)


[1] 
https://github.com/yuyichao/explore/blob/e4be0151df33571c1c22f54fe044c929ca821c46/julia/array_prop/array_prop.jl
[2] 
https://github.com/yuyichao/explore/blob/e4be0151df33571c1c22f54fe044c929ca821c46/julia/array_prop/propagate.S
[3] 
https://github.com/yuyichao/explore/blob/e4be0151df33571c1c22f54fe044c929ca821c46/julia/array_prop/propagate.ll
