Update: I've now got an even simpler version that has no complex numbers and uses only Float64. The two loops are now as small as the following LLVM IR, and there's only simple arithmetic in the loop body.
```llvm
L9.preheader:                                     ; preds = %L12, %L9.preheader.preheader
  %"#s3.0" = phi i64 [ %60, %L12 ], [ 1, %L9.preheader.preheader ]
  br label %L9

L9:                                               ; preds = %L9, %L9.preheader
  %"#s4.0" = phi i64 [ %44, %L9 ], [ 1, %L9.preheader ]
  %44 = add i64 %"#s4.0", 1
  %45 = add i64 %"#s4.0", -1
  %46 = mul i64 %45, %10
  %47 = getelementptr double* %7, i64 %46
  %48 = load double* %47, align 8
  %49 = add i64 %46, 1
  %50 = getelementptr double* %7, i64 %49
  %51 = load double* %50, align 8
  %52 = fmul double %51, %3
  %53 = fmul double %38, %48
  %54 = fmul double %33, %52
  %55 = fadd double %53, %54
  store double %55, double* %50, align 8
  %56 = fmul double %38, %52
  %57 = fmul double %33, %48
  %58 = fsub double %56, %57
  store double %58, double* %47, align 8
  %59 = icmp eq i64 %"#s4.0", %12
  br i1 %59, label %L12, label %L9

L12:                                              ; preds = %L9
  %60 = add i64 %"#s3.0", 1
  %61 = icmp eq i64 %"#s3.0", %42
  br i1 %61, label %L14.loopexit, label %L9.preheader
```
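For reference, this is roughly the Julia code the loop corresponds to (a sketch only: the names follow the earlier complex version, and the in-place 2×n layout is inferred from the IR, so treat the details as assumptions):

```julia
# Hypothetical reconstruction of the Float64-only kernel: rotate each
# (ψ_e, ψ_g) pair in place, scaling ψ_g by eΓ first. One fmul for the
# scaling, then two fmul + fadd and two fmul + fsub per element,
# matching the IR above.
function propagate!(ψs::Matrix{Float64}, sinΩ::Float64, cosΩ::Float64,
                    eΓ::Float64, nsteps::Int)
    @inbounds for i in 1:nsteps
        for j in 1:size(ψs, 2)
            ψ_e = ψs[1, j]
            ψ_g = ψs[2, j] * eΓ  # <---- the scaling factor
            ψs[2, j] = cosΩ * ψ_e + sinΩ * ψ_g
            ψs[1, j] = cosΩ * ψ_g - sinΩ * ψ_e
        end
    end
    ψs
end
```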
On Sun, Jul 12, 2015 at 9:01 PM, Yichao Yu <[email protected]> wrote:
> On Sun, Jul 12, 2015 at 8:31 PM, Kevin Owens <[email protected]> wrote:
>> I can't really help you debug the IR code, but I can at least say I'm seeing
>> a similar thing. It starts to slow down just after 0.5, and doesn't
>
> Thanks for the confirmation. At least I'm not crazy (or, to be more
> precise, not the only one :P).
>
>> get back to where it was at 0.5 until 0.87. Can you compare the IR code when
>> two different values are used, to see what's different? When I tried looking
>> at the difference between 0.50 and 0.51, the biggest thing that popped out
>> to me was that the numbers after "!dbg" were different.
>
> This is exactly the strange part. I don't think either LLVM or Julia is
> doing constant propagation here, and different input values should be
> running the same compiled function. The evidence:
>
> 1. Julia only specializes on types, not values (for now).
> 2. The function is not inlined, which has to be the case in global
>    scope and can be double-checked by adding `@noinline`. It's also too
>    big to be inline_worthy.
> 3. No codegen happens except on the first call. This can be seen from
>    the output of `@time`: only the first call has thousands of
>    allocations; the following calls have fewer than 5 (on 0.4 at
>    least). If Julia could compile a function with fewer than 5
>    allocations, I'd be very happy.....
> 4. My original version also has much more complicated logic to compute
>    that scaling factor (OK, it's just a function call, but with
>    parameters gathered from different arguments). I'd be really
>    surprised if either LLVM or Julia could reason anything about it....
>
> The difference in the debug info should just be an artifact of emitting
> it twice.
>
>> Even 0.50001 is a lot slower:
>>
>> julia> for eΓ in 0.5:0.00001:0.50015
>>            println(eΓ)
>>            gc()
>>            @time ψs = propagate(P, ψ0, ψs, eΓ)
>>        end
>> 0.5
>> elapsed time: 0.065609581 seconds (16 bytes allocated)
>> 0.50001
>> elapsed time: 0.875806461 seconds (16 bytes allocated)
>
> This is actually interesting, and I can confirm the same here: `0.5`
> takes 28ms while 0.5000000000000001 takes 320ms. I was wondering
> whether the CPU does something special depending on the bit pattern,
> but I still don't understand why it would be so bad for a certain range
> (and it's not a function of this value alone either; it's also affected
> by all the other values).
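One concrete way to test the bit-pattern idea: repeated multiplication by a factor below 1 decays the stored values toward zero, and once they enter the subnormal (denormal) range many CPUs fall back to a much slower path. This is only a hypothesis, not something I've confirmed; the snippet below assumes the `set_zero_subnormals` function available on 0.4, with `P`, `ψ0`, and `ψs` taken from the benchmark script in [1]:

```julia
# Hypothesis check (assumption: subnormal intermediates cause the slowdown).
# set_zero_subnormals(true) asks the CPU to flush subnormals to zero
# (FTZ/DAZ on x86); if that removes the ~10x gap, subnormals are the culprit.
set_zero_subnormals(true)
@time propagate(P, ψ0, ψs, 0.7)   # a "bad" scaling factor
set_zero_subnormals(false)
@time propagate(P, ψ0, ψs, 0.7)   # compare against the default behavior
```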
>> julia> versioninfo()
>> Julia Version 0.3.9
>> Commit 31efe69 (2015-05-30 11:24 UTC)
>> Platform Info:
>>   System: Darwin (x86_64-apple-darwin13.4.0)
>
> Good, so it's not Linux-only.
>
>>   CPU: Intel(R) Core(TM)2 Duo CPU T8300 @ 2.40GHz
>>   WORD_SIZE: 64
>>   BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Penryn)
>>   LAPACK: libopenblas
>>   LIBM: libopenlibm
>>   LLVM: libLLVM-3.3
>>
>> On Sunday, July 12, 2015 at 6:45:53 PM UTC-5, Yichao Yu wrote:
>>>
>>> On Sun, Jul 12, 2015 at 7:40 PM, Yichao Yu <[email protected]> wrote:
>>> > P.S. Given how strange this problem is for me, I would appreciate it
>>> > if anyone could confirm that this is either a real issue or me
>>> > somehow being crazy or stupid.
>>> >
>>>
>>> One additional strange property of this issue is that I used to have
>>> much costlier operations in the (outer) loop (the one that iterates
>>> over nsteps with i), like Fourier transforms. However, when the
>>> scaling factor takes the bad value, it slows everything down (i.e. the
>>> Fourier transform is also slower by ~10x).
>>>
>>> > On Sun, Jul 12, 2015 at 7:30 PM, Yichao Yu <[email protected]> wrote:
>>> >> Hi,
>>> >>
>>> >> I've just seen a very strange (for me) performance difference for
>>> >> exactly the same code on slightly different input, with no explicit
>>> >> branches.
>>> >>
>>> >> The code is available here[1]. The most relevant part is the
>>> >> function below (everything else is initialization and benchmarking).
>>> >> It is a simplified version of my simulation, which computes each
>>> >> column of the array from the previous one.
>>> >>
>>> >> The strange part is that the performance of this function can
>>> >> differ by 10x depending on the value of the scaling factor (`eΓ`,
>>> >> whose only use is marked in the code below), even though I don't see
>>> >> any branches that depend on that value in the relevant code (unless
>>> >> the CPU is 10x less efficient for certain input values).
>>> >>
>>> >> function propagate(P, ψ0, ψs, eΓ)
>>> >>     @inbounds for i in 1:P.nele
>>> >>         ψs[1, i, 1] = ψ0[1, i]
>>> >>         ψs[2, i, 1] = ψ0[2, i]
>>> >>     end
>>> >>     T12 = im * sin(P.Ω)
>>> >>     T11 = cos(P.Ω)
>>> >>     @inbounds for i in 2:(P.nstep + 1)
>>> >>         for j in 1:P.nele
>>> >>             ψ_e = ψs[1, j, i - 1]
>>> >>             ψ_g = ψs[2, j, i - 1] * eΓ # <---- Scaling factor
>>> >>             ψs[2, j, i] = T11 * ψ_e + T12 * ψ_g
>>> >>             ψs[1, j, i] = T11 * ψ_g + T12 * ψ_e
>>> >>         end
>>> >>     end
>>> >>     ψs
>>> >> end
>>> >>
>>> >> The output of the full script is attached, and it clearly shows
>>> >> that for scaling factors 0.6-0.8 the performance is 5-10 times
>>> >> slower than for the others.
>>> >>
>>> >> The assembly[2] and LLVM[3] code of this function are also in the
>>> >> same repo. I see the same behavior on both 0.3 and 0.4, with LLVM
>>> >> 3.3 and LLVM 3.6, on two different x86_64 machines (my laptop and a
>>> >> Linode VPS). The only platform I've tried that doesn't show similar
>>> >> behavior is Julia 0.4 on qemu-arm......., although the performance
>>> >> between different values still differs by ~30%, which is bigger than
>>> >> noise.
>>> >>
>>> >> This also seems to depend on the initial values.
>>> >>
>>> >> Has anyone seen similar problems before?
>>> >>
>>> >> Outputs:
>>> >>
>>> >> 325.821 milliseconds (25383 allocations: 1159 KB)
>>> >> 307.826 milliseconds (4 allocations: 144 bytes)
>>> >> 0.0
>>> >>  19.227 milliseconds (2 allocations: 48 bytes)
>>> >> 0.1
>>> >>  17.291 milliseconds (2 allocations: 48 bytes)
>>> >> 0.2
>>> >>  17.404 milliseconds (2 allocations: 48 bytes)
>>> >> 0.3
>>> >>  19.231 milliseconds (2 allocations: 48 bytes)
>>> >> 0.4
>>> >>  20.278 milliseconds (2 allocations: 48 bytes)
>>> >> 0.5
>>> >>  23.692 milliseconds (2 allocations: 48 bytes)
>>> >> 0.6
>>> >> 328.107 milliseconds (2 allocations: 48 bytes)
>>> >> 0.7
>>> >> 312.425 milliseconds (2 allocations: 48 bytes)
>>> >> 0.8
>>> >> 201.494 milliseconds (2 allocations: 48 bytes)
>>> >> 0.9
>>> >>  16.314 milliseconds (2 allocations: 48 bytes)
>>> >> 1.0
>>> >>  16.264 milliseconds (2 allocations: 48 bytes)
>>> >>
>>> >> [1] https://github.com/yuyichao/explore/blob/e4be0151df33571c1c22f54fe044c929ca821c46/julia/array_prop/array_prop.jl
>>> >> [2] https://github.com/yuyichao/explore/blob/e4be0151df33571c1c22f54fe044c929ca821c46/julia/array_prop/propagate.S
>>> >> [3] https://github.com/yuyichao/explore/blob/e4be0151df33571c1c22f54fe044c929ca821c46/julia/array_prop/propagate.ll
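For completeness, the two leading timings and the per-value numbers quoted above come from a sweep along these lines (a sketch; the actual driver, including how `P`, `ψ0`, and `ψs` are built, is in array_prop.jl at [1], and the warm-up value here is a guess):

```julia
# Two warm-up timings first (the first call includes compilation and hence
# thousands of allocations), then one timed call per scaling factor, with a
# GC run before each so collection noise doesn't pollute the timing.
@time ψs = propagate(P, ψ0, ψs, 0.8)
@time ψs = propagate(P, ψ0, ψs, 0.8)
for eΓ in 0.0:0.1:1.0
    println(eΓ)
    gc()
    @time ψs = propagate(P, ψ0, ψs, eΓ)
end
```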
