Update:

I've just got an even simpler version that has no complex numbers and
uses only Float64. The two loops compile down to just the following
LLVM IR now, and there's only simple arithmetic in the loop body.

```llvm
L9.preheader:                                     ; preds = %L12, %L9.preheader.preheader
  %"#s3.0" = phi i64 [ %60, %L12 ], [ 1, %L9.preheader.preheader ]
  br label %L9

L9:                                               ; preds = %L9, %L9.preheader
  %"#s4.0" = phi i64 [ %44, %L9 ], [ 1, %L9.preheader ]
  %44 = add i64 %"#s4.0", 1
  %45 = add i64 %"#s4.0", -1
  %46 = mul i64 %45, %10
  %47 = getelementptr double* %7, i64 %46
  %48 = load double* %47, align 8
  %49 = add i64 %46, 1
  %50 = getelementptr double* %7, i64 %49
  %51 = load double* %50, align 8
  %52 = fmul double %51, %3
  %53 = fmul double %38, %48
  %54 = fmul double %33, %52
  %55 = fadd double %53, %54
  store double %55, double* %50, align 8
  %56 = fmul double %38, %52
  %57 = fmul double %33, %48
  %58 = fsub double %56, %57
  store double %58, double* %47, align 8
  %59 = icmp eq i64 %"#s4.0", %12
  br i1 %59, label %L12, label %L9

L12:                                              ; preds = %L9
  %60 = add i64 %"#s3.0", 1
  %61 = icmp eq i64 %"#s3.0", %42
  br i1 %61, label %L14.loopexit, label %L9.preheader
```
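
For reference, here is roughly what that kernel looks like in Julia. This
is a minimal sketch I reconstructed to match the IR above; the names
(`propagate!`, `cosΩ`, `sinΩ`) are my placeholders, not necessarily the
exact script:

```julia
# In-place version of the two loops above: each column gets a 2x2
# rotation, with the ψ_g component scaled by eΓ first. The body is
# just loads, multiplies, adds and stores -- nothing branches on the
# value of eΓ.
function propagate!(ψs::Matrix{Float64}, cosΩ::Float64, sinΩ::Float64,
                    eΓ::Float64, nstep::Int)
    nele = size(ψs, 2)
    @inbounds for i in 1:nstep
        for j in 1:nele
            ψ_e = ψs[1, j]
            ψ_g = ψs[2, j] * eΓ  # <---- scaling factor, as before
            ψs[2, j] = cosΩ * ψ_e + sinΩ * ψ_g
            ψs[1, j] = cosΩ * ψ_g - sinΩ * ψ_e
        end
    end
    ψs
end
```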

On Sun, Jul 12, 2015 at 9:01 PM, Yichao Yu <[email protected]> wrote:
> On Sun, Jul 12, 2015 at 8:31 PM, Kevin Owens <[email protected]> wrote:
>> I can't really help you debug the IR code, but I can at least say I'm seeing
>> a similar thing. It starts to slow down around just after 0.5, and doesn't
>
> Thanks for the confirmation. At least I'm not crazy (or not the only
> one, to be more precise :P)
>
>> get back to where it was at 0.5 until 0.87. Can you compare the IR code when
>> two different values are used, to see what's different? When I tried looking
>> at the difference between 0.50 and 0.51, the biggest thing that popped out
>> to me was that the numbers after "!dbg" were different.
>
> This is exactly the strange part. I don't think either llvm or julia
> is doing constant propagation here, and different input values should
> be using the same function. The evidence:
>
> 1. Julia only specializes on types, not values (for now).
> 2. The function is not inlined, which has to be the case in global
> scope and can be double-checked by adding `@noinline`. It's also too
> big to be `inline_worthy`.
> 3. No codegen happens except on the first call. This can be seen from
> the output of `@time` (see the sketch after this list): only the first
> call has thousands of allocations; subsequent calls have fewer than 5
> (on 0.4 at least). If julia could compile a function with fewer than 5
> allocations, I'd be very happy.....
> 4. My original version also has much more complicated logic to compute
> the scaling factor (ok, it's just a function call, but with
> parameters gathered from different arguments). I'd be really surprised
> if either llvm or julia could reason anything about it....
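>
> A sketch of that check (hypothetical calls, not an actual session):
>
> julia> @time propagate(P, ψ0, ψs, 0.5) # first call: thousands of allocations (codegen)
>
> julia> @time propagate(P, ψ0, ψs, 0.7) # same types, new value: < 5 allocations, no codegen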
>
> The difference in the debug info should just be an artifact of
> emitting it twice.
>
>>
>> Even 0.50001 is a lot slower:
>>
>> julia> for eΓ in 0.5:0.00001:0.50015
>>            println(eΓ)
>>            gc()
>>            @time ψs = propagate(P, ψ0, ψs, eΓ)
>>        end
>> 0.5
>> elapsed time: 0.065609581 seconds (16 bytes allocated)
>> 0.50001
>> elapsed time: 0.875806461 seconds (16 bytes allocated)
>
> This is actually interesting and I can confirm the same here: `0.5`
> takes 28ms while `0.5000000000000001` takes 320ms. I was wondering
> whether the cpu does something special depending on the bit pattern,
> but I still don't understand why it would be so bad for a certain
> range. (And it's not only a function of this value either; it's also
> affected by all the other values.)
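>
> One way to compare the bit patterns (`bits` shows the raw IEEE754
> representation; if I'm reading it right, the two values differ only in
> the last mantissa bit):
>
> julia> bits(0.5)
> "0011111111100000000000000000000000000000000000000000000000000000"
>
> julia> bits(0.5000000000000001)
> "0011111111100000000000000000000000000000000000000000000000000001"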
>
>>
>>
>> julia> versioninfo()
>> Julia Version 0.3.9
>> Commit 31efe69 (2015-05-30 11:24 UTC)
>> Platform Info:
>>   System: Darwin (x86_64-apple-darwin13.4.0)
>
> Good, so it's not Linux-only.
>
>>   CPU: Intel(R) Core(TM)2 Duo CPU     T8300  @ 2.40GHz
>>   WORD_SIZE: 64
>>   BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Penryn)
>>   LAPACK: libopenblas
>>   LIBM: libopenlibm
>>   LLVM: libLLVM-3.3
>>
>>
>>
>>
>> On Sunday, July 12, 2015 at 6:45:53 PM UTC-5, Yichao Yu wrote:
>>>
>>> On Sun, Jul 12, 2015 at 7:40 PM, Yichao Yu <[email protected]> wrote:
>>> > P.S. Given how strange this problem is for me, I would appreciate it
>>> > if anyone could confirm whether this is a real issue or I'm somehow
>>> > being crazy or stupid.
>>> >
>>>
>>> One additional strange property of this issue is that I used to have
>>> much more costly operations in the (outer) loop (the one that iterates
>>> over nsteps with i), like Fourier transforms. However, when the scaling
>>> factor takes the bad value, it slows everything down (i.e. the
>>> Fourier transform is also slower by ~10x).
>>>
>>> >
>>> >
>>> > On Sun, Jul 12, 2015 at 7:30 PM, Yichao Yu <[email protected]> wrote:
>>> >> Hi,
>>> >>
>>> >> I've just seen a very strange (to me) performance difference for
>>> >> exactly the same code on slightly different input with no explicit
>>> >> branches.
>>> >>
>>> >> The code is available here[1]. The most relevant part is the following
>>> >> function (all other parts of the code are for initialization and
>>> >> benchmarking). This is a simplified version of my simulation that
>>> >> computes the next column in the array based on the previous one.
>>> >>
>>> >> The strange part is that the performance of this function can differ
>>> >> by 10x depending on the value of the scaling factor (`eΓ`, the only
>>> >> use of which is marked in the code below), even though I don't see
>>> >> any branches that depend on that value in the relevant code (unless
>>> >> the cpu is 10x less efficient for certain input values).
>>> >>
>>> >> function propagate(P, ψ0, ψs, eΓ)
>>> >>     @inbounds for i in 1:P.nele
>>> >>         ψs[1, i, 1] = ψ0[1, i]
>>> >>         ψs[2, i, 1] = ψ0[2, i]
>>> >>     end
>>> >>     T12 = im * sin(P.Ω)
>>> >>     T11 = cos(P.Ω)
>>> >>     @inbounds for i in 2:(P.nstep + 1)
>>> >>         for j in 1:P.nele
>>> >>             ψ_e = ψs[1, j, i - 1]
>>> >>             ψ_g = ψs[2, j, i - 1] * eΓ # <---- Scaling factor
>>> >>             ψs[2, j, i] = T11 * ψ_e + T12 * ψ_g
>>> >>             ψs[1, j, i] = T11 * ψ_g + T12 * ψ_e
>>> >>         end
>>> >>     end
>>> >>     ψs
>>> >> end
>>> >>
>>> >> The output of the full script is attached, and it can be clearly seen
>>> >> that for scaling factors 0.6-0.8 the performance is 5-10 times slower
>>> >> than for other values.
>>> >>
>>> >> The assembly[2] and llvm[3] code of this function are also in the same
>>> >> repo. I see the same behavior on both 0.3 and 0.4, with LLVM 3.3 and
>>> >> LLVM 3.6, on two different x86_64 machines (my laptop and a linode
>>> >> VPS). (The only platform I've tried that doesn't show similar behavior
>>> >> is julia 0.4 on qemu-arm....... although even there the performance
>>> >> between different values differs by ~30%, which is bigger than noise.)
>>> >>
>>> >> This also seems to depend on the initial value.
>>> >>
>>> >> Has anyone seen similar problems before?
>>> >>
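>>> >> For reference, the timings below come from a driver loop roughly like
>>> >> the following (a sketch; see [1] for the real script):
>>> >>
>>> >> for eΓ in 0.0:0.1:1.0
>>> >>     println(eΓ)
>>> >>     gc()
>>> >>     @time ψs = propagate(P, ψ0, ψs, eΓ)
>>> >> end
>>> >>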
>>> >> Outputs:
>>> >>
>>> >> 325.821 milliseconds (25383 allocations: 1159 KB)
>>> >> 307.826 milliseconds (4 allocations: 144 bytes)
>>> >> 0.0
>>> >>  19.227 milliseconds (2 allocations: 48 bytes)
>>> >> 0.1
>>> >>  17.291 milliseconds (2 allocations: 48 bytes)
>>> >> 0.2
>>> >>  17.404 milliseconds (2 allocations: 48 bytes)
>>> >> 0.3
>>> >>  19.231 milliseconds (2 allocations: 48 bytes)
>>> >> 0.4
>>> >>  20.278 milliseconds (2 allocations: 48 bytes)
>>> >> 0.5
>>> >>  23.692 milliseconds (2 allocations: 48 bytes)
>>> >> 0.6
>>> >> 328.107 milliseconds (2 allocations: 48 bytes)
>>> >> 0.7
>>> >> 312.425 milliseconds (2 allocations: 48 bytes)
>>> >> 0.8
>>> >> 201.494 milliseconds (2 allocations: 48 bytes)
>>> >> 0.9
>>> >>  16.314 milliseconds (2 allocations: 48 bytes)
>>> >> 1.0
>>> >>  16.264 milliseconds (2 allocations: 48 bytes)
>>> >>
>>> >>
>>> >> [1]
>>> >> https://github.com/yuyichao/explore/blob/e4be0151df33571c1c22f54fe044c929ca821c46/julia/array_prop/array_prop.jl
>>> >> [2]
>>> >> https://github.com/yuyichao/explore/blob/e4be0151df33571c1c22f54fe044c929ca821c46/julia/array_prop/propagate.S
>>> >> [3]
>>> >> https://github.com/yuyichao/explore/blob/e4be0151df33571c1c22f54fe044c929ca821c46/julia/array_prop/propagate.ll
