On Mon, Jul 13, 2015 at 2:20 AM, Jeffrey Sarnoff <[email protected]> wrote:
> and this: Cleve Moler tries to see it your way
> Moler on floating point denormals
>
>
> On Monday, July 13, 2015 at 2:11:22 AM UTC-4, Jeffrey Sarnoff wrote:
>>
>> Denormals were made part of the IEEE Floating Point standard after some
>> very careful numerical analysis showed that accommodating them would
>> substantively improve the quality of floating point results, and this would
>> lift the quality of all floating point work. Surprising as it may be,
>> nonetheless you (and if not you today, then you tomorrow, or one of your
>> neighbors today) really do care about those unusual, and often rarely
>> observed, values.
>>
>> fyi
>> William Kahan on the introduction of denormals to the standard
>> and an early, important paper on this:
>> Effects of Underflow on Solving Linear Systems - J. Demmel 1981
>>
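As a concrete picture of the gradual underflow those references describe, here is a small illustrative sketch (nothing specific to the code in this thread; `nextfloat` and `issubnormal` are standard Base functions, and the literal shown is the usual smallest normal Float64):

```julia
# Sketch: where the subnormal (denormal) Float64 range sits.
smallest_normal    = 2.2250738585072014e-308   # 2^-1022, smallest normal Float64
smallest_subnormal = nextfloat(0.0)             # 5.0e-324 == 2^-1074

issubnormal(smallest_normal)       # false
issubnormal(smallest_normal / 2)   # true: values shrink gradually instead of
                                   # jumping straight to 0.0

# The property gradual underflow preserves: x - y == 0 only when x == y.
x, y = 3.0e-308, 2.5e-308
x - y                              # ~5.0e-309, a nonzero subnormal
issubnormal(x - y)                 # true; with flush-to-zero this would be 0.0
```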
Thank you very much for the references. I definitely believe every part of the
IEEE Floating Point standard has its reason to be there; I'm more wondering in
which cases they are actually significant. OTOH, I also believe there are
certain kinds of computation that do not care about them, which is why
-ffast-math exists.

From the MathWorks blog reference:

> Double precision denormals are so tiny that they are rarely numerically
> significant, but single precision denormals can be in the range where they
> affect some otherwise unremarkable computations.

For my computation, I'm currently using double precision, but I've already
checked that switching to single precision still gives me enough precision.
Based on this, can I say that I can ignore them if I use double precision but
may need to keep them if I switch to single precision? Using Float64 takes
twice as long as using Float32, while keeping denormals seems to take 10x the
time.....

As for doing this in Julia, I found @simonbyrne's mxcsr.jl[1]. However, I
couldn't get it working without #11604[2]. Inline assembly in llvmcall is
working on LLVM 3.6 though[3], in case that's useful for others.

[1] https://gist.github.com/simonbyrne/9c1e4704be46b66b1485
[2] https://github.com/JuliaLang/julia/pull/11604
[3] https://github.com/yuyichao/explore/blob/a47cef8c84ad3f43b18e0fd797dca9debccdd250/julia/array_prop/array_prop.jl#L3
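For reference, later Julia versions expose this directly: Base gained `set_zero_subnormals` / `get_zero_subnormals`, which on x86 flip the same MXCSR FTZ/DAZ bits that gcc's `set_fast_math` stub and mxcsr.jl touch. A minimal sketch, assuming a Julia build new enough to ship these functions (they may not be available on the 0.3/0.4 builds discussed here):

```julia
# Sketch: flush subnormals to zero for the current thread, assuming
# `set_zero_subnormals`/`get_zero_subnormals` exist in Base.
old = get_zero_subnormals()     # remember the current mode
set_zero_subnormals(true)       # returns false if the hardware can't do it

# ... run the value-sensitive code here, e.g. propagate(P, ψ0, ψs, eΓ);
# subnormal intermediates are now treated as 0.0, so the slow path is
# avoided at the cost of losing gradual underflow ...

set_zero_subnormals(old)        # restore the default, IEEE-conformant mode
```

Note that this changes the floating-point mode for the whole thread rather than per operation, so it is worth restoring the old mode once the hot loop is done.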
>>
>> On Monday, July 13, 2015 at 12:35:24 AM UTC-4, Yichao Yu wrote:
>>>
>>> On Sun, Jul 12, 2015 at 10:30 PM, Yichao Yu <[email protected]> wrote:
>>> > Further update:
>>> >
>>> > I made a c++ version[1] and see a similar effect (depending on
>>> > optimization levels) so it's not a julia issue (not that I think it
>>> > really was to begin with...).
>>>
>>> After investigating the c++ version more, I find that the difference
>>> between the fast_math and the non-fast_math version is that the
>>> compiler emits a function called `set_fast_math` (see below).
>>>
>>> From what I can tell, the function sets bit 6 and bit 15 of the MXCSR
>>> register (for SSE), and according to this page[1] these are the DAZ and
>>> FZ bits (both related to underflow). It also describes denormals as
>>> taking "considerably longer to process". Since the operation I have
>>> keeps decreasing the value, I guess it makes sense that there's a
>>> value-dependent performance.... (and it kind of makes sense that fft
>>> also suffers from these values)
>>>
>>> So now the questions are:
>>>
>>> 1. How important are underflow and denormal values? Note that I'm not
>>> catching underflow explicitly anyway and I don't really care about
>>> values that are really small compared to 1.
>>>
>>> 2. Is there a way to set up the SSE registers as done by the c
>>> compilers? @fastmath does not seem to be doing this.
>>>
>>> 00000000000005b0 <set_fast_math>:
>>>  5b0:   0f ae 5c 24 fc          stmxcsr -0x4(%rsp)
>>>  5b5:   81 4c 24 fc 40 80 00    orl    $0x8040,-0x4(%rsp)
>>>  5bc:   00
>>>  5bd:   0f ae 54 24 fc          ldmxcsr -0x4(%rsp)
>>>  5c2:   c3                      retq
>>>  5c3:   66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
>>>  5ca:   00 00 00
>>>  5cd:   0f 1f 00                nopl   (%rax)
>>>
>>> [1] http://softpixel.com/~cwright/programming/simd/sse.php
>>>
>>> >
>>> > The slowdown is present in the c++ version for all optimization
>>> > levels except Ofast and ffast-math. The julia version is faster than
>>> > the default performance for both gcc and clang but is slower in the
>>> > fast case for higher optimization levels. For O2 and higher, the c++
>>>
>>> The slowness of the julia version seems to be due to multidimensional
>>> arrays. Using 1d arrays yields performance similar to C.
>>>
>>> > version shows a ~100x slowdown for the slow case.
>>> >
>>> > @fastmath in julia doesn't seem to have an effect for this although
>>> > it does for clang and gcc...
>>> >
>>> > [1]
>>> > https://github.com/yuyichao/explore/blob/5a644cd46dc6f8056cee69f508f9e995b5839a01/julia/array_prop/propagate.cpp
>>> >
>>> > On Sun, Jul 12, 2015 at 9:23 PM, Yichao Yu <[email protected]> wrote:
>>> >> Update:
>>> >>
>>> >> I've just got an even simpler version without any complex numbers
>>> >> that only uses Float64. The two loops are as small as the following
>>> >> LLVM IR now, and there's only simple arithmetic in the loop body.
>>> >>
>>> >> ```llvm
>>> >> L9.preheader:                    ; preds = %L12, %L9.preheader.preheader
>>> >>   %"#s3.0" = phi i64 [ %60, %L12 ], [ 1, %L9.preheader.preheader ]
>>> >>   br label %L9
>>> >>
>>> >> L9:                              ; preds = %L9, %L9.preheader
>>> >>   %"#s4.0" = phi i64 [ %44, %L9 ], [ 1, %L9.preheader ]
>>> >>   %44 = add i64 %"#s4.0", 1
>>> >>   %45 = add i64 %"#s4.0", -1
>>> >>   %46 = mul i64 %45, %10
>>> >>   %47 = getelementptr double* %7, i64 %46
>>> >>   %48 = load double* %47, align 8
>>> >>   %49 = add i64 %46, 1
>>> >>   %50 = getelementptr double* %7, i64 %49
>>> >>   %51 = load double* %50, align 8
>>> >>   %52 = fmul double %51, %3
>>> >>   %53 = fmul double %38, %48
>>> >>   %54 = fmul double %33, %52
>>> >>   %55 = fadd double %53, %54
>>> >>   store double %55, double* %50, align 8
>>> >>   %56 = fmul double %38, %52
>>> >>   %57 = fmul double %33, %48
>>> >>   %58 = fsub double %56, %57
>>> >>   store double %58, double* %47, align 8
>>> >>   %59 = icmp eq i64 %"#s4.0", %12
>>> >>   br i1 %59, label %L12, label %L9
>>> >>
>>> >> L12:                             ; preds = %L9
>>> >>   %60 = add i64 %"#s3.0", 1
>>> >>   %61 = icmp eq i64 %"#s3.0", %42
>>> >>   br i1 %61, label %L14.loopexit, label %L9.preheader
>>> >> ```
>>> >>
>>> >> On Sun, Jul 12, 2015 at 9:01 PM, Yichao Yu <[email protected]> wrote:
>>> >>> On Sun, Jul 12, 2015 at 8:31 PM, Kevin Owens <[email protected]>
>>> >>> wrote:
>>> >>>> I can't really help you debug the IR code, but I can at least say
>>> >>>> I'm seeing a similar thing. It starts to slow down just after 0.5,
>>> >>>> and doesn't
>>> >>>
>>> >>> Thanks for the confirmation. At least I'm not crazy (or not the only
>>> >>> one, to be more precise :P )
>>> >>>
>>> >>>> get back to where it was at 0.5 until 0.87. Can you compare the IR
>>> >>>> code when two different values are used, to see what's different?
>>> >>>> When I tried looking at the difference between 0.50 and 0.51, the
>>> >>>> biggest thing that popped out to me was that the numbers after
>>> >>>> "!dbg" were different.
>>> >>>
>>> >>> This is exactly the strange part. I don't think either llvm or julia
>>> >>> is doing constant propagation here, and different input values should
>>> >>> be using the same function. The evidence is:
>>> >>>
>>> >>> 1. Julia only specializes on types, not values (for now).
>>> >>> 2. The function is not inlined, which has to be the case in global
>>> >>> scope and can be double checked by adding `@noinline`. It's also too
>>> >>> big to be inline_worthy.
>>> >>> 3. No codegen is happening except on the first call. This can be seen
>>> >>> from the output of `@time`. Only the first call has thousands of
>>> >>> allocations; following calls have fewer than 5 (on 0.4 at least). If
>>> >>> julia can compile a function with fewer than 5 allocations, I'll be
>>> >>> very happy.....
>>> >>> 4. My original version also has much more complicated logic to
>>> >>> compute that scaling factor (ok, it's just a function call, but with
>>> >>> parameters gathered from different arguments). I'd be really
>>> >>> surprised if either llvm or julia can reason anything about it....
>>> >>>
>>> >>> The difference in the debug info should just be an artifact of
>>> >>> emitting it twice.
>>> >>>
>>> >>>>
>>> >>>> Even 0.50001 is a lot slower:
>>> >>>>
>>> >>>> julia> for eΓ in 0.5:0.00001:0.50015
>>> >>>>            println(eΓ)
>>> >>>>            gc()
>>> >>>>            @time ψs = propagate(P, ψ0, ψs, eΓ)
>>> >>>>        end
>>> >>>> 0.5
>>> >>>> elapsed time: 0.065609581 seconds (16 bytes allocated)
>>> >>>> 0.50001
>>> >>>> elapsed time: 0.875806461 seconds (16 bytes allocated)
>>> >>>
>>> >>> This is actually interesting and I can confirm the same here. `0.5`
>>> >>> takes 28ms while 0.5000000000000001 takes 320ms. I was wondering
>>> >>> whether the cpu is doing something special depending on the bit
>>> >>> pattern, but I still don't understand why it would be very bad for a
>>> >>> certain range (and it's not only a function of this value either; it
>>> >>> is also affected by all the other values).
>>> >>>
>>> >>>>
>>> >>>>
>>> >>>> julia> versioninfo()
>>> >>>> Julia Version 0.3.9
>>> >>>> Commit 31efe69 (2015-05-30 11:24 UTC)
>>> >>>> Platform Info:
>>> >>>>   System: Darwin (x86_64-apple-darwin13.4.0)
>>> >>>
>>> >>> Good, so it's not Linux only.
>>> >>>
>>> >>>>   CPU: Intel(R) Core(TM)2 Duo CPU T8300 @ 2.40GHz
>>> >>>>   WORD_SIZE: 64
>>> >>>>   BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Penryn)
>>> >>>>   LAPACK: libopenblas
>>> >>>>   LIBM: libopenlibm
>>> >>>>   LLVM: libLLVM-3.3
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>> On Sunday, July 12, 2015 at 6:45:53 PM UTC-5, Yichao Yu wrote:
>>> >>>>>
>>> >>>>> On Sun, Jul 12, 2015 at 7:40 PM, Yichao Yu <[email protected]>
>>> >>>>> wrote:
>>> >>>>> > P.S. Given how strange this problem is for me, I would appreciate
>>> >>>>> > it if anyone can confirm that either this is a real issue or I'm
>>> >>>>> > somehow being crazy or stupid.
>>> >>>>> >
>>> >>>>>
>>> >>>>> One additional strange property of this issue is that I used to
>>> >>>>> have much more costly operations in the (outer) loop (the one that
>>> >>>>> iterates over nsteps with i), like Fourier transformations.
>>> >>>>> However, when the scaling factor takes the bad value, it slows
>>> >>>>> everything down (i.e. the Fourier transformation is also slower by
>>> >>>>> ~10x).
>>> >>>>>
>>> >>>>> >
>>> >>>>> >
>>> >>>>> > On Sun, Jul 12, 2015 at 7:30 PM, Yichao Yu <[email protected]>
>>> >>>>> > wrote:
>>> >>>>> >> Hi,
>>> >>>>> >>
>>> >>>>> >> I've just seen a very strange (for me) performance difference
>>> >>>>> >> for exactly the same code on slightly different input with no
>>> >>>>> >> explicit branches.
>>> >>>>> >>
>>> >>>>> >> The code is available here[1]. The most relevant part is the
>>> >>>>> >> following function. (All other parts of the code are for
>>> >>>>> >> initialization and benchmarking.) This is a simplified version
>>> >>>>> >> of my simulation that computes the next column in the array
>>> >>>>> >> based on the previous one.
>>> >>>>> >>
>>> >>>>> >> The strange part is that the performance of this function can
>>> >>>>> >> differ by 10x depending on the value of the scaling factor
>>> >>>>> >> (`eΓ`, whose only use is marked in the code below), even though
>>> >>>>> >> I don't see any branches that depend on that value in the
>>> >>>>> >> relevant code (unless the cpu is 10x less efficient for certain
>>> >>>>> >> input values).
>>> >>>>> >>
>>> >>>>> >> function propagate(P, ψ0, ψs, eΓ)
>>> >>>>> >>     @inbounds for i in 1:P.nele
>>> >>>>> >>         ψs[1, i, 1] = ψ0[1, i]
>>> >>>>> >>         ψs[2, i, 1] = ψ0[2, i]
>>> >>>>> >>     end
>>> >>>>> >>     T12 = im * sin(P.Ω)
>>> >>>>> >>     T11 = cos(P.Ω)
>>> >>>>> >>     @inbounds for i in 2:(P.nstep + 1)
>>> >>>>> >>         for j in 1:P.nele
>>> >>>>> >>             ψ_e = ψs[1, j, i - 1]
>>> >>>>> >>             ψ_g = ψs[2, j, i - 1] * eΓ # <---- Scaling factor
>>> >>>>> >>             ψs[2, j, i] = T11 * ψ_e + T12 * ψ_g
>>> >>>>> >>             ψs[1, j, i] = T11 * ψ_g + T12 * ψ_e
>>> >>>>> >>         end
>>> >>>>> >>     end
>>> >>>>> >>     ψs
>>> >>>>> >> end
>>> >>>>> >>
>>> >>>>> >> The output of the full script is attached, and it can be clearly
>>> >>>>> >> seen that for scaling factors 0.6-0.8 the performance is 5-10
>>> >>>>> >> times slower than for the others.
>>> >>>>> >>
>>> >>>>> >> The assembly[2] and llvm[3] code of this function are also in
>>> >>>>> >> the same repo. I see the same behavior on both 0.3 and 0.4, and
>>> >>>>> >> with LLVM 3.3 and LLVM 3.6, on two different x86_64 machines (my
>>> >>>>> >> laptop and a linode VPS). (The only platform I've tried that
>>> >>>>> >> doesn't show similar behavior is julia 0.4 on qemu-arm, although
>>> >>>>> >> there the performance between different values also differs by
>>> >>>>> >> ~30%, which is bigger than noise.)
>>> >>>>> >>
>>> >>>>> >> This also seems to depend on the initial value.
>>> >>>>> >>
>>> >>>>> >> Has anyone seen similar problems before?
>>> >>>>> >>
>>> >>>>> >> Outputs:
>>> >>>>> >>
>>> >>>>> >> 325.821 milliseconds (25383 allocations: 1159 KB)
>>> >>>>> >> 307.826 milliseconds (4 allocations: 144 bytes)
>>> >>>>> >> 0.0
>>> >>>>> >> 19.227 milliseconds (2 allocations: 48 bytes)
>>> >>>>> >> 0.1
>>> >>>>> >> 17.291 milliseconds (2 allocations: 48 bytes)
>>> >>>>> >> 0.2
>>> >>>>> >> 17.404 milliseconds (2 allocations: 48 bytes)
>>> >>>>> >> 0.3
>>> >>>>> >> 19.231 milliseconds (2 allocations: 48 bytes)
>>> >>>>> >> 0.4
>>> >>>>> >> 20.278 milliseconds (2 allocations: 48 bytes)
>>> >>>>> >> 0.5
>>> >>>>> >> 23.692 milliseconds (2 allocations: 48 bytes)
>>> >>>>> >> 0.6
>>> >>>>> >> 328.107 milliseconds (2 allocations: 48 bytes)
>>> >>>>> >> 0.7
>>> >>>>> >> 312.425 milliseconds (2 allocations: 48 bytes)
>>> >>>>> >> 0.8
>>> >>>>> >> 201.494 milliseconds (2 allocations: 48 bytes)
>>> >>>>> >> 0.9
>>> >>>>> >> 16.314 milliseconds (2 allocations: 48 bytes)
>>> >>>>> >> 1.0
>>> >>>>> >> 16.264 milliseconds (2 allocations: 48 bytes)
>>> >>>>> >>
>>> >>>>> >>
>>> >>>>> >> [1]
>>> >>>>> >> https://github.com/yuyichao/explore/blob/e4be0151df33571c1c22f54fe044c929ca821c46/julia/array_prop/array_prop.jl
>>> >>>>> >> [2]
>>> >>>>> >> https://github.com/yuyichao/explore/blob/e4be0151df33571c1c22f54fe044c929ca821c46/julia/array_prop/propagate.S
>>> >>>>> >> [3]
>>> >>>>> >> https://github.com/yuyichao/explore/blob/e4be0151df33571c1c22f54fe044c929ca821c46/julia/array_prop/propagate.ll
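The value-dependent timings quoted above (fast for 0.0-0.5 and 0.9-1.0, roughly 10x slower for 0.6-0.8) are consistent with the inner loop spending many iterations on subnormal intermediates. A minimal way to see the raw cost of subnormal operands on a given machine is sketched below; `accumulate_scaled` and the array values are purely illustrative (not taken from the code in this thread), and the exact ratio varies by CPU:

```julia
# Sketch: compare arithmetic on normal vs. subnormal Float64 operands.
# The slowdown typically disappears once subnormals are flushed to zero
# (FTZ/DAZ), e.g. via set_zero_subnormals(true) where available.
function accumulate_scaled(xs, s)
    acc = 0.0
    @inbounds for x in xs
        acc += x * s          # subnormal inputs to the multiply hit the slow path
    end
    acc
end

normal = fill(1.0e-30, 10^6)    # comfortably inside the normal range
tiny   = fill(1.0e-310, 10^6)   # subnormal Float64 values
@assert all(issubnormal, tiny)

accumulate_scaled(normal, 0.5)  # warm up / compile
accumulate_scaled(tiny, 0.5)
@time accumulate_scaled(normal, 0.5)
@time accumulate_scaled(tiny, 0.5)   # typically several times slower unless
                                     # subnormals are flushed to zero
```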
