Denormals were made part of the IEEE floating point standard after some 
very careful numerical analysis showed that accommodating them would 
substantively improve the quality of floating point results, and that this 
would lift the quality of all floating point work. Surprising as it may be, 
you (and if not you today, then you tomorrow, or one of your neighbors 
today) really do care about these unusual and often rarely observed values.

FYI:
William Kahan on the introduction of denormals to the standard:
<https://www.cs.berkeley.edu/~wkahan/ieee754status/754story.html>
and an early, important paper on the topic:
"Effects of Underflow on Solving Linear Systems" - J. Demmel, 1981:
<http://www.eecs.berkeley.edu/Pubs/TechRpts/1983/CSD-83-128.pdf>
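
To make this concrete, here is a minimal sketch (the array length, constants, 
and size of the penalty are illustrative and CPU dependent) of what subnormal 
values look like in Julia and how much slower arithmetic on them can be when 
the FTZ/DAZ bits are not set:

```julia
# Subnormal (denormal) Float64 values fill the gap between 0 and the smallest
# normal value (~2.2e-308); on many x86 CPUs arithmetic that touches them is
# handled by a much slower microcode-assisted path.
function scaleall!(a, c)
    @inbounds for i in 1:length(a)
        a[i] *= c
    end
    return a
end

normals    = fill(1e-300, 10^6)   # normal inputs, normal results
subnormals = fill(1e-310, 10^6)   # subnormal inputs, subnormal results
@assert !issubnormal(normals[1]) && issubnormal(subnormals[1])

scaleall!(copy(normals), 0.75)    # warm up / compile
@time scaleall!(normals, 0.75)    # fast path
@time scaleall!(subnormals, 0.75) # typically much slower without FTZ/DAZ
```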
  

On Monday, July 13, 2015 at 12:35:24 AM UTC-4, Yichao Yu wrote:
>
> On Sun, Jul 12, 2015 at 10:30 PM, Yichao Yu <yyc...@gmail.com> wrote: 
> > Further update: 
> > 
> > I made a C++ version[1] and it shows a similar effect (depending on the 
> > optimization level), so it's not a Julia issue (not that I thought it 
> > really was to begin with...). 
>
> After investigating the C++ version more, I found that the difference 
> between the fast-math and the non-fast-math version is that the 
> compiler emits a function called `set_fast_math` (see below). 
>
> From what I can tell, the function sets bits 6 and 15 of the MXCSR 
> register (for SSE), and according to this page[1] these are the DAZ and 
> FZ bits (both related to underflow/denormals). It also describes denormal 
> values as taking "considerably longer to process". Since the operation I 
> have keeps decreasing the value, I guess it makes sense that the 
> performance is value dependent... (and it kind of makes sense that the 
> FFT also suffers from these values) 
>
> So now the questions are: 
>
> 1. How important are underflow and denormal values? Note that I'm not 
> catching underflow explicitly anyway, and I don't really care about 
> values that are really small compared to 1. 
>
> 2. Is there a way to set up the SSE registers the way the C compilers 
> do? @fastmath does not seem to do this. 
>
> 00000000000005b0 <set_fast_math>: 
>  5b0:    0f ae 5c 24 fc           stmxcsr -0x4(%rsp) 
>  5b5:    81 4c 24 fc 40 80 00     orl    $0x8040,-0x4(%rsp) 
>  5bc:    00 
>  5bd:    0f ae 54 24 fc           ldmxcsr -0x4(%rsp) 
>  5c2:    c3                       retq 
>  5c3:    66 2e 0f 1f 84 00 00     nopw   %cs:0x0(%rax,%rax,1) 
>  5ca:    00 00 00 
>  5cd:    0f 1f 00                 nopl   (%rax) 
>
> [1] http://softpixel.com/~cwright/programming/simd/sse.php 
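
Regarding question 2 in the quoted message: current Julia releases expose 
exactly these two MXCSR bits through Base as `set_zero_subnormals`; whether 
that function is available in the 0.3/0.4 builds used in this thread is an 
assumption on my part, so treat the following as a sketch rather than a 
guaranteed recipe:

```julia
# Sketch: flush subnormal inputs/results to zero for the calling thread,
# roughly mirroring what the compiler-generated set_fast_math above does.
set_zero_subnormals(true)     # returns true if the hardware supports FTZ/DAZ
@show get_zero_subnormals()

# ... run the propagate benchmark here ...

set_zero_subnormals(false)    # restore strict IEEE subnormal handling
```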
>
> > 
> > The slowdown is present in the C++ version at all optimization 
> > levels except -Ofast and -ffast-math. The Julia version is faster than 
> > the default performance of both gcc and clang but is slower than the 
> > fast case at higher optimization levels. At -O2 and higher, the C++ 
>
> The slowness of the Julia version seems to be due to multi-dimensional 
> arrays. Using a 1D array yields performance similar to C. 
>
> > version shows a ~100x slowdown for the slow case. 
> > 
> > @fastmath in Julia doesn't seem to have an effect here, although 
> > it does for clang and gcc... 
> > 
> > [1] 
> https://github.com/yuyichao/explore/blob/5a644cd46dc6f8056cee69f508f9e995b5839a01/julia/array_prop/propagate.cpp
>  
> > 
> > On Sun, Jul 12, 2015 at 9:23 PM, Yichao Yu <yyc...@gmail.com> wrote: 
> >> Update: 
> >> 
> >> I've just made an even simpler version that has no complex numbers and 
> >> only Float64. The two loops are now as small as the following LLVM IR, 
> >> and there is only simple arithmetic in the loop body. 
> >> 
> >> ```llvm 
> >> L9.preheader:                                     ; preds = %L12, %L9.preheader.preheader 
> >>   %"#s3.0" = phi i64 [ %60, %L12 ], [ 1, %L9.preheader.preheader ] 
> >>   br label %L9 
> >> 
> >> L9:                                               ; preds = %L9, %L9.preheader 
> >>   %"#s4.0" = phi i64 [ %44, %L9 ], [ 1, %L9.preheader ] 
> >>   %44 = add i64 %"#s4.0", 1 
> >>   %45 = add i64 %"#s4.0", -1 
> >>   %46 = mul i64 %45, %10 
> >>   %47 = getelementptr double* %7, i64 %46 
> >>   %48 = load double* %47, align 8 
> >>   %49 = add i64 %46, 1 
> >>   %50 = getelementptr double* %7, i64 %49 
> >>   %51 = load double* %50, align 8 
> >>   %52 = fmul double %51, %3 
> >>   %53 = fmul double %38, %48 
> >>   %54 = fmul double %33, %52 
> >>   %55 = fadd double %53, %54 
> >>   store double %55, double* %50, align 8 
> >>   %56 = fmul double %38, %52 
> >>   %57 = fmul double %33, %48 
> >>   %58 = fsub double %56, %57 
> >>   store double %58, double* %47, align 8 
> >>   %59 = icmp eq i64 %"#s4.0", %12 
> >>   br i1 %59, label %L12, label %L9 
> >> 
> >> L12:                                              ; preds = %L9 
> >>   %60 = add i64 %"#s3.0", 1 
> >>   %61 = icmp eq i64 %"#s3.0", %42 
> >>   br i1 %61, label %L14.loopexit, label %L9.preheader 
> >> ``` 
> >> 
> >> On Sun, Jul 12, 2015 at 9:01 PM, Yichao Yu <yyc...@gmail.com> wrote: 
> >>> On Sun, Jul 12, 2015 at 8:31 PM, Kevin Owens <kevin....@gmail.com> wrote: 
> >>>> I can't really help you debug the IR code, but I can at least say I'm 
> >>>> seeing a similar thing. It starts to slow down around just after 0.5, 
> >>>> and doesn't 
> >>> 
> >>> Thanks for the confirmation. At least I'm not crazy (or not the only 
> >>> one to be more precise :P ) 
> >>> 
> >>>> get back to where it was at 0.5 until 0.87. Can you compare the IR 
> >>>> code when two different values are used, to see what's different? When 
> >>>> I tried looking at the difference between 0.50 and 0.51, the biggest 
> >>>> thing that popped out to me was that the numbers after "!dbg" were 
> >>>> different. 
> >>> 
> >>> This is exactly the strange part. I don't think either LLVM or Julia 
> >>> is doing constant propagation here, and different input values should 
> >>> be going through the same compiled function. The evidence: 
> >>> 
> >>> 1. Julia only specializes on types, not values (for now). 
> >>> 2. The function is not inlined, which has to be the case when it is 
> >>> called from global scope and can be double-checked by adding 
> >>> `@noinline`. It's also too big to be inline_worthy. 
> >>> 3. No codegen happens except on the first call. This can be seen from 
> >>> the output of `@time`: only the first call has thousands of 
> >>> allocations; following calls have fewer than 5 (on 0.4 at least). If 
> >>> Julia could compile a function with fewer than 5 allocations, I'd be 
> >>> very happy... 
> >>> 4. My original version also has much more complicated logic to compute 
> >>> that scaling factor (OK, it's just a function call, but with 
> >>> parameters gathered from different arguments). I'd be really surprised 
> >>> if either LLVM or Julia could reason about it at all. 
> >>> 
> >>> The difference in the debug info should just be an artifact of 
> >>> emitting it twice. 
> >>> 
> >>>> 
> >>>> Even 0.50001 is a lot slower: 
> >>>> 
> >>>> julia> for eΓ in 0.5:0.00001:0.50015 
> >>>>            println(eΓ) 
> >>>>            gc() 
> >>>>            @time ψs = propagate(P, ψ0, ψs, eΓ) 
> >>>>        end 
> >>>> 0.5 
> >>>> elapsed time: 0.065609581 seconds (16 bytes allocated) 
> >>>> 0.50001 
> >>>> elapsed time: 0.875806461 seconds (16 bytes allocated) 
> >>> 
> >>> This is actually interesting, and I can confirm the same here: `0.5` 
> >>> takes 28ms while 0.5000000000000001 takes 320ms. I was wondering 
> >>> whether the CPU does something special depending on the bit pattern, 
> >>> but I still don't understand why it would be so bad for a certain range 
> >>> (and it is not only a function of this value either; it is also 
> >>> affected by all the other values). 
> >>> 
> >>>> 
> >>>> 
> >>>> julia> versioninfo() 
> >>>> Julia Version 0.3.9 
> >>>> Commit 31efe69 (2015-05-30 11:24 UTC) 
> >>>> Platform Info: 
> >>>>   System: Darwin (x86_64-apple-darwin13.4.0) 
> >>> 
> >>> Good, so it's not Linux-only. 
> >>> 
> >>>>   CPU: Intel(R) Core(TM)2 Duo CPU     T8300  @ 2.40GHz 
> >>>>   WORD_SIZE: 64 
> >>>>   BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Penryn) 
> >>>>   LAPACK: libopenblas 
> >>>>   LIBM: libopenlibm 
> >>>>   LLVM: libLLVM-3.3 
> >>>> 
> >>>> 
> >>>> 
> >>>> 
> >>>> On Sunday, July 12, 2015 at 6:45:53 PM UTC-5, Yichao Yu wrote: 
> >>>>> 
> >>>>> On Sun, Jul 12, 2015 at 7:40 PM, Yichao Yu <yyc...@gmail.com> 
> wrote: 
> >>>>> > P.S. Given how strange this problem is for me, I would appreciate 
> >>>>> > it if anyone could confirm whether this is a real issue or I'm 
> >>>>> > somehow being crazy or stupid. 
> >>>>> > 
> >>>>> 
> >>>>> One additional strange property of this issue is that I used to have 
> >>>>> much more costly operations in the (outer) loop (the one that iterates 
> >>>>> over nsteps with i), such as Fourier transforms. However, when the 
> >>>>> scaling factor takes a bad value, it slows everything down (i.e. the 
> >>>>> Fourier transform is also ~10x slower). 
> >>>>> 
> >>>>> > 
> >>>>> > 
> >>>>> > On Sun, Jul 12, 2015 at 7:30 PM, Yichao Yu <yyc...@gmail.com> 
> wrote: 
> >>>>> >> Hi, 
> >>>>> >> 
> >>>>> >> I've just seen a very strange (for me) performance difference for 
> >>>>> >> exactly the same code on slightly different input, with no explicit 
> >>>>> >> branches. 
> >>>>> >> 
> >>>>> >> The code is available here[1]. The most relevant part is the 
> >>>>> >> following function (all other parts of the code are for 
> >>>>> >> initialization and benchmarking). It is a simplified version of my 
> >>>>> >> simulation that computes the next column in the array based on the 
> >>>>> >> previous one. 
> >>>>> >> 
> >>>>> >> The strange part is that the performance of this function can 
> >>>>> >> differ by 10x depending on the value of the scaling factor (`eΓ`, 
> >>>>> >> the only use of which is marked in the code below), even though I 
> >>>>> >> don't see any branches that depend on that value in the relevant 
> >>>>> >> code (unless the CPU is 10x less efficient for certain input 
> >>>>> >> values). 
> >>>>> >> 
> >>>>> >> function propagate(P, ψ0, ψs, eΓ) 
> >>>>> >>     @inbounds for i in 1:P.nele 
> >>>>> >>         ψs[1, i, 1] = ψ0[1, i] 
> >>>>> >>         ψs[2, i, 1] = ψ0[2, i] 
> >>>>> >>     end 
> >>>>> >>     T12 = im * sin(P.Ω) 
> >>>>> >>     T11 = cos(P.Ω) 
> >>>>> >>     @inbounds for i in 2:(P.nstep + 1) 
> >>>>> >>         for j in 1:P.nele 
> >>>>> >>             ψ_e = ψs[1, j, i - 1] 
> >>>>> >>             ψ_g = ψs[2, j, i - 1] * eΓ # <---- Scaling factor 
> >>>>> >>             ψs[2, j, i] = T11 * ψ_e + T12 * ψ_g 
> >>>>> >>             ψs[1, j, i] = T11 * ψ_g + T12 * ψ_e 
> >>>>> >>         end 
> >>>>> >>     end 
> >>>>> >>     ψs 
> >>>>> >> end 
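
(A workaround sketch for the loop quoted above, in case flipping the FTZ/DAZ 
bits is not an option: the cutoff 1e-300 and the helper name `flushtiny` are 
my own illustrative choices, not part of the original code.)

```julia
# Hypothetical helper: clamp magnitudes below a cutoff to exact zero so the
# iterates never linger in the subnormal range.
flushtiny(x, ϵ=1e-300) = abs(x) < ϵ ? zero(x) : x

# Inside the inner loop one could then write, for example:
#     ψ_g = flushtiny(ψs[2, j, i - 1] * eΓ)
```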
> >>>>> >> 
> >>>>> >> The output of the full script is attached, and it can be clearly 
> >>>>> >> seen that for scaling factors 0.6-0.8 the performance is 5-10 times 
> >>>>> >> slower than for the others. 
> >>>>> >> 
> >>>>> >> The assembly[2] and LLVM[3] code of this function is also in the 
> >>>>> >> same repo. I see the same behavior on both 0.3 and 0.4, with LLVM 
> >>>>> >> 3.3 and LLVM 3.6, on two different x86_64 machines (my laptop and a 
> >>>>> >> Linode VPS). The only platform I've tried that doesn't show similar 
> >>>>> >> behavior is Julia 0.4 on qemu-arm... although the performance 
> >>>>> >> between different values still differs by ~30%, which is bigger 
> >>>>> >> than noise. 
> >>>>> >> 
> >>>>> >> This also seems to depend on the initial value. 
> >>>>> >> 
> >>>>> >> Has anyone seen similar problems before? 
> >>>>> >> 
> >>>>> >> Outputs: 
> >>>>> >> 
> >>>>> >> 325.821 milliseconds (25383 allocations: 1159 KB) 
> >>>>> >> 307.826 milliseconds (4 allocations: 144 bytes) 
> >>>>> >> 0.0 
> >>>>> >>  19.227 milliseconds (2 allocations: 48 bytes) 
> >>>>> >> 0.1 
> >>>>> >>  17.291 milliseconds (2 allocations: 48 bytes) 
> >>>>> >> 0.2 
> >>>>> >>  17.404 milliseconds (2 allocations: 48 bytes) 
> >>>>> >> 0.3 
> >>>>> >>  19.231 milliseconds (2 allocations: 48 bytes) 
> >>>>> >> 0.4 
> >>>>> >>  20.278 milliseconds (2 allocations: 48 bytes) 
> >>>>> >> 0.5 
> >>>>> >>  23.692 milliseconds (2 allocations: 48 bytes) 
> >>>>> >> 0.6 
> >>>>> >> 328.107 milliseconds (2 allocations: 48 bytes) 
> >>>>> >> 0.7 
> >>>>> >> 312.425 milliseconds (2 allocations: 48 bytes) 
> >>>>> >> 0.8 
> >>>>> >> 201.494 milliseconds (2 allocations: 48 bytes) 
> >>>>> >> 0.9 
> >>>>> >>  16.314 milliseconds (2 allocations: 48 bytes) 
> >>>>> >> 1.0 
> >>>>> >>  16.264 milliseconds (2 allocations: 48 bytes) 
> >>>>> >> 
> >>>>> >> 
> >>>>> >> [1] 
> >>>>> >> 
> https://github.com/yuyichao/explore/blob/e4be0151df33571c1c22f54fe044c929ca821c46/julia/array_prop/array_prop.jl
>  
> >>>>> >> [2] 
> >>>>> >> 
> https://github.com/yuyichao/explore/blob/e4be0151df33571c1c22f54fe044c929ca821c46/julia/array_prop/propagate.S
>  
> >>>>> >> [3] 
> >>>>> >> 
> https://github.com/yuyichao/explore/blob/e4be0151df33571c1c22f54fe044c929ca821c46/julia/array_prop/propagate.ll
>  
>
