and this: Cleve Moler tries to see it your way
Moler on floating point denormals
<http://blogs.mathworks.com/cleve/2014/07/21/floating-point-denormals-insignificant-but-controversial-2/>
On Monday, July 13, 2015 at 2:11:22 AM UTC-4, Jeffrey Sarnoff wrote:
>
> Denormals were made part of the IEEE Floating Point standard after some
> very careful numerical analysis showed that accommodating them would
> substantively improve the quality of floating point results, and this would
> lift the quality of all floating point work. Surprising as it may be,
> nonetheless you (and if not you today, you tomorrow, or one of your
> neighbors today) really do care about those unusual, and often rarely
> observed, values.
>
> fyi
> William Kahan on the introduction of denormals to the standard
> <https://www.cs.berkeley.edu/~wkahan/ieee754status/754story.html>
> and an early, important paper on this:
> Effects of Underflow on Solving Linear Systems - J. Demmel 1981
> <http://www.eecs.berkeley.edu/Pubs/TechRpts/1983/CSD-83-128.pdf>
>
> On Monday, July 13, 2015 at 12:35:24 AM UTC-4, Yichao Yu wrote:
>>
>> On Sun, Jul 12, 2015 at 10:30 PM, Yichao Yu <yyc...@gmail.com> wrote:
>> > Further update:
>> >
>> > I made a C++ version[1] and see a similar effect (depending on
>> > optimization levels), so it's not a Julia issue (not that I thought it
>> > really was to begin with...).
>>
>> After investigating the C++ version more, I find that the difference
>> between the fast-math and the non-fast-math version is that the
>> compiler emits a function called `set_fast_math` (see below).
>>
>> From what I can tell, the function sets bit 6 and bit 15 of the MXCSR
>> register (for SSE), and according to this page[1] these are the DAZ and
>> FZ (flush-to-zero) bits, both related to underflow. It also describes
>> denormals as taking "considerably longer to process". Since the
>> operation I have keeps decreasing the value, I guess it makes sense
>> that the performance is value dependent... (and it kind of makes sense
>> that fft also suffers from these values)
>>
>> So now the questions are:
>>
>> 1. How important are underflow and denormal values?
>> Note that I'm not
>> catching underflow explicitly anyway, and I don't really care about
>> values that are really small compared to 1.
>>
>> 2. Is there a way to set up the SSE registers as done by the C
>> compilers? @fastmath does not seem to be doing this.
>>
>> 00000000000005b0 <set_fast_math>:
>>  5b0: 0f ae 5c 24 fc        stmxcsr -0x4(%rsp)
>>  5b5: 81 4c 24 fc 40 80 00  orl $0x8040,-0x4(%rsp)
>>  5bc: 00
>>  5bd: 0f ae 54 24 fc        ldmxcsr -0x4(%rsp)
>>  5c2: c3                    retq
>>  5c3: 66 2e 0f 1f 84 00 00  nopw %cs:0x0(%rax,%rax,1)
>>  5ca: 00 00 00
>>  5cd: 0f 1f 00              nopl (%rax)
>>
>> [1] http://softpixel.com/~cwright/programming/simd/sse.php
>>
>> >
>> > The slowdown is present in the C++ version at all optimization
>> > levels except -Ofast and -ffast-math. The Julia version is faster than
>> > the default performance for both gcc and clang but is slower than
>> > the fast case at higher optimization levels. For -O2 and higher, the C++
>>
>> The slowness of the Julia version seems to be due to multi-dimensional
>> arrays. Using a 1d array yields performance similar to C.
>>
>> > version shows a ~100x slowdown for the slow case.
>> >
>> > @fastmath in Julia doesn't seem to have an effect on this, although
>> > it does for clang and gcc...
>> >
>> > [1] https://github.com/yuyichao/explore/blob/5a644cd46dc6f8056cee69f508f9e995b5839a01/julia/array_prop/propagate.cpp
>> >
>> > On Sun, Jul 12, 2015 at 9:23 PM, Yichao Yu <yyc...@gmail.com> wrote:
>> >> Update:
>> >>
>> >> I've just got an even simpler version without any complex numbers that
>> >> only has Float64. The two loops are as small as the following LLVM IR
>> >> now, and there's only simple arithmetic in the loop body.
>> >>
>> >> ```llvm
>> >> L9.preheader:                         ; preds = %L12, %L9.preheader.preheader
>> >>   %"#s3.0" = phi i64 [ %60, %L12 ], [ 1, %L9.preheader.preheader ]
>> >>   br label %L9
>> >>
>> >> L9:                                   ; preds = %L9, %L9.preheader
>> >>   %"#s4.0" = phi i64 [ %44, %L9 ], [ 1, %L9.preheader ]
>> >>   %44 = add i64 %"#s4.0", 1
>> >>   %45 = add i64 %"#s4.0", -1
>> >>   %46 = mul i64 %45, %10
>> >>   %47 = getelementptr double* %7, i64 %46
>> >>   %48 = load double* %47, align 8
>> >>   %49 = add i64 %46, 1
>> >>   %50 = getelementptr double* %7, i64 %49
>> >>   %51 = load double* %50, align 8
>> >>   %52 = fmul double %51, %3
>> >>   %53 = fmul double %38, %48
>> >>   %54 = fmul double %33, %52
>> >>   %55 = fadd double %53, %54
>> >>   store double %55, double* %50, align 8
>> >>   %56 = fmul double %38, %52
>> >>   %57 = fmul double %33, %48
>> >>   %58 = fsub double %56, %57
>> >>   store double %58, double* %47, align 8
>> >>   %59 = icmp eq i64 %"#s4.0", %12
>> >>   br i1 %59, label %L12, label %L9
>> >>
>> >> L12:                                  ; preds = %L9
>> >>   %60 = add i64 %"#s3.0", 1
>> >>   %61 = icmp eq i64 %"#s3.0", %42
>> >>   br i1 %61, label %L14.loopexit, label %L9.preheader
>> >> ```
>> >>
>> >> On Sun, Jul 12, 2015 at 9:01 PM, Yichao Yu <yyc...@gmail.com> wrote:
>> >>> On Sun, Jul 12, 2015 at 8:31 PM, Kevin Owens <kevin....@gmail.com> wrote:
>> >>>> I can't really help you debug the IR code, but I can at least say I'm seeing
>> >>>> a similar thing. It starts to slow down just after 0.5, and doesn't
>> >>>
>> >>> Thanks for the confirmation. At least I'm not crazy (or not the only
>> >>> one, to be more precise :P)
>> >>>
>> >>>> get back to where it was at 0.5 until 0.87. Can you compare the IR code when
>> >>>> two different values are used, to see what's different? When I tried looking
>> >>>> at the difference between 0.50 and 0.51, the biggest thing that popped out
>> >>>> to me was that the numbers after "!dbg" were different.
>> >>>
>> >>> This is exactly the strange part.
>> >>> I don't think either llvm or julia
>> >>> is doing constant propagation here, and different input values should
>> >>> be using the same function. The evidence is:
>> >>>
>> >>> 1. Julia only specializes on types, not values (for now).
>> >>> 2. The function is not inlined, which has to be the case in global
>> >>> scope and can be double-checked by adding `@noinline`. It's also too
>> >>> big to be inline_worthy.
>> >>> 3. No codegen is happening except on the first call. This can be seen
>> >>> from the output of `@time`: only the first call has thousands of
>> >>> allocations; the following calls have fewer than 5 (on 0.4 at least). If
>> >>> julia could compile a function with fewer than 5 allocations, I'd be very
>> >>> happy...
>> >>> 4. My original version also has much more complicated logic to compute
>> >>> that scaling factor (ok, it's just a function call, but with
>> >>> parameters gathered from different arguments). I'd be really surprised
>> >>> if either llvm or julia could reason anything about it...
>> >>>
>> >>> The difference in the debug info should just be an artifact of
>> >>> emitting it twice.
>> >>>
>> >>>>
>> >>>> Even 0.50001 is a lot slower:
>> >>>>
>> >>>> julia> for eΓ in 0.5:0.00001:0.50015
>> >>>>            println(eΓ)
>> >>>>            gc()
>> >>>>            @time ψs = propagate(P, ψ0, ψs, eΓ)
>> >>>>        end
>> >>>> 0.5
>> >>>> elapsed time: 0.065609581 seconds (16 bytes allocated)
>> >>>> 0.50001
>> >>>> elapsed time: 0.875806461 seconds (16 bytes allocated)
>> >>>
>> >>> This is actually interesting, and I can confirm the same here: `0.5`
>> >>> takes 28ms while 0.5000000000000001 takes 320ms. I was wondering
>> >>> whether the cpu is doing something special depending on the bit pattern, but
>> >>> I still don't understand why it would be very bad for a certain range.
>> >>> (and it's not only a function of this either; it's also affected by all
>> >>> other values)
>> >>>
>> >>>>
>> >>>> julia> versioninfo()
>> >>>> Julia Version 0.3.9
>> >>>> Commit 31efe69 (2015-05-30 11:24 UTC)
>> >>>> Platform Info:
>> >>>>   System: Darwin (x86_64-apple-darwin13.4.0)
>> >>>
>> >>> Good, so it's not Linux-only.
>> >>>
>> >>>>   CPU: Intel(R) Core(TM)2 Duo CPU T8300 @ 2.40GHz
>> >>>>   WORD_SIZE: 64
>> >>>>   BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Penryn)
>> >>>>   LAPACK: libopenblas
>> >>>>   LIBM: libopenlibm
>> >>>>   LLVM: libLLVM-3.3
>> >>>>
>> >>>> On Sunday, July 12, 2015 at 6:45:53 PM UTC-5, Yichao Yu wrote:
>> >>>>>
>> >>>>> On Sun, Jul 12, 2015 at 7:40 PM, Yichao Yu <yyc...@gmail.com> wrote:
>> >>>>> > P.S. Given how strange this problem is for me, I would appreciate it if
>> >>>>> > anyone could confirm either that this is a real issue or that I'm somehow
>> >>>>> > being crazy or stupid.
>> >>>>> >
>> >>>>>
>> >>>>> One additional strange property of this issue is that I used to have
>> >>>>> much costlier operations in the (outer) loop (the one that iterates over
>> >>>>> nsteps with i), like Fourier transforms. However, when the scaling
>> >>>>> factor takes the bad value, it slows everything down (i.e. the
>> >>>>> Fourier transform is also slower by ~10x).
>> >>>>>
>> >>>>> >
>> >>>>> > On Sun, Jul 12, 2015 at 7:30 PM, Yichao Yu <yyc...@gmail.com> wrote:
>> >>>>> >> Hi,
>> >>>>> >>
>> >>>>> >> I've just seen a very strange (for me) performance difference for
>> >>>>> >> exactly the same code on slightly different input with no explicit
>> >>>>> >> branches.
>> >>>>> >>
>> >>>>> >> The code is available here[1]. The most relevant part is the following
>> >>>>> >> function. (All other parts of the code are for initialization and
>> >>>>> >> benchmarking.)
>> >>>>> >> This is a simplified version of my simulation that computes the
>> >>>>> >> next column in the array based on the previous one.
>> >>>>> >>
>> >>>>> >> The strange part is that the performance of this function can differ
>> >>>>> >> by 10x depending on the value of the scaling factor (`eΓ`, the only
>> >>>>> >> use of which is marked in the code below), even though I don't see any
>> >>>>> >> branches that depend on that value in the relevant code (unless the
>> >>>>> >> cpu is 10x less efficient for certain input values).
>> >>>>> >>
>> >>>>> >> function propagate(P, ψ0, ψs, eΓ)
>> >>>>> >>     @inbounds for i in 1:P.nele
>> >>>>> >>         ψs[1, i, 1] = ψ0[1, i]
>> >>>>> >>         ψs[2, i, 1] = ψ0[2, i]
>> >>>>> >>     end
>> >>>>> >>     T12 = im * sin(P.Ω)
>> >>>>> >>     T11 = cos(P.Ω)
>> >>>>> >>     @inbounds for i in 2:(P.nstep + 1)
>> >>>>> >>         for j in 1:P.nele
>> >>>>> >>             ψ_e = ψs[1, j, i - 1]
>> >>>>> >>             ψ_g = ψs[2, j, i - 1] * eΓ # <---- Scaling factor
>> >>>>> >>             ψs[2, j, i] = T11 * ψ_e + T12 * ψ_g
>> >>>>> >>             ψs[1, j, i] = T11 * ψ_g + T12 * ψ_e
>> >>>>> >>         end
>> >>>>> >>     end
>> >>>>> >>     ψs
>> >>>>> >> end
>> >>>>> >>
>> >>>>> >> The output of the full script is attached, and it can be clearly seen
>> >>>>> >> that for scaling factors 0.6-0.8, the performance is 5-10 times slower
>> >>>>> >> than for others.
>> >>>>> >>
>> >>>>> >> The assembly[2] and llvm[3] code of this function are also in the same
>> >>>>> >> repo. I see the same behavior on both 0.3 and 0.4, with LLVM 3.3
>> >>>>> >> and LLVM 3.6, on two different x86_64 machines (my laptop and a linode
>> >>>>> >> VPS). (The only platform I've tried that doesn't show similar behavior
>> >>>>> >> is julia 0.4 running on qemu-arm... although the performance
>> >>>>> >> between different values there also differs by ~30%, which is bigger
>> >>>>> >> than noise.)
>> >>>>> >>
>> >>>>> >> This also seems to depend on the initial value.
>> >>>>> >>
>> >>>>> >> Has anyone seen similar problems before?
>> >>>>> >>
>> >>>>> >> Outputs:
>> >>>>> >>
>> >>>>> >> 325.821 milliseconds (25383 allocations: 1159 KB)
>> >>>>> >> 307.826 milliseconds (4 allocations: 144 bytes)
>> >>>>> >> 0.0
>> >>>>> >> 19.227 milliseconds (2 allocations: 48 bytes)
>> >>>>> >> 0.1
>> >>>>> >> 17.291 milliseconds (2 allocations: 48 bytes)
>> >>>>> >> 0.2
>> >>>>> >> 17.404 milliseconds (2 allocations: 48 bytes)
>> >>>>> >> 0.3
>> >>>>> >> 19.231 milliseconds (2 allocations: 48 bytes)
>> >>>>> >> 0.4
>> >>>>> >> 20.278 milliseconds (2 allocations: 48 bytes)
>> >>>>> >> 0.5
>> >>>>> >> 23.692 milliseconds (2 allocations: 48 bytes)
>> >>>>> >> 0.6
>> >>>>> >> 328.107 milliseconds (2 allocations: 48 bytes)
>> >>>>> >> 0.7
>> >>>>> >> 312.425 milliseconds (2 allocations: 48 bytes)
>> >>>>> >> 0.8
>> >>>>> >> 201.494 milliseconds (2 allocations: 48 bytes)
>> >>>>> >> 0.9
>> >>>>> >> 16.314 milliseconds (2 allocations: 48 bytes)
>> >>>>> >> 1.0
>> >>>>> >> 16.264 milliseconds (2 allocations: 48 bytes)
>> >>>>> >>
>> >>>>> >> [1] https://github.com/yuyichao/explore/blob/e4be0151df33571c1c22f54fe044c929ca821c46/julia/array_prop/array_prop.jl
>> >>>>> >> [2] https://github.com/yuyichao/explore/blob/e4be0151df33571c1c22f54fe044c929ca821c46/julia/array_prop/propagate.S
>> >>>>> >> [3] https://github.com/yuyichao/explore/blob/e4be0151df33571c1c22f54fe044c929ca821c46/julia/array_prop/propagate.ll