and this: Cleve Moler tries to see it your way
Moler on floating point denormals
<http://blogs.mathworks.com/cleve/2014/07/21/floating-point-denormals-insignificant-but-controversial-2/>
On Monday, July 13, 2015 at 2:11:22 AM UTC-4, Jeffrey Sarnoff wrote:
>
> Denormals were made part of the IEEE Floating Point standard after some
> very careful numerical analysis showed that accommodating them would
> substantively improve the quality of floating point results, and this would
> lift the quality of all floating point work. Surprising as it may be,
> nonetheless you (and if not you today, you tomorrow, or one of your
> neighbors today) really do care about those unusual, and often rarely
> observed, values.
>
> fyi
> William Kahan on the introduction of denormals to the standard
> <https://www.cs.berkeley.edu/~wkahan/ieee754status/754story.html>
> and an early, important paper on this:
> Effects of Underflow on Solving Linear Systems - J. Demmel 1981
> <http://www.eecs.berkeley.edu/Pubs/TechRpts/1983/CSD-83-128.pdf>
>
> On Monday, July 13, 2015 at 12:35:24 AM UTC-4, Yichao Yu wrote:
>>
>> On Sun, Jul 12, 2015 at 10:30 PM, Yichao Yu <yyc...@gmail.com> wrote:
>> > Further update:
>> >
>> > I made a C++ version[1] and see a similar effect (depending on
>> > optimization levels), so it's not a Julia issue (not that I thought it
>> > really was to begin with...).
>>
>> After investigating the C++ version more, I find that the difference
>> between the fast-math and the non-fast-math version is that the
>> compiler emits a function called `set_fast_math` (see below).
>>
>> From what I can tell, the function sets bit 6 and bit 15 of the MXCSR
>> register (for SSE), and according to this page[1] these are the DAZ and
>> FZ (flush-to-zero) bits, both related to underflow. It also describes
>> denormals as taking "considerably longer to process". Since the
>> operation I have keeps decreasing the value, I guess it makes sense
>> that the performance is value dependent... (and it kind of makes sense
>> that fft also suffers from these values)
>>
>> So now the questions are:
>>
>> 1. How important are underflow and denormal values?
>> Note that I'm not
>> catching underflow explicitly anyway, and I don't really care about
>> values that are really small compared to 1.
>>
>> 2. Is there a way to set up the SSE registers as done by the C
>> compilers? @fastmath does not seem to be doing this.
>>
>> 00000000000005b0 <set_fast_math>:
>>  5b0: 0f ae 5c 24 fc        stmxcsr -0x4(%rsp)
>>  5b5: 81 4c 24 fc 40 80 00  orl $0x8040,-0x4(%rsp)
>>  5bc: 00
>>  5bd: 0f ae 54 24 fc        ldmxcsr -0x4(%rsp)
>>  5c2: c3                    retq
>>  5c3: 66 2e 0f 1f 84 00 00  nopw %cs:0x0(%rax,%rax,1)
>>  5ca: 00 00 00
>>  5cd: 0f 1f 00              nopl (%rax)
>>
>> [1] http://softpixel.com/~cwright/programming/simd/sse.php
>>
>> >
>> > The slowdown is present in the C++ version at all optimization
>> > levels except -Ofast and -ffast-math. The Julia version is faster than
>> > the default performance for both gcc and clang but is slower than
>> > the fast case at higher optimization levels. For -O2 and higher, the C++
>>
>> The slowness of the Julia version seems to be due to multi-dimensional
>> arrays. Using a 1d array yields performance similar to C.
>>
>> > version shows a ~100x slowdown for the slow case.
>> >
>> > @fastmath in Julia doesn't seem to have an effect on this, although
>> > it does for clang and gcc...
>> >
>> > [1] https://github.com/yuyichao/explore/blob/5a644cd46dc6f8056cee69f508f9e995b5839a01/julia/array_prop/propagate.cpp
>> >
>> > On Sun, Jul 12, 2015 at 9:23 PM, Yichao Yu <yyc...@gmail.com> wrote:
>> >> Update:
>> >>
>> >> I've just got an even simpler version without any complex numbers that
>> >> only has Float64. The two loops are as small as the following LLVM IR
>> >> now, and there's only simple arithmetic in the loop body.
>> >>
>> >> ```llvm
>> >> L9.preheader:                         ; preds = %L12, %L9.preheader.preheader
>> >>   %"#s3.0" = phi i64 [ %60, %L12 ], [ 1, %L9.preheader.preheader ]
>> >>   br label %L9
>> >>
>> >> L9:                                   ; preds = %L9, %L9.preheader
>> >>   %"#s4.0" = phi i64 [ %44, %L9 ], [ 1, %L9.preheader ]
>> >>   %44 = add i64 %"#s4.0", 1
>> >>   %45 = add i64 %"#s4.0", -1
>> >>   %46 = mul i64 %45, %10
>> >>   %47 = getelementptr double* %7, i64 %46
>> >>   %48 = load double* %47, align 8
>> >>   %49 = add i64 %46, 1
>> >>   %50 = getelementptr double* %7, i64 %49
>> >>   %51 = load double* %50, align 8
>> >>   %52 = fmul double %51, %3
>> >>   %53 = fmul double %38, %48
>> >>   %54 = fmul double %33, %52
>> >>   %55 = fadd double %53, %54
>> >>   store double %55, double* %50, align 8
>> >>   %56 = fmul double %38, %52
>> >>   %57 = fmul double %33, %48
>> >>   %58 = fsub double %56, %57
>> >>   store double %58, double* %47, align 8
>> >>   %59 = icmp eq i64 %"#s4.0", %12
>> >>   br i1 %59, label %L12, label %L9
>> >>
>> >> L12:                                  ; preds = %L9
>> >>   %60 = add i64 %"#s3.0", 1
>> >>   %61 = icmp eq i64 %"#s3.0", %42
>> >>   br i1 %61, label %L14.loopexit, label %L9.preheader
>> >> ```
>> >>
>> >> On Sun, Jul 12, 2015 at 9:01 PM, Yichao Yu <yyc...@gmail.com> wrote:
>> >>> On Sun, Jul 12, 2015 at 8:31 PM, Kevin Owens <kevin....@gmail.com> wrote:
>> >>>> I can't really help you debug the IR code, but I can at least say I'm seeing
>> >>>> a similar thing. It starts to slow down just after 0.5, and doesn't
>> >>>
>> >>> Thanks for the confirmation. At least I'm not crazy (or not the only
>> >>> one, to be more precise :P)
>> >>>
>> >>>> get back to where it was at 0.5 until 0.87. Can you compare the IR code when
>> >>>> two different values are used, to see what's different? When I tried looking
>> >>>> at the difference between 0.50 and 0.51, the biggest thing that popped out
>> >>>> to me was that the numbers after "!dbg" were different.
>> >>>
>> >>> This is exactly the strange part.
>> >>> I don't think either llvm or julia
>> >>> is doing constant propagation here, and different input values should
>> >>> be using the same function. The evidence is:
>> >>>
>> >>> 1. Julia only specializes on types, not values (for now).
>> >>> 2. The function is not inlined, which has to be the case in global
>> >>> scope and can be double-checked by adding `@noinline`. It's also too
>> >>> big to be inline_worthy.
>> >>> 3. No codegen is happening except on the first call. This can be seen
>> >>> from the output of `@time`: only the first call has thousands of
>> >>> allocations; the following calls have fewer than 5 (on 0.4 at least). If
>> >>> julia could compile a function with fewer than 5 allocations, I'd be very
>> >>> happy...
>> >>> 4. My original version also has much more complicated logic to compute
>> >>> that scaling factor (ok, it's just a function call, but with
>> >>> parameters gathered from different arguments). I'd be really surprised
>> >>> if either llvm or julia could reason anything about it...
>> >>>
>> >>> The difference in the debug info should just be an artifact of
>> >>> emitting it twice.
>> >>>
>> >>>>
>> >>>> Even 0.50001 is a lot slower:
>> >>>>
>> >>>> julia> for eΓ in 0.5:0.00001:0.50015
>> >>>>            println(eΓ)
>> >>>>            gc()
>> >>>>            @time ψs = propagate(P, ψ0, ψs, eΓ)
>> >>>>        end
>> >>>> 0.5
>> >>>> elapsed time: 0.065609581 seconds (16 bytes allocated)
>> >>>> 0.50001
>> >>>> elapsed time: 0.875806461 seconds (16 bytes allocated)
>> >>>
>> >>> This is actually interesting, and I can confirm the same here: `0.5`
>> >>> takes 28ms while 0.5000000000000001 takes 320ms. I was wondering
>> >>> whether the cpu is doing something special depending on the bit pattern, but
>> >>> I still don't understand why it would be very bad for a certain range.
>> >>> (and it's not only a function of this either; it's also affected by all
>> >>> other values)
>> >>>
>> >>>>
>> >>>> julia> versioninfo()
>> >>>> Julia Version 0.3.9
>> >>>> Commit 31efe69 (2015-05-30 11:24 UTC)
>> >>>> Platform Info:
>> >>>>   System: Darwin (x86_64-apple-darwin13.4.0)
>> >>>
>> >>> Good, so it's not Linux-only.
>> >>>
>> >>>>   CPU: Intel(R) Core(TM)2 Duo CPU T8300 @ 2.40GHz
>> >>>>   WORD_SIZE: 64
>> >>>>   BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Penryn)
>> >>>>   LAPACK: libopenblas
>> >>>>   LIBM: libopenlibm
>> >>>>   LLVM: libLLVM-3.3
>> >>>>
>> >>>> On Sunday, July 12, 2015 at 6:45:53 PM UTC-5, Yichao Yu wrote:
>> >>>>>
>> >>>>> On Sun, Jul 12, 2015 at 7:40 PM, Yichao Yu <yyc...@gmail.com> wrote:
>> >>>>> > P.S. Given how strange this problem is for me, I would appreciate it if
>> >>>>> > anyone could confirm either that this is a real issue or that I'm somehow
>> >>>>> > being crazy or stupid.
>> >>>>> >
>> >>>>>
>> >>>>> One additional strange property of this issue is that I used to have
>> >>>>> much costlier operations in the (outer) loop (the one that iterates over
>> >>>>> nsteps with i), like Fourier transforms. However, when the scaling
>> >>>>> factor takes the bad value, it slows everything down (i.e. the
>> >>>>> Fourier transform is also slower by ~10x).
>> >>>>>
>> >>>>> >
>> >>>>> > On Sun, Jul 12, 2015 at 7:30 PM, Yichao Yu <yyc...@gmail.com> wrote:
>> >>>>> >> Hi,
>> >>>>> >>
>> >>>>> >> I've just seen a very strange (for me) performance difference for
>> >>>>> >> exactly the same code on slightly different input with no explicit
>> >>>>> >> branches.
>> >>>>> >>
>> >>>>> >> The code is available here[1]. The most relevant part is the following
>> >>>>> >> function. (All other parts of the code are for initialization and
>> >>>>> >> benchmarking.)
>> >>>>> >> This is a simplified version of my simulation that computes the
>> >>>>> >> next column in the array based on the previous one.
>> >>>>> >>
>> >>>>> >> The strange part is that the performance of this function can differ
>> >>>>> >> by 10x depending on the value of the scaling factor (`eΓ`, the only
>> >>>>> >> use of which is marked in the code below), even though I don't see any
>> >>>>> >> branches that depend on that value in the relevant code (unless the
>> >>>>> >> cpu is 10x less efficient for certain input values).
>> >>>>> >>
>> >>>>> >> function propagate(P, ψ0, ψs, eΓ)
>> >>>>> >>     @inbounds for i in 1:P.nele
>> >>>>> >>         ψs[1, i, 1] = ψ0[1, i]
>> >>>>> >>         ψs[2, i, 1] = ψ0[2, i]
>> >>>>> >>     end
>> >>>>> >>     T12 = im * sin(P.Ω)
>> >>>>> >>     T11 = cos(P.Ω)
>> >>>>> >>     @inbounds for i in 2:(P.nstep + 1)
>> >>>>> >>         for j in 1:P.nele
>> >>>>> >>             ψ_e = ψs[1, j, i - 1]
>> >>>>> >>             ψ_g = ψs[2, j, i - 1] * eΓ # <---- Scaling factor
>> >>>>> >>             ψs[2, j, i] = T11 * ψ_e + T12 * ψ_g
>> >>>>> >>             ψs[1, j, i] = T11 * ψ_g + T12 * ψ_e
>> >>>>> >>         end
>> >>>>> >>     end
>> >>>>> >>     ψs
>> >>>>> >> end
>> >>>>> >>
>> >>>>> >> The output of the full script is attached, and it can be clearly seen
>> >>>>> >> that for scaling factors 0.6-0.8, the performance is 5-10 times slower
>> >>>>> >> than for others.
>> >>>>> >>
>> >>>>> >> The assembly[2] and llvm[3] code of this function are also in the same
>> >>>>> >> repo. I see the same behavior on both 0.3 and 0.4, with LLVM 3.3
>> >>>>> >> and LLVM 3.6, on two different x86_64 machines (my laptop and a linode
>> >>>>> >> VPS). (The only platform I've tried that doesn't show similar behavior
>> >>>>> >> is julia 0.4 running on qemu-arm... although the performance
>> >>>>> >> between different values there also differs by ~30%, which is bigger
>> >>>>> >> than noise.)
>> >>>>> >>
>> >>>>> >> This also seems to depend on the initial value.
>> >>>>> >>
>> >>>>> >> Has anyone seen similar problems before?
>> >>>>> >>
>> >>>>> >> Outputs:
>> >>>>> >>
>> >>>>> >> 325.821 milliseconds (25383 allocations: 1159 KB)
>> >>>>> >> 307.826 milliseconds (4 allocations: 144 bytes)
>> >>>>> >> 0.0
>> >>>>> >> 19.227 milliseconds (2 allocations: 48 bytes)
>> >>>>> >> 0.1
>> >>>>> >> 17.291 milliseconds (2 allocations: 48 bytes)
>> >>>>> >> 0.2
>> >>>>> >> 17.404 milliseconds (2 allocations: 48 bytes)
>> >>>>> >> 0.3
>> >>>>> >> 19.231 milliseconds (2 allocations: 48 bytes)
>> >>>>> >> 0.4
>> >>>>> >> 20.278 milliseconds (2 allocations: 48 bytes)
>> >>>>> >> 0.5
>> >>>>> >> 23.692 milliseconds (2 allocations: 48 bytes)
>> >>>>> >> 0.6
>> >>>>> >> 328.107 milliseconds (2 allocations: 48 bytes)
>> >>>>> >> 0.7
>> >>>>> >> 312.425 milliseconds (2 allocations: 48 bytes)
>> >>>>> >> 0.8
>> >>>>> >> 201.494 milliseconds (2 allocations: 48 bytes)
>> >>>>> >> 0.9
>> >>>>> >> 16.314 milliseconds (2 allocations: 48 bytes)
>> >>>>> >> 1.0
>> >>>>> >> 16.264 milliseconds (2 allocations: 48 bytes)
>> >>>>> >>
>> >>>>> >> [1] https://github.com/yuyichao/explore/blob/e4be0151df33571c1c22f54fe044c929ca821c46/julia/array_prop/array_prop.jl
>> >>>>> >> [2] https://github.com/yuyichao/explore/blob/e4be0151df33571c1c22f54fe044c929ca821c46/julia/array_prop/propagate.S
>> >>>>> >> [3] https://github.com/yuyichao/explore/blob/e4be0151df33571c1c22f54fe044c929ca821c46/julia/array_prop/propagate.ll