On Mon, Jul 13, 2015 at 2:20 AM, Jeffrey Sarnoff <[email protected]> wrote:
> and this: Cleve Moler tries to see it your way
> Moler on floating point denormals
>
>
> On Monday, July 13, 2015 at 2:11:22 AM UTC-4, Jeffrey Sarnoff wrote:
>>
>> Denormals were made part of the IEEE Floating Point standard after some
>> very careful numerical analysis showed that accommodating them would
>> substantively improve the quality of floating point results, and this would
>> lift the quality of all floating point work. Surprising as it may be,
>> nonetheless you (and if not you today, then you tomorrow, or one of your
>> neighbors today) really do care about those unusual, and often rarely
>> observed, values.
>>
>> fyi
>> William Kahan on the introduction of denormals to the standard
>> and an early, important paper on this:
>> Effects of Underflow on Solving Linear Systems - J. Demmel 1981
>>
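As a concrete picture of the gradual underflow those references describe, here is a small illustrative sketch (nothing specific to the code in this thread; `nextfloat` and `issubnormal` are standard Base functions, and the literal shown is the usual smallest normal Float64):

```julia
# Sketch: where the subnormal (denormal) Float64 range sits.
smallest_normal    = 2.2250738585072014e-308   # 2^-1022, smallest normal Float64
smallest_subnormal = nextfloat(0.0)             # 5.0e-324 == 2^-1074

issubnormal(smallest_normal)       # false
issubnormal(smallest_normal / 2)   # true: values shrink gradually instead of
                                   # jumping straight to 0.0

# The property gradual underflow preserves: x - y == 0 only when x == y.
x, y = 3.0e-308, 2.5e-308
x - y                              # ~5.0e-309, a nonzero subnormal
issubnormal(x - y)                 # true; with flush-to-zero this would be 0.0
```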
Thank you very much for the references. I definitely believe every part of the
IEEE Floating Point standard has its reason to be there; I'm more wondering in
which cases they are actually significant. OTOH, I also believe there are
certain kinds of computation that do not care about them, which is why
-ffast-math exists.

From the MathWorks blog reference:

> Double precision denormals are so tiny that they are rarely numerically
> significant, but single precision denormals can be in the range where they
> affect some otherwise unremarkable computations.

For my computation, I'm currently using double precision, but I've already
checked that switching to single precision still gives me enough precision.
Based on this, can I say that I can ignore them if I use double precision but
may need to keep them if I switch to single precision? Using Float64 takes
twice as long as using Float32, while keeping denormals seems to take 10x the
time.....

As for doing this in Julia, I found @simonbyrne's mxcsr.jl[1]. However, I
couldn't get it working without #11604[2]. Inline assembly in llvmcall is
working on LLVM 3.6 though[3], in case that's useful for others.

[1] https://gist.github.com/simonbyrne/9c1e4704be46b66b1485
[2] https://github.com/JuliaLang/julia/pull/11604
[3] https://github.com/yuyichao/explore/blob/a47cef8c84ad3f43b18e0fd797dca9debccdd250/julia/array_prop/array_prop.jl#L3
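For reference, later Julia versions expose this directly: Base gained `set_zero_subnormals` / `get_zero_subnormals`, which on x86 flip the same MXCSR FTZ/DAZ bits that gcc's `set_fast_math` stub and mxcsr.jl touch. A minimal sketch, assuming a Julia build new enough to ship these functions (they may not be available on the 0.3/0.4 builds discussed here):

```julia
# Sketch: flush subnormals to zero for the current thread, assuming
# `set_zero_subnormals`/`get_zero_subnormals` exist in Base.
old = get_zero_subnormals()     # remember the current mode
set_zero_subnormals(true)       # returns false if the hardware can't do it

# ... run the value-sensitive code here, e.g. propagate(P, ψ0, ψs, eΓ);
# subnormal intermediates are now treated as 0.0, so the slow path is
# avoided at the cost of losing gradual underflow ...

set_zero_subnormals(old)        # restore the default, IEEE-conformant mode
```

Note that this changes the floating-point mode for the whole thread rather than per operation, so it is worth restoring the old mode once the hot loop is done.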
>>
>> On Monday, July 13, 2015 at 12:35:24 AM UTC-4, Yichao Yu wrote:
>>>
>>> On Sun, Jul 12, 2015 at 10:30 PM, Yichao Yu <[email protected]> wrote:
>>> > Further update:
>>> >
>>> > I made a c++ version[1] and see a similar effect (depending on
>>> > optimization levels) so it's not a julia issue (not that I think it
>>> > really was to begin with...).
>>>
>>> After investigating the c++ version more, I find that the difference
>>> between the fast_math and the non-fast_math version is that the
>>> compiler emits a function called `set_fast_math` (see below).
>>>
>>> From what I can tell, the function sets bit 6 and bit 15 of the MXCSR
>>> register (for SSE), and according to this page[1] these are the DAZ and
>>> FZ bits (both related to underflow). It also describes denormals as
>>> taking "considerably longer to process". Since the operation I have
>>> keeps decreasing the value, I guess it makes sense that there's a
>>> value-dependent performance.... (and it kind of makes sense that fft
>>> also suffers from these values)
>>>
>>> So now the questions are:
>>>
>>> 1. How important are underflow and denormal values? Note that I'm not
>>> catching underflow explicitly anyway and I don't really care about
>>> values that are really small compared to 1.
>>>
>>> 2. Is there a way to set up the SSE registers as done by the c
>>> compilers? @fastmath does not seem to be doing this.
>>>
>>> 00000000000005b0 <set_fast_math>:
>>>  5b0:   0f ae 5c 24 fc          stmxcsr -0x4(%rsp)
>>>  5b5:   81 4c 24 fc 40 80 00    orl    $0x8040,-0x4(%rsp)
>>>  5bc:   00
>>>  5bd:   0f ae 54 24 fc          ldmxcsr -0x4(%rsp)
>>>  5c2:   c3                      retq
>>>  5c3:   66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
>>>  5ca:   00 00 00
>>>  5cd:   0f 1f 00                nopl   (%rax)
>>>
>>> [1] http://softpixel.com/~cwright/programming/simd/sse.php
>>>
>>> >
>>> > The slowdown is present in the c++ version for all optimization
>>> > levels except Ofast and ffast-math. The julia version is faster than
>>> > the default performance for both gcc and clang but is slower in the
>>> > fast case for higher optimization levels. For O2 and higher, the c++
>>>
>>> The slowness of the julia version seems to be due to multidimensional
>>> arrays. Using 1d arrays yields performance similar to C.
>>>
>>> > version shows a ~100x slowdown for the slow case.
>>> >
>>> > @fastmath in julia doesn't seem to have an effect for this although
>>> > it does for clang and gcc...
>>> >
>>> > [1]
>>> > https://github.com/yuyichao/explore/blob/5a644cd46dc6f8056cee69f508f9e995b5839a01/julia/array_prop/propagate.cpp
>>> >
>>> > On Sun, Jul 12, 2015 at 9:23 PM, Yichao Yu <[email protected]> wrote:
>>> >> Update:
>>> >>
>>> >> I've just got an even simpler version without any complex numbers
>>> >> that only uses Float64. The two loops are as small as the following
>>> >> LLVM IR now, and there's only simple arithmetic in the loop body.
>>> >>
>>> >> ```llvm
>>> >> L9.preheader:                    ; preds = %L12, %L9.preheader.preheader
>>> >>   %"#s3.0" = phi i64 [ %60, %L12 ], [ 1, %L9.preheader.preheader ]
>>> >>   br label %L9
>>> >>
>>> >> L9:                              ; preds = %L9, %L9.preheader
>>> >>   %"#s4.0" = phi i64 [ %44, %L9 ], [ 1, %L9.preheader ]
>>> >>   %44 = add i64 %"#s4.0", 1
>>> >>   %45 = add i64 %"#s4.0", -1
>>> >>   %46 = mul i64 %45, %10
>>> >>   %47 = getelementptr double* %7, i64 %46
>>> >>   %48 = load double* %47, align 8
>>> >>   %49 = add i64 %46, 1
>>> >>   %50 = getelementptr double* %7, i64 %49
>>> >>   %51 = load double* %50, align 8
>>> >>   %52 = fmul double %51, %3
>>> >>   %53 = fmul double %38, %48
>>> >>   %54 = fmul double %33, %52
>>> >>   %55 = fadd double %53, %54
>>> >>   store double %55, double* %50, align 8
>>> >>   %56 = fmul double %38, %52
>>> >>   %57 = fmul double %33, %48
>>> >>   %58 = fsub double %56, %57
>>> >>   store double %58, double* %47, align 8
>>> >>   %59 = icmp eq i64 %"#s4.0", %12
>>> >>   br i1 %59, label %L12, label %L9
>>> >>
>>> >> L12:                             ; preds = %L9
>>> >>   %60 = add i64 %"#s3.0", 1
>>> >>   %61 = icmp eq i64 %"#s3.0", %42
>>> >>   br i1 %61, label %L14.loopexit, label %L9.preheader
>>> >> ```
>>> >>
>>> >> On Sun, Jul 12, 2015 at 9:01 PM, Yichao Yu <[email protected]> wrote:
>>> >>> On Sun, Jul 12, 2015 at 8:31 PM, Kevin Owens <[email protected]>
>>> >>> wrote:
>>> >>>> I can't really help you debug the IR code, but I can at least say
>>> >>>> I'm seeing a similar thing. It starts to slow down just after 0.5,
>>> >>>> and doesn't
>>> >>>
>>> >>> Thanks for the confirmation. At least I'm not crazy (or not the only
>>> >>> one, to be more precise :P )
>>> >>>
>>> >>>> get back to where it was at 0.5 until 0.87. Can you compare the IR
>>> >>>> code when two different values are used, to see what's different?
>>> >>>> When I tried looking at the difference between 0.50 and 0.51, the
>>> >>>> biggest thing that popped out to me was that the numbers after
>>> >>>> "!dbg" were different.
>>> >>>
>>> >>> This is exactly the strange part. I don't think either llvm or julia
>>> >>> is doing constant propagation here, and different input values should
>>> >>> be using the same function. The evidence is:
>>> >>>
>>> >>> 1. Julia only specializes on types, not values (for now).
>>> >>> 2. The function is not inlined, which has to be the case in global
>>> >>> scope and can be double checked by adding `@noinline`. It's also too
>>> >>> big to be inline_worthy.
>>> >>> 3. No codegen is happening except on the first call. This can be seen
>>> >>> from the output of `@time`. Only the first call has thousands of
>>> >>> allocations; following calls have fewer than 5 (on 0.4 at least). If
>>> >>> julia can compile a function with fewer than 5 allocations, I'll be
>>> >>> very happy.....
>>> >>> 4. My original version also has much more complicated logic to
>>> >>> compute that scaling factor (ok, it's just a function call, but with
>>> >>> parameters gathered from different arguments). I'd be really
>>> >>> surprised if either llvm or julia can reason anything about it....
>>> >>>
>>> >>> The difference in the debug info should just be an artifact of
>>> >>> emitting it twice.
>>> >>>
>>> >>>>
>>> >>>> Even 0.50001 is a lot slower:
>>> >>>>
>>> >>>> julia> for eΓ in 0.5:0.00001:0.50015
>>> >>>>            println(eΓ)
>>> >>>>            gc()
>>> >>>>            @time ψs = propagate(P, ψ0, ψs, eΓ)
>>> >>>>        end
>>> >>>> 0.5
>>> >>>> elapsed time: 0.065609581 seconds (16 bytes allocated)
>>> >>>> 0.50001
>>> >>>> elapsed time: 0.875806461 seconds (16 bytes allocated)
>>> >>>
>>> >>> This is actually interesting and I can confirm the same here. `0.5`
>>> >>> takes 28ms while 0.5000000000000001 takes 320ms. I was wondering
>>> >>> whether the cpu is doing something special depending on the bit
>>> >>> pattern, but I still don't understand why it would be very bad for a
>>> >>> certain range (and it's not only a function of this value either; it
>>> >>> is also affected by all the other values).
>>> >>>
>>> >>>>
>>> >>>>
>>> >>>> julia> versioninfo()
>>> >>>> Julia Version 0.3.9
>>> >>>> Commit 31efe69 (2015-05-30 11:24 UTC)
>>> >>>> Platform Info:
>>> >>>>   System: Darwin (x86_64-apple-darwin13.4.0)
>>> >>>
>>> >>> Good, so it's not Linux only.
>>> >>>
>>> >>>>   CPU: Intel(R) Core(TM)2 Duo CPU T8300 @ 2.40GHz
>>> >>>>   WORD_SIZE: 64
>>> >>>>   BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Penryn)
>>> >>>>   LAPACK: libopenblas
>>> >>>>   LIBM: libopenlibm
>>> >>>>   LLVM: libLLVM-3.3
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>> On Sunday, July 12, 2015 at 6:45:53 PM UTC-5, Yichao Yu wrote:
>>> >>>>>
>>> >>>>> On Sun, Jul 12, 2015 at 7:40 PM, Yichao Yu <[email protected]>
>>> >>>>> wrote:
>>> >>>>> > P.S. Given how strange this problem is for me, I would appreciate
>>> >>>>> > it if anyone can confirm that either this is a real issue or I'm
>>> >>>>> > somehow being crazy or stupid.
>>> >>>>> >
>>> >>>>>
>>> >>>>> One additional strange property of this issue is that I used to
>>> >>>>> have much more costly operations in the (outer) loop (the one that
>>> >>>>> iterates over nsteps with i), like Fourier transformations.
>>> >>>>> However, when the scaling factor takes the bad value, it slows
>>> >>>>> everything down (i.e. the Fourier transformation is also slower by
>>> >>>>> ~10x).
>>> >>>>>
>>> >>>>> >
>>> >>>>> >
>>> >>>>> > On Sun, Jul 12, 2015 at 7:30 PM, Yichao Yu <[email protected]>
>>> >>>>> > wrote:
>>> >>>>> >> Hi,
>>> >>>>> >>
>>> >>>>> >> I've just seen a very strange (for me) performance difference
>>> >>>>> >> for exactly the same code on slightly different input with no
>>> >>>>> >> explicit branches.
>>> >>>>> >>
>>> >>>>> >> The code is available here[1]. The most relevant part is the
>>> >>>>> >> following function. (All other parts of the code are for
>>> >>>>> >> initialization and benchmarking.) This is a simplified version
>>> >>>>> >> of my simulation that computes the next column in the array
>>> >>>>> >> based on the previous one.
>>> >>>>> >>
>>> >>>>> >> The strange part is that the performance of this function can
>>> >>>>> >> differ by 10x depending on the value of the scaling factor
>>> >>>>> >> (`eΓ`, whose only use is marked in the code below), even though
>>> >>>>> >> I don't see any branches that depend on that value in the
>>> >>>>> >> relevant code (unless the cpu is 10x less efficient for certain
>>> >>>>> >> input values).
>>> >>>>> >>
>>> >>>>> >> function propagate(P, ψ0, ψs, eΓ)
>>> >>>>> >>     @inbounds for i in 1:P.nele
>>> >>>>> >>         ψs[1, i, 1] = ψ0[1, i]
>>> >>>>> >>         ψs[2, i, 1] = ψ0[2, i]
>>> >>>>> >>     end
>>> >>>>> >>     T12 = im * sin(P.Ω)
>>> >>>>> >>     T11 = cos(P.Ω)
>>> >>>>> >>     @inbounds for i in 2:(P.nstep + 1)
>>> >>>>> >>         for j in 1:P.nele
>>> >>>>> >>             ψ_e = ψs[1, j, i - 1]
>>> >>>>> >>             ψ_g = ψs[2, j, i - 1] * eΓ # <---- Scaling factor
>>> >>>>> >>             ψs[2, j, i] = T11 * ψ_e + T12 * ψ_g
>>> >>>>> >>             ψs[1, j, i] = T11 * ψ_g + T12 * ψ_e
>>> >>>>> >>         end
>>> >>>>> >>     end
>>> >>>>> >>     ψs
>>> >>>>> >> end
>>> >>>>> >>
>>> >>>>> >> The output of the full script is attached, and it can be clearly
>>> >>>>> >> seen that for scaling factors 0.6-0.8 the performance is 5-10
>>> >>>>> >> times slower than for the others.
>>> >>>>> >>
>>> >>>>> >> The assembly[2] and llvm[3] code of this function are also in
>>> >>>>> >> the same repo. I see the same behavior on both 0.3 and 0.4, and
>>> >>>>> >> with LLVM 3.3 and LLVM 3.6, on two different x86_64 machines (my
>>> >>>>> >> laptop and a linode VPS). (The only platform I've tried that
>>> >>>>> >> doesn't show similar behavior is julia 0.4 on qemu-arm, although
>>> >>>>> >> there the performance between different values also differs by
>>> >>>>> >> ~30%, which is bigger than noise.)
>>> >>>>> >>
>>> >>>>> >> This also seems to depend on the initial value.
>>> >>>>> >>
>>> >>>>> >> Has anyone seen similar problems before?
>>> >>>>> >>
>>> >>>>> >> Outputs:
>>> >>>>> >>
>>> >>>>> >> 325.821 milliseconds (25383 allocations: 1159 KB)
>>> >>>>> >> 307.826 milliseconds (4 allocations: 144 bytes)
>>> >>>>> >> 0.0
>>> >>>>> >> 19.227 milliseconds (2 allocations: 48 bytes)
>>> >>>>> >> 0.1
>>> >>>>> >> 17.291 milliseconds (2 allocations: 48 bytes)
>>> >>>>> >> 0.2
>>> >>>>> >> 17.404 milliseconds (2 allocations: 48 bytes)
>>> >>>>> >> 0.3
>>> >>>>> >> 19.231 milliseconds (2 allocations: 48 bytes)
>>> >>>>> >> 0.4
>>> >>>>> >> 20.278 milliseconds (2 allocations: 48 bytes)
>>> >>>>> >> 0.5
>>> >>>>> >> 23.692 milliseconds (2 allocations: 48 bytes)
>>> >>>>> >> 0.6
>>> >>>>> >> 328.107 milliseconds (2 allocations: 48 bytes)
>>> >>>>> >> 0.7
>>> >>>>> >> 312.425 milliseconds (2 allocations: 48 bytes)
>>> >>>>> >> 0.8
>>> >>>>> >> 201.494 milliseconds (2 allocations: 48 bytes)
>>> >>>>> >> 0.9
>>> >>>>> >> 16.314 milliseconds (2 allocations: 48 bytes)
>>> >>>>> >> 1.0
>>> >>>>> >> 16.264 milliseconds (2 allocations: 48 bytes)
>>> >>>>> >>
>>> >>>>> >>
>>> >>>>> >> [1]
>>> >>>>> >> https://github.com/yuyichao/explore/blob/e4be0151df33571c1c22f54fe044c929ca821c46/julia/array_prop/array_prop.jl
>>> >>>>> >> [2]
>>> >>>>> >> https://github.com/yuyichao/explore/blob/e4be0151df33571c1c22f54fe044c929ca821c46/julia/array_prop/propagate.S
>>> >>>>> >> [3]
>>> >>>>> >> https://github.com/yuyichao/explore/blob/e4be0151df33571c1c22f54fe044c929ca821c46/julia/array_prop/propagate.ll
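The value-dependent timings quoted above (fast for 0.0-0.5 and 0.9-1.0, roughly 10x slower for 0.6-0.8) are consistent with the inner loop spending many iterations on subnormal intermediates. A minimal way to see the raw cost of subnormal operands on a given machine is sketched below; `accumulate_scaled` and the array values are purely illustrative (not taken from the code in this thread), and the exact ratio varies by CPU:

```julia
# Sketch: compare arithmetic on normal vs. subnormal Float64 operands.
# The slowdown typically disappears once subnormals are flushed to zero
# (FTZ/DAZ), e.g. via set_zero_subnormals(true) where available.
function accumulate_scaled(xs, s)
    acc = 0.0
    @inbounds for x in xs
        acc += x * s          # subnormal inputs to the multiply hit the slow path
    end
    acc
end

normal = fill(1.0e-30, 10^6)    # comfortably inside the normal range
tiny   = fill(1.0e-310, 10^6)   # subnormal Float64 values
@assert all(issubnormal, tiny)

accumulate_scaled(normal, 0.5)  # warm up / compile
accumulate_scaled(tiny, 0.5)
@time accumulate_scaled(normal, 0.5)
@time accumulate_scaled(tiny, 0.5)   # typically several times slower unless
                                     # subnormals are flushed to zero
```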
