Re: [Haskell-cafe] Known Unknowns
--- Ketil Malde [EMAIL PROTECTED] wrote: Isaac Gouy [EMAIL PROTECTED] writes: Programmer skill and effort really does matter ;-) Yes, more so, than any inherent language disadvantage, perhaps, which happens to be the general lesson from the ICFP contests as well. Any idea if other languages have seen similar efforts? FreePascal and Smart Effiel, somewhat - and there have been excellent individual efforts with Lua and Tcl and ... imo the Haskell Cafe discussions and wiki have been a more open and shared learning experience than we usually see, and maybe some of the success stems from that collaboration and competition. __ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Known Unknowns
Joel Koerwer wrote: Don, that's a great little mini tutorial, exactly what I was hoping for. I'm looking forward to learning more tricks. On an unrelated note, I have an STUArray nbody. I haven't really looked closely at the chris+dons version, but I suspect they amount to doing the same thing. I get commensurate runtimes at least. But I'll post it on the wiki in a while in case there is some optimization I missed. On an even more unrelated note, I get slower runtimes with -optc-O3 and -optc-ffast-math than without. Individually or in tandem. Odd. This is ghc 6.4.1 and gcc 4.0.3 on a (Banias) Pentium M. More architecture benchmarking: On a powerbook G4, Joel Koerwer's entry on http://haskell.org/hawiki/NbodyEntry runs faster than dons+chris. And I edited it to make it smaller, but by hoisting 'size' and 'dt' I also made it run quite a bit faster. It now takes 1.7x less time than dons+chris. Don, could you check the speed on your architecture? -- Chris ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Known Unknowns
haskell: Joel Koerwer wrote: Don, that's a great little mini tutorial, exactly what I was hoping for. I'm looking forward to learning more tricks. On an unrelated note, I have an STUArray nbody. I haven't really looked closely at the chris+dons version, but I suspect they amount to doing the same thing. I get commensurate runtimes at least. But I'll post it on the wiki in a while in case there is some optimization I missed. On an even more unrelated note, I get slower runtimes with -optc-O3 and -optc-ffast-math than without. Individually or in tandem. Odd. This is ghc 6.4.1 and gcc 4.0.3 on a (Banias) Pentium M. More architecture benchmarking: On a powerbook G4, Joel Koerwer's entry on http://haskell.org/hawiki/NbodyEntry runs faster than dons+chris. And I edited it to make it smaller, but by hoisting 'size' and 'dt' I also made it run quite a bit faster. It now takes 1.7x less time than dons+chris. Don, could you check the speed on your architecture? Yes! I was missing the -funbox-strict-fields, it makes a huge difference. Check the wiki page, these stuarrays are the fastest yet. Cheers, Don ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Known Unknowns
Hey this is great. Chris your improvements are awesome. I mean the speed is nice, but you really cleaned up the code. There's an extraneous call to energy in the second runST block, but it should be insignificant. Also, -fglasgow-exts is necessary for the left-hand-side type declarations of size and dt. One question, in: calcMomentum (i+1) $! (px+vx*m,py+vy*m,pz+vz*m) that $! doesn't actually do much for a tuple, does it? Of course, there's not much point in further optimizing the initialization routine, as we'd never be able to detect the difference in runtime. The shootout has been a great learning tool for me :-) Thanks to Chris, Don, and the rest of the Haskell community. Joel ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Known Unknowns
Donald Bruce Stewart wrote: This entry in fact runs faster than the original (though not the new vectorised entry) optimised C entry (and faster than all other languages): http://shootout.alioth.debian.org/gp4/benchmark.php?test=partialsumslang=all So, by carefully tweaking things, we first squished a space leak, and then gained another 30%. In summary: * Check the Core that is generated * Watch out for optimisations that are missed * Read the generated C for the tight loops. * Make sure tight loops are unboxed * Use -fexcess-precision and -optc-ffast-math for doubles This is roughly the process I used for the other shootout entries. Cheers, Don I just looked hard at the new vectorised entry and the original entry for C. In both, the last two functions, which use the alt-ernating sign, are *not* done in the required naive fashion: sum = 0.0; for (k = 1; k = n-1; k += 2) sum += 1.0/kd; for (k = 2; k = n; k += 2) sum -= 1.0/kd; printf(%.9f\tAlternating Harmonic\n, sum); sum = 0.0; for (k = 1; k = 2*n-1; k += 4) sum += 1.0/kd; for (k = 3; k = 2*n; k += 4) sum -= 1.0/kd; printf(%.9f\tGregory\n, sum); As you can see, all the positive terms are added to sum, then all the negative terms. The double precision math comes to a different result, but this is hidden by printing only 9 digits. I just modified the g++ entry and the Haskell entry which do it right and the c entry to print more digits (e.g. show sum in Haskell). The Haskell entry and g++ entry agree, as expected. The c entry does *not* agree: AltHarm 0.6931469805600938 Haskell 0.6931469805600938283163259256980381906032562255859375... g++ 0.69314698056038037687898167860112152993679046630859375000... gcc Gregory 0.7853980633974356 Haskell 0.785398063397435564070292457472532987594604492187500... g++ 0.7853980633973864922126040255534462630748748779296875000... gcc The gcc entry is computing a different answer since it uses the wrong order for making the partial sum. -- Chris ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Known Unknowns
Bulat Ziganshin wrote: Hello Donald, Wednesday, February 01, 2006, 8:00:04 AM, you wrote: DBS Here's a brief introduction. I intend to write up (on the performance page on DBS the wiki) a list of things we've done to improve the shootout entries. N.B DBS we're now the 3rd *fastest* language, behind C and only a little behind D (a C DBS varient) !! 3rd fastest or 3rd overall, counting program lines and so on? That's the unconceivable thing. It is 3rd fastest. Looking at just Full CPU Time: C gcc 35.90 3 D Digital Mars 32.72 3 Haskell GHC 30.25 0 SML MLton 28.72 3 OCaml 27.92 1 Eiffel Smart26.17 6 C++ g++ 25.73 3 Nice24.43 4 Ada 95 GNAT 23.45 4 Clean 23.32 7 Java JDK 1.4 -server22.69 5 Java JDK -server22.39 5 Java JDK -client19.19 5 C# Mono 16.99 2 Only C gcc and D Digital Mars are ahead. Looking at Just Memory Use, Haskell is 8th languagescore missing C gcc 39.00 3 D Digital Mars 29.21 3 Forth GForth28.63 2 Ada 95 GNAT 27.12 4 Pascal Free 26.53 7 Eiffel Smart24.53 6 C++ g++ 24.46 3 Haskell GHC 24.28 0 OCaml 21.55 1 Fortran G95 20.21 6 Lua 19.63 2 SML MLton 17.80 3 Looking at Just Lines Of Code, Haskell is 1st by a mile: Haskell GHC 41.84 0 SML MLton 34.47 3 Forth GForth32.50 2 OCaml 30.86 1 Tcl 30.83 3 Python Psyco30.49 0 Python 30.33 1 Lua 29.17 2 Ruby27.69 4 Perl25.50 5 Nice25.09 4 C# Mono 24.59 2 D Digital Mars 22.79 3 C++ g++ 22.60 3 Java JDK -client21.12 5 Java JDK -Xint 21.12 5 Java JDK 1.4 -server21.12 5 Java JDK -server21.12 5 C gcc 20.98 3 Where I had to include lots of languages to get down to C gcc. Lookat at the 1:1:1 even balance of the above three, Haskell is 1st: Haskell GHC 96.37 0 C gcc 95.87 3 D Digital Mars 84.72 3 SML MLton 81.00 3 OCaml 80.33 1 Forth GForth75.17 2 C++ g++ 72.79 3 ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Known Unknowns
--- Chris Kuklewicz [EMAIL PROTECTED] wrote: -snip,snip- It is 3rd fastest. Looking at Just Memory Use, Haskell is 8th Looking at Just Lines Of Code, Haskell is 1st Lookat at the 1:1:1 even balance Haskell is 1st Programmer skill and effort really does matter ;-) Congratulations. __ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Known Unknowns
joelkoerwer: Thanks Chris. I was actually asking about analyzing Core output in general. I'm well aware of the problems we're having with the nbody entry. I'm convinced my list based version can go faster than it is now. That's why I was asking if Don could put together a few notes on how to optimize inner loops using -ddump-simpl and the resulting Core code. Here's a brief introduction. I intend to write up (on the performance page on the wiki) a list of things we've done to improve the shootout entries. N.B we're now the 3rd *fastest* language, behind C and only a little behind D (a C varient) !! Consider the partial sums problem: wiki: http://www.haskell.org/hawiki/PartialSumsEntry site: http://shootout.alioth.debian.org/gp4/benchmark.php?test=partialsumslang=ghcid=2 What follows is a discussion of the steps I took to improve the performance of this code. Here's the naive translation of the Clean entry (which was fairly quick): Lots of math in a tight loop. import System; import Numeric main = do n - getArgs = readIO . head let sums = loop 1 n 1 0 0 0 0 0 0 0 0 0 fn (s,t) = putStrLn $ (showFFloat (Just 9) s []) ++ \t ++ t mapM_ (fn :: (Double, String) - IO ()) (zip sums names) names = [(2/3)^k, k^-0.5, 1/k(k+1), Flint Hills, Cookson Hills , Harmonic, Riemann Zeta, Alternating Harmonic, Gregory] loop k n alt a1 a2 a3 a4 a5 a6 a7 a8 a9 | k n = [ a1, a2, a3, a4, a5, a6, a7, a8, a9 ] | otherwise = loop (k+1) n (-alt) (a1 + (2/3) ** (k-1)) (a2 + k ** (-0.5)) (a3 + 1 / (k * (k + 1))) (a4 + 1 / (k*k*k * sin k * sin k)) (a5 + 1 / (k*k*k * cos k * cos k)) (a6 + 1 / k) (a7 + 1 / (k*k)) (a8 + alt / k) (a9 + alt / (2 * k - 1)) Compiled with -O2. However, the performance is _really_ bad :/ Somewhere greater than 128M heap, in fact eventually running out of memory on my laptop. A classic space leak. (2) So look at the generated core. ghc -o naive Naive.hs -O2 -ddump-simpl | less And we find that our loop has the following type: $sloop_r2U6 :: GHC.Prim.Double# - GHC.Float.Double - GHC.Float.Double - GHC.Float.Double - GHC.Float.Double - GHC.Float.Double - GHC.Float.Double - GHC.Float.Double - GHC.Float.Double - GHC.Float.Double - GHC.Float.Double - GHC.Prim.Double# - [GHC.Float.Double] Hmm. Ok, I certainly don't want boxed doubles in such a tight loop. (3) My next step is to encourage GHC to unbox this loop, by providing some strictness annotations. Now the loop looks like this: loop k n alt a1 a2 a3 a4 a5 a6 a7 a8 a9 | () !k !n !False = undefined | k n = [ a1, a2, a3, a4, a5, a6, a7, a8, a9 ] | otherwise = loop (k+1) n (-alt) (a1 + (2/3) ** (k-1)) (a2 + k ** (-0.5)) (a3 + 1 / (k * (k + 1))) (a4 + 1 / (k*k*k * sin k * sin k)) (a5 + 1 / (k*k*k * cos k * cos k)) (a6 + 1 / k) (a7 + 1 / (k*k)) (a8 + alt / k) (a9 + alt / (2 * k - 1)) where x ! y = x `seq` y I've played a little game here, using ! for `seq`, reminiscent of the new !-pattern proposal for strictness. Let's see how this compiles. Here's the Core: $sloop_r2Vh :: GHC.Prim.Double# - GHC.Float.Double - GHC.Float.Double - GHC.Float.Double - GHC.Float.Double - GHC.Float.Double - GHC.Float.Double - GHC.Float.Double - GHC.Float.Double - GHC.Float.Double - GHC.Float.Double - GHC.Prim.Double# - [GHC.Float.Double] Ok, so it unboxed one extra argument. Let's see if we can get them all unboxed. Strictify all args, and GHC produces an inner loop of: $sloop_r2WS :: GHC.Prim.Double# - GHC.Prim.Double# - GHC.Prim.Double# - GHC.Prim.Double# - GHC.Prim.Double# - GHC.Prim.Double# - GHC.Prim.Double# - GHC.Prim.Double# - GHC.Prim.Double# - GHC.Prim.Double# - GHC.Prim.Double# - GHC.Prim.Double# - [GHC.Float.Double] Ah! perfect. Let's see how that runs: $ ghc Naive.hs -O2 -no-recomp $ time ./a.out 250 3.0 (2/3)^k 3160.817621887
Re: [Haskell-cafe] Known Unknowns
Thanks Chris. I was actually asking about analyzing Core output in general. I'm well aware of the problems we're having with the nbody entry.I'm convinced my list based version can go faster than it is now. That's why I was asking if Don could put together a few notes on how to optimize inner loops using -ddump-simpl and the resulting Core code. So, I guess my request is along the same lines as your earlier one: Also, could you explain how to check the Core (un)boxing in a note on the (new?) wiki? I would be interested in learning that trick. I would love to have a tutorial about common situations one comes across when looking at the Core output. ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Known Unknowns
Joel Koerwer wrote: On 1/26/06, *Donald Bruce Stewart* [EMAIL PROTECTED] mailto:[EMAIL PROTECTED] wrote: Ah, i just do: ghc A.hs -O2 -ddump-simpl | less and then read the Core, keeping an eye on the functions I'm interested in, and checking they're compiling to the kind of loops I'd write by hand. This is particularly useful for the kinds of tight numeric loops used in some of the shootout entries. Cheers, Don In that case could you describe the kind of loops you'd write by hand? See below for the pseudo-code loop and the Haskell version. Seriously. And perhaps typical problems/fixes when the compiler doesn't produce what you want. We don't have any fixes. Thanks, Joel More discussion and code is at http://haskell.org/hawiki/NbodyEntry The compiler produces code that runs 4 times slower than OCaml in our current best attempt at programming against a 40 element (IOUArray Int Double). The final programs speed is very architecture dependent, but more frustrating is that small referentially transparent changes to the source code produce up to factor-of-two fluctuations in run time. The small numeric functions in the shootout, where there is a recursive function with 1 or 2 parameters (Double's), perform quite well. But manipulating this medium number of Double's to model the solar system has been too slow. The main loop for the 5 planets looks quite simple in pseudo-c: deltaTime = 0.01 for (i=0 ; i5; ++i) { get mass m, position (x,y,z), velocity (vx,vy,vz) of particle number i for (j=(i+1); j5; ++j) { get mass, position, velocity of particle j dxyx = position of i - position of j mag = deltaTime /(length of dxyz)^3 velocity of j += mass of i * mag * dxyz velocity of i -= mass of j * mag * dxyz } position of i += deltaTime * velocity of i } Note that the inner loop for j starts a j=(i+1). The best performing Haskell code, for this loop, so far is: -- Offsets for each field x = 0; y = 1; z = 2; vx= 3; vy= 4; vz= 5; m = 6 -- This is the main code. Essentially all the time is spent here advance n = when (n 0) $ updateVel 0 advance (pred n) where updateVel i = when (i = nbodies) $ do let i' = (.|. shift i 3) im - unsafeRead b (i' m) ix - unsafeRead b (i' x) iy - unsafeRead b (i' y) iz - unsafeRead b (i' z) ivx - unsafeRead b (i' vx) ivy - unsafeRead b (i' vy) ivz - unsafeRead b (i' vz) let updateVel' ivx ivy ivz j = ivx `seq` ivy `seq` ivz `seq` if j nbodies then do unsafeWrite b (i' vx) ivx unsafeWrite b (i' vy) ivy unsafeWrite b (i' vz) ivz else do let j' = (.|. shiftL j 3) jm - unsafeRead b (j' m) dx - liftM (ix-) (unsafeRead b (j' x)) dy - liftM (iy-) (unsafeRead b (j' y)) dz - liftM (iz-) (unsafeRead b (j' z)) let distance = sqrt (dx*dx+dy*dy+dz*dz) mag = 0.01 / (distance * distance * distance) addScaled3 (3 .|. (shiftL j 3)) ( im*mag) dx dy dz let a = -jm*mag ivx' = ivx+a*dx ivy' = ivy+a*dy ivz' = ivz+a*dz updateVel' ivx' ivy' ivz' $! (j+1) updateVel' ivx ivy ivz $! (i+1) addScaled (shiftL i 3) 0.01 (3 .|. (shiftL i 3)) updateVel (i+1) -- Helper functions addScaled i a j | i `seq` a `seq` j `seq` False = undefined -- stricitfy addScaled i a j = do set i1 = liftM2 scale (unsafeRead b i1) (unsafeRead b j1) set i2 = liftM2 scale (unsafeRead b i2) (unsafeRead b j2) set i3 = liftM2 scale (unsafeRead b i3) (unsafeRead b j3) where scale old new = old + a * new i1 = i; i2 = succ i1; i3 = succ i2; j1 = j; j2 = succ j1; j3 = succ j2; addScaled3 i a jx jy jz | i `seq` a `seq` jx `seq` jy `seq` jz `seq` False = undefined addScaled3 i a jx jy jz = do set i1 = liftM (scale jx) (unsafeRead b i1) set i2 = liftM (scale jy) (unsafeRead b i2) set i3 = liftM (scale jz) (unsafeRead b i3) where scale new old = a * new + old i1 = i; i2 = succ i1; i3 = succ i2; ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Known Unknowns
Donald Bruce Stewart wrote: haskell: There is a new combined benchmark, partial sums that subsumes several earlier benchmarks and runs 9 different numerical calculations: http://haskell.org/hawiki/PartialSumsEntry Ah! I had an entry too. I've posted it on the wiki. I was careful to watch that all loops are compiled into nice unboxed ones in the Core. It seems to run a little bit faster than your more abstracted code. Timings on the page. Also, -fasm seems to only be a benefit on the Mac, as you've pointed out previously. Maybe you could check the times on the Mac too? -- Don Yeah. I had not tried all the compiler options. Using -fasm is slower on this for me as well. I suspect that since your code will beat the entries that have been posted so far, so I thin you should submit it. Also, could you explain how to check the Core (un)boxing in a note on the (new?) wiki? I would be interested in learning that trick. -- Chris ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Known Unknowns
haskell: Donald Bruce Stewart wrote: haskell: There is a new combined benchmark, partial sums that subsumes several earlier benchmarks and runs 9 different numerical calculations: http://haskell.org/hawiki/PartialSumsEntry Ah! I had an entry too. I've posted it on the wiki. I was careful to watch that all loops are compiled into nice unboxed ones in the Core. It seems to run a little bit faster than your more abstracted code. Timings on the page. Also, -fasm seems to only be a benefit on the Mac, as you've pointed out previously. Maybe you could check the times on the Mac too? -- Don Yeah. I had not tried all the compiler options. Using -fasm is slower on this for me as well. I suspect that since your code will beat the entries that have been posted so far, so I thin you should submit it. ok, I'll submit it. Also, could you explain how to check the Core (un)boxing in a note on the (new?) wiki? I would be interested in learning that trick. Ah, i just do: ghc A.hs -O2 -ddump-simpl | less and then read the Core, keeping an eye on the functions I'm interested in, and checking they're compiling to the kind of loops I'd write by hand. This is particularly useful for the kinds of tight numeric loops used in some of the shootout entries. Cheers, Don ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Known Unknowns
haskell: There is a new combined benchmark, partial sums that subsumes several earlier benchmarks and runs 9 different numerical calculations: http://haskell.org/hawiki/PartialSumsEntry Ah! I had an entry too. I've posted it on the wiki. I was careful to watch that all loops are compiled into nice unboxed ones in the Core. It seems to run a little bit faster than your more abstracted code. Timings on the page. Also, -fasm seems to only be a benefit on the Mac, as you've pointed out previously. Maybe you could check the times on the Mac too? -- Don ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe