joelkoerwer: > > Thanks Chris. I was actually asking about analyzing Core > output in general. I'm well aware of the problems we're > having with the nbody entry. > I'm convinced my list based version can go faster than it is > now. That's why I was asking if Don could put together a few > notes on how to optimize inner loops using -ddump-simpl and > the resulting Core code.
Here's a brief introduction. I intend to write up (on the performance page on the wiki) a list of things we've done to improve the shootout entries. N.B we're now the 3rd *fastest* language, behind C and only a little behind D (a C varient) !! Consider the partial sums problem: wiki: http://www.haskell.org/hawiki/PartialSumsEntry site: http://shootout.alioth.debian.org/gp4/benchmark.php?test=partialsums&lang=ghc&id=2 What follows is a discussion of the steps I took to improve the performance of this code. Here's the naive translation of the Clean entry (which was fairly quick): Lots of math in a tight loop. > import System; import Numeric > > main = do n <- getArgs >>= readIO . head > let sums = loop 1 n 1 0 0 0 0 0 0 0 0 0 > fn (s,t) = putStrLn $ (showFFloat (Just 9) s []) ++ "\t" ++ t > mapM_ (fn :: (Double, String) -> IO ()) (zip sums names) > > names = ["(2/3)^k", "k^-0.5", "1/k(k+1)", "Flint Hills", "Cookson Hills" > , "Harmonic", "Riemann Zeta", "Alternating Harmonic", "Gregory"] > > loop k n alt a1 a2 a3 a4 a5 a6 a7 a8 a9 > | k > n = [ a1, a2, a3, a4, a5, a6, a7, a8, a9 ] > | otherwise = loop (k+1) n (-alt) > (a1 + (2/3) ** (k-1)) > (a2 + k ** (-0.5)) > (a3 + 1 / (k * (k + 1))) > (a4 + 1 / (k*k*k * sin k * sin k)) > (a5 + 1 / (k*k*k * cos k * cos k)) > (a6 + 1 / k) > (a7 + 1 / (k*k)) > (a8 + alt / k) > (a9 + alt / (2 * k - 1)) Compiled with "-O2". However, the performance is _really_ bad :/ Somewhere greater than 128M heap, in fact eventually running out of memory on my laptop. A classic space leak. (2) So look at the generated core. "ghc -o naive Naive.hs -O2 -ddump-simpl | less" And we find that our loop has the following type: > $sloop_r2U6 :: GHC.Prim.Double# > -> GHC.Float.Double > -> GHC.Float.Double > -> GHC.Float.Double > -> GHC.Float.Double > -> GHC.Float.Double > -> GHC.Float.Double > -> GHC.Float.Double > -> GHC.Float.Double > -> GHC.Float.Double > -> GHC.Float.Double > -> GHC.Prim.Double# > -> [GHC.Float.Double] Hmm. Ok, I certainly don't want boxed doubles in such a tight loop. (3) My next step is to encourage GHC to unbox this loop, by providing some strictness annotations. Now the loop looks like this: > loop k n alt a1 a2 a3 a4 a5 a6 a7 a8 a9 > | () !k !n !False = undefined > | k > n = [ a1, a2, a3, a4, a5, a6, a7, a8, a9 ] > | otherwise = loop (k+1) n (-alt) > (a1 + (2/3) ** (k-1)) > (a2 + k ** (-0.5)) > (a3 + 1 / (k * (k + 1))) > (a4 + 1 / (k*k*k * sin k * sin k)) > (a5 + 1 / (k*k*k * cos k * cos k)) > (a6 + 1 / k) > (a7 + 1 / (k*k)) > (a8 + alt / k) > (a9 + alt / (2 * k - 1)) where x ! y = x `seq` y I've played a little game here, using ! for `seq`, reminiscent of the new !-pattern proposal for strictness. Let's see how this compiles. Here's the Core: > $sloop_r2Vh :: GHC.Prim.Double# > -> GHC.Float.Double > -> GHC.Float.Double > -> GHC.Float.Double > -> GHC.Float.Double > -> GHC.Float.Double > -> GHC.Float.Double > -> GHC.Float.Double > -> GHC.Float.Double > -> GHC.Float.Double > -> GHC.Float.Double > -> GHC.Prim.Double# > -> [GHC.Float.Double] Ok, so it unboxed one extra argument. Let's see if we can get them all unboxed. Strictify all args, and GHC produces an inner loop of: > $sloop_r2WS :: GHC.Prim.Double# > -> GHC.Prim.Double# > -> GHC.Prim.Double# > -> GHC.Prim.Double# > -> GHC.Prim.Double# > -> GHC.Prim.Double# > -> GHC.Prim.Double# > -> GHC.Prim.Double# > -> GHC.Prim.Double# > -> GHC.Prim.Double# > -> GHC.Prim.Double# > -> GHC.Prim.Double# > -> [GHC.Float.Double] Ah! perfect. Let's see how that runs: $ ghc Naive.hs -O2 -no-recomp $ time ./a.out 2500000 3.000000000 (2/3)^k 3160.817621887 k^-0.5 0.999999600 1/k(k+1) 30.314541510 Flint Hills 42.995233998 Cookson Hills 15.309017155 Harmonic 1.644933667 Riemann Zeta 0.693146981 Alternating Harmonic 0.785398063 Gregory ./a.out 2500000 4.45s user 0.02s system 99% cpu 4.482 total (4) Not too bad. No space leak and quite zippy. But let's see what more can be done. First, double arithmetic usually benefits from -fexcess-precision, and cranking up the flags to gcc: paprika$ ghc Naive.hs -O2 -fexcess-precision -optc-O3 -optc-ffast-math -no-recomp paprika$ time ./a.out 2500000 3.000000000 (2/3)^k 3160.817621887 k^-0.5 0.999999600 1/k(k+1) 30.314541510 Flint Hills 42.995233998 Cookson Hills 15.309017155 Harmonic 1.644933667 Riemann Zeta 0.693146981 Alternating Harmonic 0.785398063 Gregory ./a.out 2500000 3.71s user 0.01s system 99% cpu 3.726 total Even better. Now, let's dive into the Core to see if there are any optimisation opportunites that GHC missed. So add -ddump-simpl and peruse the output. (5) Looking at the Core, I see firstly that some of the common subexpressions haven't been factored out: case [GHC.Float.Double] GHC.Prim./## 1.0 (GHC.Prim.*## (GHC.Prim.*## (GHC.Prim.*## (GHC.Prim.*## sc10_s2VS sc10_s2VS) sc10_s2VS) (GHC.Prim.sinDouble# sc10_s2VS)) (GHC.Prim.sinDouble# sc10_s2VS)) Multiple calls to sin. Hmm :/ And similar for cos, as well as k*k. Not sure why GHC isn't removing these (SimonM?), so let's do that by hand, and the inner loop looks like: > loop k n alt a1 a2 a3 a4 a5 a6 a7 a8 a9 > | () !k !n !alt !a1 !a2 !a3 !a4 !a5 !a6 !a7 !a8 !a9 !False = undefined > | k > n = [ a1, a2, a3, a4, a5, a6, a7, a8, a9 ] > | otherwise = loop (k+1) n (-alt) > (a1 + (2/3) ** (k-1)) > (a2 + k ** (-0.5)) > (a3 + 1 / (k * (k + 1))) > (a4 + 1 / (k3 * sk * sk)) > (a5 + 1 / (k3 * ck * ck)) > (a6 + 1 / k) > (a7 + 1 / k2) > (a8 + alt / k) > (a9 + alt / (2 * k - 1)) > where sk = sin k ; ck = cos k; k2 = k * k; k3 = k2 * k; x ! y = x `seq` y looking at the Core shows the sins are now allocated and shared: > let a9_s2MI :: GHC.Prim.Double# > a9_s2MI = GHC.Prim.sinDouble# sc10_s2Xa So the common expressions are floated out, and it now runs: paprika$ time ./a.out 2500000 3.000000000 (2/3)^k 3160.817621887 k^-0.5 0.999999600 1/k(k+1) 30.314541510 Flint Hills 42.995233998 Cookson Hills 15.309017155 Harmonic 1.644933667 Riemann Zeta 0.693146981 Alternating Harmonic 0.785398063 Gregory ./a.out 2500000 3.29s user 0.00s system 99% cpu 3.290 total Faster. So we gained 12% by floating out those common expressions. (6) Finally, another trick. When I checked the C entry, it used an integer for the k parameter to the loop, and cast it to a double for the math each time around, so perhaps we can make it an Int parameter. And secondly, the alt parameter only has it's sign flipped each time, so perhaps we can factor out the alt / k arg (it's either 1 / k or -1 on k), saving a division. So the final loop looks like: > loop i n alt a1 a2 a3 a4 a5 a6 a7 a8 a9 > | () !i !n !alt !a1 !a2 !a3 !a4 !a5 !a6 !a7 !a8 !a9 !False = undefined > | k > n = [ a1, a2, a3, a4, a5, a6, a7, a8, a9 ] > | otherwise = loop (i+1) n (-alt) > (a1 + (2/3) ** (k-1)) > (a2 + k ** (-0.5)) > (a3 + 1 / (k * (k + 1))) > (a4 + 1 / (k3 * sk * sk)) > (a5 + 1 / (k3 * ck * ck)) > (a6 + dk) > (a7 + 1 / k2) > (a8 + alt * dk) > (a9 + alt / (2 * k - 1)) > where k = fromIntegral i > dk = 1/k > sk = sin k > ck = cos k > k2 = k * k > k3 = k2 * k > x ! y = x `seq` y Checking the generated C code shows that the same operations are generated as the C entry. And it runs: $ time ./a.out 2500000 3.000000000 (2/3)^k 3160.817621887 k^-0.5 0.999999600 1/k(k+1) 30.314541510 Flint Hills 42.995233998 Cookson Hills 15.309017155 Harmonic 1.644933667 Riemann Zeta 0.693146981 Alternating Harmonic 0.785398063 Gregory ./a.out 2500000 3.17s user 0.01s system 99% cpu 3.206 total This entry in fact runs faster than the original (though not the new vectorised entry) optimised C entry (and faster than all other languages): http://shootout.alioth.debian.org/gp4/benchmark.php?test=partialsums&lang=all So, by carefully tweaking things, we first squished a space leak, and then gained another 30%. In summary: * Check the Core that is generated * Watch out for optimisations that are missed * Read the generated C for the tight loops. * Make sure tight loops are unboxed * Use -fexcess-precision and -optc-ffast-math for doubles This is roughly the process I used for the other shootout entries. Cheers, Don _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe