On Monday 15 February 2010 17:37:55, Simon Marlow wrote:
> On 14/02/2010 17:58, Don Stewart wrote:
> > igloo:
> >> Hi all,
> >>
> >> We are planning to remove the -fvia-c way of compiling code
> >> (unregisterised compilers will continue to compile via C only, but
> >> registerised compilers will only use the native code generator).
> >> We'll probably deprecate -fvia-c in the 6.14 branch, and remove it in
> >> 6.16.
> >>
> >> Simon Marlow has recently fixed FP performance for modern x86 chips
> >> in the native code generator in the HEAD.  That was the last reason we
> >> know of to prefer via-C to the native code generators.  But before we
> >> start the removal process, does anyone know of any other problems
> >> with the native code generators that need to be fixed first?
> >
> > Do we have the blessing of the DPH team, wrt. tight, numeric inner
> > loops?
> >
> > As recently as last year -fvia-C -optc-O3 was still useful for some
> > microbenchmarks -- what's changed in that time, or is expected to
> > change?
>
> If you have benchmarks that show a significant difference, I'd be
> interested to see them.
I have a benchmark (or a couple) from the Beginners mailing list two weeks
ago which shows a significant difference (thread starting in January at
http://www.haskell.org/pipermail/beginners/2010-January/003356.html
and continued in February at
http://www.haskell.org/pipermail/beginners/2010-February/003373.html ff).

Loop.hs:

========================================
{-# LANGUAGE BangPatterns #-}
module Main (main) where

main :: IO ()
main = do
    putStrLn "EPS: "
    eps <- readLn :: IO Double
    let !mx = 4/eps
        !pi14 = pisum mx
    putStrLn $ "PI mit EPS " ++ show eps ++ " = " ++ show (4*pi14)

pisum :: Double -> Double
pisum cut = go True 1 0
  where
    go b n s
        | cut < n   = if b then s + 1/(2*n) else s - 1/(2*n)
    go True  n !s = go False (n+2) (s + recip n)
    go False n !s = go True  (n+2) (s - recip n)
========================================

$ echo '1e-8' | time ./Loop

ghc -O2 --make: 4.53s
ghc -O2 -fexcess-precision --make: 4.54s
ghc -O2 -fvia-C -optc-O3 --make: 7.52s
ghc -O2 -fvia-C -optc-O3 -optc-ffast-math --make: 7.53s
ghc -O2 -fvia-C -optc-O3 -optc-ffast-math -optc-fno-float-store --make: 3.02s
ghc -O2 -fvia-C -optc-O3 -optc-fno-float-store --make: 3.02s
ghc -O2 -fexcess-precision -fvia-C -optc-O3 --make: 3.02s

The loop coded in C and compiled with gcc -O3 also takes 3.02s (gcc-4.3.2),
and 2.70s with icc -O3 (icc 11.0); -ffast-math, -fno-float-store and -msse2
make no difference there.  (A sketch of such a C transcription is below.)
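For reference, the C version is essentially a direct transcription of
pisum.  A sketch along these lines (reconstructed here, so the I/O handling
and output formatting are only illustrative, not exactly what was timed):

========================================
/* Sketch of a C transcription of pisum; I/O details are illustrative. */
#include <stdio.h>

int main(void)
{
    double eps, cut, n = 1.0, s = 0.0;
    int positive = 1;           /* sign of the next term, like the Bool in go */

    printf("EPS: \n");
    if (scanf("%lf", &eps) != 1)
        return 1;

    cut = 4.0 / eps;
    while (!(cut < n)) {        /* same termination test as go */
        if (positive)
            s += 1.0 / n;
        else
            s -= 1.0 / n;
        positive = !positive;
        n += 2.0;
    }
    /* correction term, matching the `cut < n` case of go */
    s += positive ? 1.0 / (2.0 * n) : -1.0 / (2.0 * n);

    printf("PI mit EPS %g = %.15g\n", eps, 4.0 * s);
    return 0;
}
========================================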
It is probably worth pointing out, however, that on Markus Böhm's box
running Windows XP, the native code generator produced better code than the
via-C route (NCG code was faster there than on my box [openSUSE 11.1],
while -O2 -fexcess-precision -fvia-C -optc-O3 on his box was slower than
NCG on mine).

Similar results for Fusion.hs (uses the stream-fusion package):

========================================
{-# LANGUAGE BangPatterns #-}
module Main (main) where

import qualified Data.List.Stream as S

main :: IO ()
main = do
    putStrLn "EPS: "
    eps <- readLn :: IO Double
    let !mx = floor (4/eps)
        !k = (mx+1) `quot` 2
    putStrLn $ "PI mit EPS " ++ show eps ++ " = " ++ show (leibniz k)

leibniz n = (4 *) $ S.sum $ S.take n step

step :: [Double]
step = S.unfoldr phi (True,1)
  where
    phi (sig,d)
        | sig       = Just (1/d, (False,d+2))
        | otherwise = Just (negate (1/d), (True,d+2))
========================================

ghc -O2 [-fexcess-precision] --make: 4.22s
ghc -O2 -fexcess-precision -fvia-C -optc-O3 --make: 3.02s

Using lists instead of loops, List.hs:

========================================
{-# LANGUAGE BangPatterns #-}
module Main (main) where

import Data.List (unfoldr)

main :: IO ()
main = do
    putStrLn "EPS: "
    eps <- readLn :: IO Double
    let mx = floor (4/eps)
        !k = (mx+1) `quot` 2
    putStrLn $ "PI mit EPS " ++ show eps ++ " = " ++ show (leibniz k)

leibniz n = (4 *) $ sum $ take n step

step :: [Double]
step = unfoldr phi (True,1)
  where
    phi (sig,d)
        | sig       = Just (1/d, (False,d+2))
        | otherwise = Just (negate (1/d), (True,d+2))
========================================

things are much slower (23.60s vs. 18.15s), but the via-C route is again
significantly faster.

> What I've done for 6.14.1 is to add the -msse2 flag to the x86 backend,
> so where previously we had to use -fvia-C -fexcess-precision -optc-O3
> etc. to get reasonable floating point performance, now we can use -msse2
> with the native code gen and get about the same results.

Can I test whether I get about the same results as with -fvia-C ... for the
above?  I.e., is it in the HEAD, and would I have to pass -msse2 on the
command line, or is it implied by -O2 already?

> In the future we have a couple of ways that things could get better:
>
> 1. The new back-end, which eventually will incorporate more
>    optimisations at the C-- level, and potentially could produce
>    good loop code.  It will also free up some registers.
>
> 2. Compiling via LLVM.
>
> Dropping the C backend will give us more flexibility with calling
> conventions, letting us use more of the x86 registers for passing
> arguments.  We can only make this change by removing -fvia-C, though.
> There's low hanging fruit here particularly for the x86 backend, as soon
> as we drop -fvia-C.
>
> There are other reasons to want to get rid of -fvia-C:
>
>  - it doubles the testing surface
>
>  - it's associated with a bucketload of grotesque Perl 4 code and
>    gcc-specific hacks in the RTS headers.
>
> Cheers,
>       Simon

_______________________________________________
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users