Re: Removing/deprecating -fvia-c
On 14/02/2010 17:58, Don Stewart wrote: igloo: Hi all, We are planning to remove the -fvia-c way of compiling code (unregisterised compilers will continue to compile via C only, but registerised compilers will only use the native code generator). We'll probably deprecate -fvia-c in the 6.14 branch, and remove it in 6.16. Simon Marlow has recently fixed FP performance for modern x86 chips in the native code generator in the HEAD. That was the last reason we know of to prefer via-C to the native code generators. But before we start the removal process, does anyone know of any other problems with the native code generators that need to be fixed first? Do we have the blessing of the DPH team, wrt. tight, numeric inner loops? As recently as last year -fvia-C -optc-O3 was still useful for some microbenchmarks -- what's changed in that time, or is expected to change? If you have benchmarks that show a significant difference, I'd be interested to see them. What I've done for 6.14.1 is to add the -msse2 flag to the x86 backend, so where previously we had to use -fvia-C -fexcess-precision -optc-O3 etc. to get reasonable floating point performance, now we can use -msse2 with the native code gen and get about the same results. In the future we have a couple of ways that things could get better: 1. The new back-end, which eventually will incorporate more optimisations at the C-- level, and potentially could produce good loop code. It will also free up some registers. 2. Compiling via LLVM. Dropping the C backend will give us more flexibility with calling conventions, letting us use more of the x86 registers for passing arguments. We can only make this change by removing -fvia-C, though. There's low hanging fruit here particularly for the x86 backend, as soon as we drop -fvia-C. There are other reasons to want to get rid of -fvia-C: - it doubles the testing surface - it's associated with a bucketload of grotesque Perl 4 code and gcc-specific hacks in the RTS headers. Cheers, Simon ___ Glasgow-haskell-users mailing list Glasgow-haskell-users@haskell.org http://www.haskell.org/mailman/listinfo/glasgow-haskell-users
-package
I've imported in my module: import Control.Concurrent.STM. My question is: Have I to use -package stm in ghc command line options? Thanks in advance. Luca. ___ Glasgow-haskell-users mailing list Glasgow-haskell-users@haskell.org http://www.haskell.org/mailman/listinfo/glasgow-haskell-users
Re: -package
Hello Luca, Monday, February 15, 2010, 8:01:11 PM, you wrote: I've imported in my module: import Control.Concurrent.STM. My question is: Have I to use -package stm in ghc command line options? either use -package stm or --make. later automatically imports all required packages -- Best regards, Bulatmailto:bulat.zigans...@gmail.com ___ Glasgow-haskell-users mailing list Glasgow-haskell-users@haskell.org http://www.haskell.org/mailman/listinfo/glasgow-haskell-users
Re: Removing/deprecating -fvia-c
Hi, the things I am interested in are: foldU f init . mapU (\k - array_1 !: (i,k) `combine` array_2 !: (k,j)) $ enumFromToU i j where (!:) = fancy_index_op with both `vector` and `uvector` (then D.V.fold etc). Since ghc 6.12 there has been no significant difference in using either backend. Then again, more time is spent indexing than optimising the tight loop. Viele Gruesse, Christian Hoener zu Siederdissen From: Ian Lynagh ig...@earth.li Hi all, We are planning to remove the -fvia-c way of compiling code (unregisterised compilers will continue to compile via C only, but registerised compilers will only use the native code generator). We'll probably deprecate -fvia-c in the 6.14 branch, and remove it in 6.16. Simon Marlow has recently fixed FP performance for modern x86 chips in the native code generator in the HEAD. That was the last reason we know of to prefer via-C to the native code generators. But before we start the removal process, does anyone know of any other problems with the native code generators that need to be fixed first? Thanks Ian pgpUTSikzBK8H.pgp Description: PGP signature ___ Glasgow-haskell-users mailing list Glasgow-haskell-users@haskell.org http://www.haskell.org/mailman/listinfo/glasgow-haskell-users
Re: Removing/deprecating -fvia-c
marlowsd: Simon Marlow has recently fixed FP performance for modern x86 chips in the native code generator in the HEAD. That was the last reason we know of to prefer via-C to the native code generators. But before we start the removal process, does anyone know of any other problems with the native code generators that need to be fixed first? Do we have the blessing of the DPH team, wrt. tight, numeric inner loops? As recently as last year -fvia-C -optc-O3 was still useful for some microbenchmarks -- what's changed in that time, or is expected to change? If you have benchmarks that show a significant difference, I'd be interested to see them. I've attached an example where there's a 40% variation (and it's a floating point benchmark). Roman would be seeing similar examples in the vector code. I'm all in favor of dropping the C backend, but I'm also wary that we don't have benchmarks to know what difference it is making. Here's a simple program testing a tight, floating point loop: import Data.Array.Vector import Data.Complex main = print . sumU $ replicateU (10 :: Int) (1 :+ 1 ::Complex Double) Compiled with ghc 6.12, uvector-0.1.1.0 on a 64 bit linux box. The -fvia-C -optc-O3 is about 40% faster than -fasm. How does it fair with the new sse patches? I've attached the assembly below for each case.. -- Don Fastest. 2.17s. About 40% faster than -fasm $ time ./sum-complex 1.0e9 :+ 1.0e9 ./sum-complex 2.16s user 0.00s system 99% cpu 2.175 total Main_mainzuzdszdwfold_info: leaq32(%r12), %rax movq%r12, %rdx cmpq144(%r13), %rax movq%rax, %r12 ja .L4 cmpq$10, %r14 je .L9 .L5: movsd .LC0(%rip), %xmm0 leaq1(%r14), %r14 addsd %xmm0, %xmm5 addsd %xmm0, %xmm6 movq%rdx, %r12 jmp Main_mainzuzdszdwfold_info .L4: leaq-24(%rbp), %rax movq$32, 184(%r13) movq%rax, %rbp movq%r14, (%rax) movsd %xmm5, 8(%rax) movsd %xmm6, 16(%rax) movl$Main_mainzuzdszdwfold_closure, %ebx jmp *-8(%r13) .L9: movq$ghczmprim_GHCziTypes_Dzh_con_info, -24(%rax) movsd %xmm5, -16(%rax) movq$ghczmprim_GHCziTypes_Dzh_con_info, -8(%rax) leaq25(%rdx), %rbx movsd %xmm6, 32(%rdx) leaq9(%rdx), %r14 jmp *(%rbp) Second, 2.34s $ ghc-core sum-complex.hs -O2 -fvia-C -optc-O3 $ time ./sum-complex 1.0e9 :+ 1.0e9 ./sum-complex 2.33s user 0.01s system 99% cpu 2.347 total Main_mainzuzdszdwfold_info: leaq32(%r12), %rax cmpq144(%r13), %rax movq%r12, %rdx movq%rax, %r12 ja .L4 cmpq$1, %r14 je .L9 .L5: movsd .LC0(%rip), %xmm0 leaq1(%r14), %r14 movq%rdx, %r12 addsd %xmm0, %xmm5 addsd %xmm0, %xmm6 jmp Main_mainzuzdszdwfold_info .L4: leaq-24(%rbp), %rax movq$32, 184(%r13) movl$Main_mainzuzdszdwfold_closure, %ebx movsd %xmm5, 8(%rax) movq%rax, %rbp movq%r14, (%rax) movsd %xmm6, 16(%rax) jmp *-8(%r13) .L9: movq$ghczmprim_GHCziTypes_Dzh_con_info, -24(%rax) movsd %xmm5, -16(%rax) movq$ghczmprim_GHCziTypes_Dzh_con_info, -8(%rax) leaq25(%rdx), %rbx movsd %xmm6, 32(%rdx) leaq9(%rdx), %r14 jmp *(%rbp) Native codegen, 3.57s ghc 6.12 -fasm -O2 $ time ./sum-complex 1.0e9 :+ 1.0e9 ./sum-complex 3.57s user 0.01s system 99% cpu 3.574 total Main_mainzuzdszdwfold_info: .Lc1i7: addq $32,%r12 cmpq 144(%r13),%r12 ja .Lc1ia movq %r14,%rax cmpq $1,%rax jne .Lc1id movq $ghczmprim_GHCziTypes_Dzh_con_info,-24(%r12) movsd %xmm5,-16(%r12) movq $ghczmprim_GHCziTypes_Dzh_con_info,-8(%r12) movsd %xmm6,(%r12) leaq -7(%r12),%rbx leaq -23(%r12),%r14 jmp *(%rbp) .Lc1ia: movq $32,184(%r13) movl $Main_mainzuzdszdwfold_closure,%ebx addq $-24,%rbp movq %r14,(%rbp) movsd %xmm5,8(%rbp) movsd %xmm6,16(%rbp) jmp *-8(%r13) .Lc1id: movsd %xmm6,%xmm0 addsd .Ln1if(%rip),%xmm0 movsd %xmm5,%xmm7 addsd .Ln1ig(%rip),%xmm7 leaq 1(%rax),%r14 movsd %xmm7,%xmm5 movsd %xmm0,%xmm6 addq $-32,%r12 jmp Main_mainzuzdszdwfold_info ___
Re: Removing/deprecating -fvia-c
Am Montag 15 Februar 2010 17:37:55 schrieb Simon Marlow: On 14/02/2010 17:58, Don Stewart wrote: igloo: Hi all, We are planning to remove the -fvia-c way of compiling code (unregisterised compilers will continue to compile via C only, but registerised compilers will only use the native code generator). We'll probably deprecate -fvia-c in the 6.14 branch, and remove it in 6.16. Simon Marlow has recently fixed FP performance for modern x86 chips in the native code generator in the HEAD. That was the last reason we know of to prefer via-C to the native code generators. But before we start the removal process, does anyone know of any other problems with the native code generators that need to be fixed first? Do we have the blessing of the DPH team, wrt. tight, numeric inner loops? As recently as last year -fvia-C -optc-O3 was still useful for some microbenchmarks -- what's changed in that time, or is expected to change? If you have benchmarks that show a significant difference, I'd be interested to see them. I have a benchmark (or a couple) from the Beginners mailing list two weeks ago (thread starting in January at http://www.haskell.org/pipermail/beginners/2010-January/003356.html and continued in February at http://www.haskell.org/pipermail/beginners/2010-February/003373.html ff) which show a significant difference. Loop.hs: {-# LANGUAGE BangPatterns #-} module Main (main) where main :: IO () main = do putStrLn EPS: eps - readLn :: IO Double let !mx = (4/eps) !pi14 = pisum mx putStrLn $ PI mit EPS ++(show eps)++ = ++ show(4*pi14) pisum :: Double - Double pisum cut = go True 1 0 where go b n s | cut n = if b then s+1/(2*n) else s-1/(2*n) go True n !s = go False (n+2) (s+recip n) go False n !s = go True (n+2) (s-recip n) $ echo '1e-8' | time ./Loop ghc -O2 --make: 4.53s ghc -O2 -fexcess-precision --make: 4.54s ghc -O2 -fvia-C -optc-O3 --make: 7.52s ghc -O2 -fvia-C -optc-O3 -optc-ffast-math --make: 7.53s ghc -O2 -fvia-C -optc-O3 -optc-ffast-math -optc-fno-float-store --make: 3.02s ghc -O2 -fvia-C -optc-O3 -optc-fno-float-store --make: 3.02s ghc -O2 -fexcess-precision -fvia-C -optc-O3 --make: 3.02s The loop coded in C and compiled with gcc -O3 [-ffast-math, -fno-float- store, -msse2 make no difference there] also takes 3.02s (gcc-4.3.2), 2.70s with icc -O3 (icc 11.0). It is probably worth pointing out, however, that on Markus Böhm's box running Windows XP, the native code generator produced better code than the via-C route (NCG code was faster there than on my box [openSUSE 11.1], while -O2 -fexcess-precision -fvia-C -optc-O3 on his box was slower than NCG on mine). Similar results for Fusion.hs (uses stream-fusion package) module Main (main) where import qualified Data.List.Stream as S main :: IO () main = do putStrLn EPS: eps - readLn :: IO Double let !mx = floor (4/eps) !k = (mx+1) `quot` 2 putStrLn $ PI mit EPS ++ (show eps) ++ = ++ show (leibniz k) leibniz n = (4 *) $ S.sum $ S.take n step step :: [Double] step = S.unfoldr phi (True,1) where phi (sig,d) | sig = Just (1/d, (False,d+2)) | otherwise = Just (negate (1/d), (True,d+2)) ghc -O2 [-fexcess-precision] --make: 4.22s ghc -O2 -fexcess-precision -fvia-C -optc-O3 --make: 3.02s Using lists instead of loops, List.hs module Main (main) where import Data.List (unfoldr) main :: IO () main = do putStrLn EPS: eps - readLn :: IO Double let mx = floor (4/eps) !k = (mx+1) `quot` 2 putStrLn $ PI mit EPS ++ (show eps) ++ = ++ show (leibniz k) leibniz n = (4 *) $ sum $ take n step step :: [Double] step = unfoldr phi (True,1) where phi (sig,d) | sig = Just (1/d, (False,d+2)) | otherwise = Just (negate (1/d), (True,d+2)) things are much slower, 23.60s vs. 18.15s, but the via-C route is again significantly faster. What I've done for 6.14.1 is to add the -msse2 flag to the x86 backend, so where previously we had to use -fvia-C -fexcess-precision -optc-O3 etc. to get reasonable floating point performance, now we can use -msse2 with the native code gen and get about the same results. Can I test whether I get about the same results as with -fvia-C ... for the above? I.e., is it in the HEAD, and would I have to pass -msse2 on the command line or is it implied by -O2 already? In the future we have a couple of ways that things could get better: 1. The new back-end, which eventually will incorporate more optimisations at the C-- level, and potentially could produce good loop code. It will also free up some registers. 2. Compiling via LLVM. Dropping the C
Re: Removing/deprecating -fvia-c
dons: marlowsd: Simon Marlow has recently fixed FP performance for modern x86 chips in the native code generator in the HEAD. That was the last reason we know of to prefer via-C to the native code generators. But before we start the removal process, does anyone know of any other problems with the native code generators that need to be fixed first? Do we have the blessing of the DPH team, wrt. tight, numeric inner loops? As recently as last year -fvia-C -optc-O3 was still useful for some microbenchmarks -- what's changed in that time, or is expected to change? If you have benchmarks that show a significant difference, I'd be interested to see them. I've attached an example where there's a 40% variation (and it's a floating point benchmark). Roman would be seeing similar examples in the vector code. Here's an example that doesn't use floating point: import Data.Array.Vector import Data.Bits main = print . sumU $ zipWith3U (\x y z - x * y * z) (enumFromToU 1 (1 :: Int)) (enumFromToU 2 (10001 :: Int)) (enumFromToU 7 (10008 :: Int)) In core: main_$s$wfold :: Int# - Int# - Int# - Int# - Int# main_$s$wfold = \ (sc_s1l1 :: Int#) (sc1_s1l2 :: Int#) (sc2_s1l3 :: Int#) (sc3_s1l4 :: Int#) - case # sc2_s1l3 1 of _ { False - case # sc1_s1l2 10001 of _ { False - case # sc_s1l1 10008 of _ { False - main_$s$wfold (+# sc_s1l1 1) (+# sc1_s1l2 1) (+# sc2_s1l3 1) (+# sc3_s1l4 (*# (*# sc2_s1l3 sc1_s1l2) sc_s1l1)); True - sc3_s1l4 }; True - sc3_s1l4 }; True - sc3_s1l4 } Rather nice! -fvia-C -optc-O3 Main_mainzuzdszdwfold_info: cmpq$1, %rdi jg .L6 cmpq$10001, %rsi jg .L6 cmpq$10008, %r14 jle .L10 .L6: movq%r8, %rbx movq(%rbp), %rax jmp *%rax .L10: movq%rsi, %r10 leaq1(%rsi), %rsi imulq %rdi, %r10 leaq1(%rdi), %rdi imulq %r14, %r10 leaq1(%r14), %r14 leaq(%r10,%r8), %r8 jmp Main_mainzuzdszdwfold_info Which looks ok. $ time ./zipwith3 3541230156834269568 ./zipwith3 0.33s user 0.00s system 99% cpu 0.337 total And -fasm we get very different code, and a bit of a slowdown: Main_mainzuzdszdwfold_info: .Lc1mo: cmpq $1,%rdi jg .Lc1mq cmpq $10001,%rsi jg .Lc1ms cmpq $10008,%r14 jg .Lc1mv movq %rsi,%rax imulq %r14,%rax movq %rdi,%rcx imulq %rax,%rcx movq %r8,%rax addq %rcx,%rax leaq 1(%rdi),%rcx leaq 1(%rsi),%rdx incq %r14 movq %rdx,%rsi movq %rcx,%rdi movq %rax,%r8 jmp Main_mainzuzdszdwfold_info .Lc1mq: movq %r8,%rbx jmp *(%rbp) .Lc1ms: movq %r8,%rbx jmp *(%rbp) .Lc1mv: movq %r8,%rbx jmp *(%rbp) Slower: $ time ./zipwith3 3541230156834269568 ./zipwith3 0.38s user 0.00s system 98% cpu 0.384 total Now maybe we need to wait on the new backend optimizations to get there? -- Don ___ Glasgow-haskell-users mailing list Glasgow-haskell-users@haskell.org http://www.haskell.org/mailman/listinfo/glasgow-haskell-users
Re: Quasi quoting
marlowsd: On 03/02/2010 15:39, Simon Peyton-Jones wrote: | Or we could switch to different quotation brackets altogether for | quasiquotation, the obvious possibility being|...blah...|, and | pads|...blah...|. That would not be hard, and would only affect the | handful of current quasiquote users. But it'd remove | and | as a | valid operators, at least for quasiquote customers. I don't know how bad | that would be. | | Good brackets are scarce. I'd prefer to stick with one of the many | fine variations on [|...|] currently being discussed. I agree with this. My gut feel is to stick with [| ..|] and variants, and live with the fact that TH and QQ aren't using them in quite the same way. Why not provide some nice Unicode version too? ⟦ .. ⟧ ⟪ .. ⟫ ⦃ .. ⦄ etc. OH! That looks very nice. ___ Glasgow-haskell-users mailing list Glasgow-haskell-users@haskell.org http://www.haskell.org/mailman/listinfo/glasgow-haskell-users