Re: Removing/deprecating -fvia-c

2010-02-15 Thread Simon Marlow

On 14/02/2010 17:58, Don Stewart wrote:

igloo:


Hi all,

We are planning to remove the -fvia-c way of compiling code
(unregisterised compilers will continue to compile via C only, but
registerised compilers will only use the native code generator).
We'll probably deprecate -fvia-c in the 6.14 branch, and remove it in
6.16.

Simon Marlow has recently fixed FP performance for modern x86 chips in
the native code generator in the HEAD. That was the last reason we know
of to prefer via-C to the native code generators. But before we start
the removal process, does anyone know of any other problems with the
native code generators that need to be fixed first?



Do we have the blessing of the DPH team, wrt. tight, numeric inner loops?

As recently as last year -fvia-C -optc-O3 was still useful for some
microbenchmarks -- what's changed in that time, or is expected to change?


If you have benchmarks that show a significant difference, I'd be 
interested to see them.


What I've done for 6.14.1 is to add the -msse2 flag to the x86 backend, 
so where previously we had to use -fvia-C -fexcess-precision -optc-O3 
etc. to get reasonable floating point performance, now we can use -msse2 
with the native code gen and get about the same results.
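
For concreteness, the two recipes on a floating-point-heavy module would look 
roughly like this (Foo.hs is just a placeholder name; -msse2 is the NCG flag 
described above, x86 only):

    ghc -O2 -fvia-C -fexcess-precision -optc-O3 Foo.hs    (the old via-C route)
    ghc -O2 -msse2 Foo.hs                                  (native code gen with SSE2)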


In the future we have a couple of ways that things could get better:

 1. The new back-end, which eventually will incorporate more
optimisations at the C-- level, and potentially could produce
good loop code.  It will also free up some registers.

 2. Compiling via LLVM.

Dropping the C backend will give us more flexibility with calling 
conventions, letting us use more of the x86 registers for passing 
arguments; we can only make that change by removing -fvia-C.  There is 
low-hanging fruit here, particularly for the x86 backend, as soon as we 
do.


There are other reasons to want to get rid of -fvia-C:

 - it doubles the testing surface

 - it's associated with a bucketload of grotesque Perl 4 code and
   gcc-specific hacks in the RTS headers.

Cheers,
Simon


-package

2010-02-15 Thread Luca Ciciriello
In my module I've imported Control.Concurrent.STM.

My question is: do I have to pass -package stm in the ghc command-line options?

Thanks in advance.

Luca.


Re: -package

2010-02-15 Thread Bulat Ziganshin
Hello Luca,

Monday, February 15, 2010, 8:01:11 PM, you wrote:

 In my module I've imported Control.Concurrent.STM.
 My question is: do I have to pass -package stm in the ghc command-line options?

Either use -package stm or --make; the latter automatically pulls in all
required packages.
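
For example, a minimal Main.hs that needs the stm package (a sketch; the
module name is just a placeholder):

    import Control.Concurrent.STM

    main :: IO ()
    main = do
        tv <- atomically (newTVar (0 :: Int))
        atomically (readTVar tv >>= writeTVar tv . (+ 1))
        n  <- atomically (readTVar tv)
        print n

Compiling it with ghc --make Main.hs picks up the stm package automatically;
in one-shot mode you would pass -package stm as above.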

-- 
Best regards,
 Bulat    mailto:bulat.zigans...@gmail.com



Re: Removing/deprecating -fvia-c

2010-02-15 Thread Christian Höner zu Siederdissen
Hi,

the things I am interested in are:

foldU f init .
mapU (\k -> array_1 !: (i,k) `combine` array_2 !: (k,j)) $
enumFromToU i j

where (!:) = fancy_index_op

with both `vector` and `uvector` (then D.V.fold etc).
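
Spelled out as a self-contained uvector version, the kernel might look
roughly like this (a sketch only; the fold is specialised to a sum, and
`combine` and the two index functions stand in for the real array
operations above):

    import Data.Array.Vector

    -- Sum over k of array_1 !: (i,k) `combine` array_2 !: (k,j),
    -- as in the fragment above.
    innerLoop :: (Double -> Double -> Double)   -- the `combine` above
              -> ((Int, Int) -> Double)         -- stands in for (array_1 !:)
              -> ((Int, Int) -> Double)         -- stands in for (array_2 !:)
              -> Int -> Int -> Double
    innerLoop combine ix1 ix2 i j =
        foldU (+) 0 . mapU (\k -> ix1 (i, k) `combine` ix2 (k, j)) $ enumFromToU i j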

Since ghc 6.12 there has been no significant difference between the two
backends. Then again, more time is spent on the indexing than in the
tight loop itself, so there is less for the backend to optimise.

Best regards,
Christian Hoener zu Siederdissen

From: Ian Lynagh <ig...@earth.li>
 
 
 Hi all,
 
 We are planning to remove the -fvia-c way of compiling code
 (unregisterised compilers will continue to compile via C only, but
 registerised compilers will only use the native code generator).
 We'll probably deprecate -fvia-c in the 6.14 branch, and remove it in
 6.16.
 
 Simon Marlow has recently fixed FP performance for modern x86 chips in
 the native code generator in the HEAD. That was the last reason we know
 of to prefer via-C to the native code generators. But before we start
 the removal process, does anyone know of any other problems with the
 native code generators that need to be fixed first?
 
 
 Thanks
 Ian
 
 




Re: Removing/deprecating -fvia-c

2010-02-15 Thread Don Stewart
marlowsd:

 Simon Marlow has recently fixed FP performance for modern x86 chips in
 the native code generator in the HEAD. That was the last reason we know
 of to prefer via-C to the native code generators. But before we start
 the removal process, does anyone know of any other problems with the
 native code generators that need to be fixed first?


 Do we have the blessing of the DPH team, wrt. tight, numeric inner loops?

 As recently as last year -fvia-C -optc-O3 was still useful for some
 microbenchmarks -- what's changed in that time, or is expected to change?

 If you have benchmarks that show a significant difference, I'd be  
 interested to see them.

I've attached an example where there's a 40% variation (and it's a
floating point benchmark). Roman would be seeing similar examples in the
vector code.

I'm all in favor of dropping the C backend, but I'm also wary that we
don't have benchmarks to know what difference it is making.

Here's a simple program testing a tight, floating point loop:

import Data.Array.Vector
import Data.Complex

main = print . sumU $ replicateU (1000000000 :: Int) (1 :+ 1 :: Complex Double)

Compiled with ghc 6.12 and uvector-0.1.1.0 on a 64-bit Linux box.

The -fvia-C -optc-O3 version is about 40% faster than -fasm.
How does it fare with the new SSE patches?

I've attached the assembly below for each case.

-- Don




Fastest. 2.17s. About 40% faster than -fasm

$ time ./sum-complex 
1.0e9 :+ 1.0e9
./sum-complex  2.16s user 0.00s system 99% cpu 2.175 total

Main_mainzuzdszdwfold_info:
        leaq    32(%r12), %rax
        movq    %r12, %rdx
        cmpq    144(%r13), %rax
        movq    %rax, %r12
        ja      .L4
        cmpq    $10, %r14
        je      .L9
.L5:
        movsd   .LC0(%rip), %xmm0
        leaq    1(%r14), %r14
        addsd   %xmm0, %xmm5
        addsd   %xmm0, %xmm6
        movq    %rdx, %r12
        jmp     Main_mainzuzdszdwfold_info

.L4:
        leaq    -24(%rbp), %rax
        movq    $32, 184(%r13)
        movq    %rax, %rbp
        movq    %r14, (%rax)
        movsd   %xmm5, 8(%rax)
        movsd   %xmm6, 16(%rax)
        movl    $Main_mainzuzdszdwfold_closure, %ebx
        jmp     *-8(%r13)
.L9:
        movq    $ghczmprim_GHCziTypes_Dzh_con_info, -24(%rax)
        movsd   %xmm5, -16(%rax)
        movq    $ghczmprim_GHCziTypes_Dzh_con_info, -8(%rax)
        leaq    25(%rdx), %rbx
        movsd   %xmm6, 32(%rdx)
        leaq    9(%rdx), %r14
        jmp     *(%rbp)



Second, 2.34s

$ ghc-core sum-complex.hs -O2 -fvia-C -optc-O3
$ time ./sum-complex
1.0e9 :+ 1.0e9
./sum-complex  2.33s user 0.01s system 99% cpu 2.347 total

Main_mainzuzdszdwfold_info:
        leaq    32(%r12), %rax
        cmpq    144(%r13), %rax
        movq    %r12, %rdx
        movq    %rax, %r12
        ja      .L4
        cmpq    $1, %r14
        je      .L9
.L5:
        movsd   .LC0(%rip), %xmm0
        leaq    1(%r14), %r14
        movq    %rdx, %r12
        addsd   %xmm0, %xmm5
        addsd   %xmm0, %xmm6
        jmp     Main_mainzuzdszdwfold_info

.L4:
        leaq    -24(%rbp), %rax
        movq    $32, 184(%r13)
        movl    $Main_mainzuzdszdwfold_closure, %ebx
        movsd   %xmm5, 8(%rax)
        movq    %rax, %rbp
        movq    %r14, (%rax)
        movsd   %xmm6, 16(%rax)
        jmp     *-8(%r13)

.L9:
        movq    $ghczmprim_GHCziTypes_Dzh_con_info, -24(%rax)
        movsd   %xmm5, -16(%rax)
        movq    $ghczmprim_GHCziTypes_Dzh_con_info, -8(%rax)
        leaq    25(%rdx), %rbx
        movsd   %xmm6, 32(%rdx)
        leaq    9(%rdx), %r14
        jmp     *(%rbp)



Native codegen, 3.57s

 ghc 6.12 -fasm -O2
 $ time ./sum-complex
 1.0e9 :+ 1.0e9
 ./sum-complex  3.57s user 0.01s system 99% cpu 3.574 total


Main_mainzuzdszdwfold_info:
.Lc1i7:
addq $32,%r12
cmpq 144(%r13),%r12
ja .Lc1ia
movq %r14,%rax
cmpq $1,%rax
jne .Lc1id
movq $ghczmprim_GHCziTypes_Dzh_con_info,-24(%r12)
movsd %xmm5,-16(%r12)
movq $ghczmprim_GHCziTypes_Dzh_con_info,-8(%r12)
movsd %xmm6,(%r12)
leaq -7(%r12),%rbx
leaq -23(%r12),%r14
jmp *(%rbp)
.Lc1ia:
movq $32,184(%r13)
movl $Main_mainzuzdszdwfold_closure,%ebx
addq $-24,%rbp
movq %r14,(%rbp)
movsd %xmm5,8(%rbp)
movsd %xmm6,16(%rbp)
jmp *-8(%r13)
.Lc1id:
movsd %xmm6,%xmm0
addsd .Ln1if(%rip),%xmm0
movsd %xmm5,%xmm7
addsd .Ln1ig(%rip),%xmm7
leaq 1(%rax),%r14
movsd %xmm7,%xmm5
movsd %xmm0,%xmm6
addq $-32,%r12
jmp Main_mainzuzdszdwfold_info



Re: Removing/deprecating -fvia-c

2010-02-15 Thread Daniel Fischer
On Monday, 15 February 2010, 17:37:55, Simon Marlow wrote:
 On 14/02/2010 17:58, Don Stewart wrote:
  igloo:
  Hi all,
 
  We are planning to remove the -fvia-c way of compiling code
  (unregisterised compilers will continue to compile via C only, but
  registerised compilers will only use the native code generator).
  We'll probably deprecate -fvia-c in the 6.14 branch, and remove it in
  6.16.
 
  Simon Marlow has recently fixed FP performance for modern x86 chips
  in the native code generator in the HEAD. That was the last reason we
  know of to prefer via-C to the native code generators. But before we
  start the removal process, does anyone know of any other problems
  with the native code generators that need to be fixed first?
 
  Do we have the blessing of the DPH team, wrt. tight, numeric inner
  loops?
 
  As recently as last year -fvia-C -optc-O3 was still useful for some
  microbenchmarks -- what's changed in that time, or is expected to
  change?

 If you have benchmarks that show a significant difference, I'd be
 interested to see them.

I have a benchmark (or a couple) from the Beginners mailing list two weeks 
ago (thread starting in January at 
http://www.haskell.org/pipermail/beginners/2010-January/003356.html and 
continued in February at 
http://www.haskell.org/pipermail/beginners/2010-February/003373.html ff) 
which shows a significant difference.

Loop.hs:

{-# LANGUAGE BangPatterns #-}
module Main (main) where

main :: IO ()
main = do
    putStrLn "EPS: "
    eps <- readLn :: IO Double
    let !mx   = (4/eps)
        !pi14 = pisum mx
    putStrLn $ "PI mit EPS " ++ (show eps) ++ " = " ++ show (4*pi14)

pisum :: Double -> Double
pisum cut = go True 1 0
  where
    go b n s | cut < n = if b then s+1/(2*n) else s-1/(2*n)
    go True  n !s = go False (n+2) (s+recip n)
    go False n !s = go True  (n+2) (s-recip n)


$ echo '1e-8' | time ./Loop

ghc -O2 --make:
4.53s
ghc -O2 -fexcess-precision --make:
4.54s
ghc -O2 -fvia-C -optc-O3 --make:
7.52s
ghc -O2 -fvia-C -optc-O3 -optc-ffast-math --make:
7.53s
ghc -O2 -fvia-C -optc-O3 -optc-ffast-math -optc-fno-float-store --make:
3.02s
ghc -O2 -fvia-C -optc-O3 -optc-fno-float-store --make:
3.02s
ghc -O2 -fexcess-precision -fvia-C -optc-O3 --make:
3.02s

The loop coded in C and compiled with gcc -O3 [-ffast-math, -fno-float-store,
-msse2 make no difference there] also takes 3.02s (gcc-4.3.2), 2.70s with
icc -O3 (icc 11.0).

It is probably worth pointing out, however, that on Markus Böhm's box 
running Windows XP, the native code generator produced better code than the 
via-C route (NCG code was faster there than on my box [openSUSE 11.1], 
while -O2 -fexcess-precision -fvia-C -optc-O3 on his box was slower than 
NCG on mine).

Similar results for

Fusion.hs (uses stream-fusion package)

{-# LANGUAGE BangPatterns #-}
module Main (main) where

import qualified Data.List.Stream as S

main :: IO ()
main = do
    putStrLn "EPS: "
    eps <- readLn :: IO Double
    let !mx = floor (4/eps)
        !k  = (mx+1) `quot` 2
    putStrLn $ "PI mit EPS " ++ (show eps) ++ " = " ++ show (leibniz k)

leibniz n = (4 *) $ S.sum $ S.take n step

step :: [Double]
step = S.unfoldr phi (True,1) where
   phi (sig,d) | sig       = Just (1/d, (False,d+2))
               | otherwise = Just (negate (1/d), (True,d+2))


ghc -O2 [-fexcess-precision] --make:
4.22s
ghc -O2 -fexcess-precision -fvia-C -optc-O3 --make:
3.02s

Using lists instead of loops,

List.hs

{-# LANGUAGE BangPatterns #-}
module Main (main) where

import Data.List (unfoldr)

main :: IO ()
main = do
    putStrLn "EPS: "
    eps <- readLn :: IO Double
    let mx = floor (4/eps)
        !k = (mx+1) `quot` 2
    putStrLn $ "PI mit EPS " ++ (show eps) ++ " = " ++ show (leibniz k)

leibniz n = (4 *) $ sum $ take n step

step :: [Double]
step = unfoldr phi (True,1) where
   phi (sig,d) | sig       = Just (1/d, (False,d+2))
               | otherwise = Just (negate (1/d), (True,d+2))


things are much slower, 23.60s vs. 18.15s, but the via-C route is again 
significantly faster.


 What I've done for 6.14.1 is to add the -msse2 flag to the x86 backend,
 so where previously we had to use -fvia-C -fexcess-precision -optc-O3
 etc. to get reasonable floating point performance, now we can use -msse2
 with the native code gen and get about the same results.

Can I test whether I get about the same results as with -fvia-C ... for the 
above?
I.e., is it in the HEAD, and would I have to pass -msse2 on the command 
line or is it implied by -O2 already?


 In the future we have a couple of ways that things could get better:

   1. The new back-end, which eventually will incorporate more
  optimisations at the C-- level, and potentially could produce
  good loop code.  It will also free up some registers.

   2. Compiling via LLVM.

 Dropping the C 

Re: Removing/deprecating -fvia-c

2010-02-15 Thread Don Stewart
dons:
 marlowsd:
 
  Simon Marlow has recently fixed FP performance for modern x86 chips in
  the native code generator in the HEAD. That was the last reason we know
  of to prefer via-C to the native code generators. But before we start
  the removal process, does anyone know of any other problems with the
  native code generators that need to be fixed first?
 
 
  Do we have the blessing of the DPH team, wrt. tight, numeric inner loops?
 
  As recently as last year -fvia-C -optc-O3 was still useful for some
  microbenchmarks -- what's changed in that time, or is expected to change?
 
  If you have benchmarks that show a significant difference, I'd be  
  interested to see them.
 
 I've attached an example where there's a 40% variation (and it's a
 floating point benchmark). Roman would be seeing similar examples in the
 vector code.

Here's an example that doesn't use floating point:

import Data.Array.Vector
import Data.Bits

main = print . sumU $ zipWith3U (\x y z -> x * y * z)
                                (enumFromToU 1 (1 :: Int))
                                (enumFromToU 2 (10001 :: Int))
                                (enumFromToU 7 (10008 :: Int))

In core:

main_$s$wfold :: Int# -> Int# -> Int# -> Int# -> Int#
main_$s$wfold =
  \ (sc_s1l1 :: Int#)
    (sc1_s1l2 :: Int#)
    (sc2_s1l3 :: Int#)
    (sc3_s1l4 :: Int#) ->
    case ># sc2_s1l3 1 of _ {
      False ->
        case ># sc1_s1l2 10001 of _ {
          False ->
            case ># sc_s1l1 10008 of _ {
              False ->
                main_$s$wfold
                  (+# sc_s1l1 1)
                  (+# sc1_s1l2 1)
                  (+# sc2_s1l3 1)
                  (+# sc3_s1l4 (*# (*# sc2_s1l3 sc1_s1l2) sc_s1l1));
              True -> sc3_s1l4
            };
          True -> sc3_s1l4
        };
      True -> sc3_s1l4
    }

Rather nice!

-fvia-C -optc-O3

Main_mainzuzdszdwfold_info:
        cmpq    $1, %rdi
        jg      .L6
        cmpq    $10001, %rsi
        jg      .L6
        cmpq    $10008, %r14
        jle     .L10
.L6:
        movq    %r8, %rbx
        movq    (%rbp), %rax
        jmp     *%rax
.L10:
        movq    %rsi, %r10
        leaq    1(%rsi), %rsi
        imulq   %rdi, %r10
        leaq    1(%rdi), %rdi
        imulq   %r14, %r10
        leaq    1(%r14), %r14
        leaq    (%r10,%r8), %r8
        jmp     Main_mainzuzdszdwfold_info

Which looks ok.

$ time ./zipwith3  
3541230156834269568
./zipwith3  0.33s user 0.00s system 99% cpu 0.337 total

And with -fasm we get very different code, and a bit of a slowdown:

Main_mainzuzdszdwfold_info:
.Lc1mo:
cmpq $1,%rdi
jg .Lc1mq
cmpq $10001,%rsi
jg .Lc1ms
cmpq $10008,%r14
jg .Lc1mv

movq %rsi,%rax
imulq %r14,%rax
movq %rdi,%rcx
imulq %rax,%rcx
movq %r8,%rax
addq %rcx,%rax
leaq 1(%rdi),%rcx
leaq 1(%rsi),%rdx
incq %r14
movq %rdx,%rsi
movq %rcx,%rdi
movq %rax,%r8
jmp Main_mainzuzdszdwfold_info

.Lc1mq:
movq %r8,%rbx
jmp *(%rbp)
.Lc1ms:
movq %r8,%rbx
jmp *(%rbp)
.Lc1mv:
movq %r8,%rbx
jmp *(%rbp)

Slower:

$ time ./zipwith3
3541230156834269568
./zipwith3  0.38s user 0.00s system 98% cpu 0.384 total

Now maybe we need to wait on the new backend optimizations to get there?

-- Don


Re: Quasi quoting

2010-02-15 Thread Don Stewart
marlowsd:
 On 03/02/2010 15:39, Simon Peyton-Jones wrote:
 |  Or we could switch to different quotation brackets altogether for
 |  quasiquotation, the obvious possibility being <|...blah...|>, and
 |  <pads|...blah...|>.  That would not be hard, and would only affect the
 |  handful of current quasiquote users.  But it'd remove <| and |> as
 |  valid operators, at least for quasiquote customers.  I don't know how bad
 |  that would be.
 |
 | Good brackets are scarce.  I'd prefer to stick with one of the many
 | fine variations on [|...|] currently being discussed.

 I agree with this.  My gut feel is to stick with [| ..|] and variants, and 
 live with the fact that TH and QQ aren't using them in quite the same way.

 Why not provide some nice Unicode version too? ⟦ .. ⟧  ⟪ .. ⟫  ⦃ .. ⦄ etc.


OH! That looks very nice.
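
For reference, the overlap under discussion looks roughly like this (a
minimal sketch; `str` is a made-up quasiquoter, and as of GHC 6.12 a
quasiquote is still written with a leading $):

    -- Str.hs (a quasiquoter has to live in its own module):
    module Str (str) where

    import Language.Haskell.TH
    import Language.Haskell.TH.Quote

    str :: QuasiQuoter
    str = QuasiQuoter { quoteExp = litE . stringL
                      , quotePat = litP . stringL }

    -- Use.hs: the two kinds of bracket side by side:
    {-# LANGUAGE TemplateHaskell, QuasiQuotes #-}
    module Use where

    import Language.Haskell.TH
    import Str (str)

    thQuote :: Q Exp
    thQuote = [| 1 + 2 |]           -- Template Haskell quotation

    greeting :: String
    greeting = [$str|hello world|]  -- quasiquotation (6.12 syntax)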