On 15/02/2010 18:29, Don Stewart wrote:
marlowsd:
Simon Marlow has recently fixed FP performance for modern x86 chips in
the native code generator in the HEAD. That was the last reason we know
of to prefer via-C to the native code generators. But before we start
the removal process, does anyone know of any other problems with the
native code generators that need to be fixed first?
Do we have the blessing of the DPH team, w.r.t. tight, numeric inner loops?
As recently as last year -fvia-C -optc-O3 was still useful for some
microbenchmarks -- what's changed in that time, or is expected to change?
If you have benchmarks that show a significant difference, I'd be
interested to see them.
I've attached an example where there's a 40% variation (and it's a
floating point benchmark). Roman would be seeing similar examples in the
vector code.
I'm all in favor of dropping the C backend, but I'm also wary that we
don't have benchmarks to know what difference it is making.
Here's a simple program testing a tight, floating point loop:
import Data.Array.Vector
import Data.Complex

main = print . sumU $ replicateU (1000000000 :: Int) (1 :+ 1 :: Complex Double)
Compiled with ghc 6.12 and uvector-0.1.1.0 on a 64-bit Linux box.
The -fvia-C -optc-O3 version is about 40% faster than -fasm.
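For readers not steeped in the Core, the fused sumU/replicateU above boils down to a strict worker loop of roughly this shape (a hand-written sketch, not the actual worker GHC generates; the name sumComplex and the small iteration count are mine, for illustration):

```haskell
{-# LANGUAGE BangPatterns #-}

-- Sketch of the strict worker that stream fusion produces here: two
-- Double accumulators (real and imaginary parts) threaded through a
-- tail call, corresponding to the Main_mainzuzdszdwfold entry below.
sumComplex :: Int -> (Double, Double)
sumComplex n = go 0 0 0
  where
    go :: Int -> Double -> Double -> (Double, Double)
    go !i !re !im
      | i == n    = (re, im)
      | otherwise = go (i + 1) (re + 1) (im + 1)

main :: IO ()
main = print (sumComplex 1000000)
```

Each iteration does two floating-point adds and an integer increment, which is why the quality of the backends' tail-call and FP register handling dominates the benchmark.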
How does it fare with the new SSE patches?
I've attached the assembly below for each case.
-- Don
------------------------------------------------------------------------
Fastest. 2.17s. About 40% faster than -fasm
$ time ./sum-complex
1.0e9 :+ 1.0e9
./sum-complex 2.16s user 0.00s system 99% cpu 2.175 total
Main_mainzuzdszdwfold_info:
leaq 32(%r12), %rax
movq %r12, %rdx
cmpq 144(%r13), %rax
movq %rax, %r12
ja .L4
cmpq $1000000000, %r14
je .L9
.L5:
movsd .LC0(%rip), %xmm0
leaq 1(%r14), %r14
addsd %xmm0, %xmm5
addsd %xmm0, %xmm6
movq %rdx, %r12
jmp Main_mainzuzdszdwfold_info
.L4:
leaq -24(%rbp), %rax
movq $32, 184(%r13)
movq %rax, %rbp
movq %r14, (%rax)
movsd %xmm5, 8(%rax)
movsd %xmm6, 16(%rax)
movl $Main_mainzuzdszdwfold_closure, %ebx
jmp *-8(%r13)
.L9:
movq $ghczmprim_GHCziTypes_Dzh_con_info, -24(%rax)
movsd %xmm5, -16(%rax)
movq $ghczmprim_GHCziTypes_Dzh_con_info, -8(%rax)
leaq 25(%rdx), %rbx
movsd %xmm6, 32(%rdx)
leaq 9(%rdx), %r14
jmp *(%rbp)
------------------------------------------------------------------------
Second, 2.34s
$ ghc-core sum-complex.hs -O2 -fvia-C -optc-O3
$ time ./sum-complex
1.0e9 :+ 1.0e9
./sum-complex 2.33s user 0.01s system 99% cpu 2.347 total
Main_mainzuzdszdwfold_info:
leaq 32(%r12), %rax
cmpq 144(%r13), %rax
movq %r12, %rdx
movq %rax, %r12
ja .L4
cmpq $100000000, %r14
je .L9
.L5:
movsd .LC0(%rip), %xmm0
leaq 1(%r14), %r14
movq %rdx, %r12
addsd %xmm0, %xmm5
addsd %xmm0, %xmm6
jmp Main_mainzuzdszdwfold_info
.L4:
leaq -24(%rbp), %rax
movq $32, 184(%r13)
movl $Main_mainzuzdszdwfold_closure, %ebx
movsd %xmm5, 8(%rax)
movq %rax, %rbp
movq %r14, (%rax)
movsd %xmm6, 16(%rax)
jmp *-8(%r13)
.L9:
movq $ghczmprim_GHCziTypes_Dzh_con_info, -24(%rax)
movsd %xmm5, -16(%rax)
movq $ghczmprim_GHCziTypes_Dzh_con_info, -8(%rax)
leaq 25(%rdx), %rbx
movsd %xmm6, 32(%rdx)
leaq 9(%rdx), %r14
jmp *(%rbp)
------------------------------------------------------------------------
Native codegen, 3.57s
ghc 6.12 -fasm -O2
$ time ./sum-complex
1.0e9 :+ 1.0e9
./sum-complex 3.57s user 0.01s system 99% cpu 3.574 total
Main_mainzuzdszdwfold_info:
.Lc1i7:
addq $32,%r12
cmpq 144(%r13),%r12
ja .Lc1ia
movq %r14,%rax
cmpq $100000000,%rax
jne .Lc1id
movq $ghczmprim_GHCziTypes_Dzh_con_info,-24(%r12)
movsd %xmm5,-16(%r12)
movq $ghczmprim_GHCziTypes_Dzh_con_info,-8(%r12)
movsd %xmm6,(%r12)
leaq -7(%r12),%rbx
leaq -23(%r12),%r14
jmp *(%rbp)
.Lc1ia:
movq $32,184(%r13)
movl $Main_mainzuzdszdwfold_closure,%ebx
addq $-24,%rbp
movq %r14,(%rbp)
movsd %xmm5,8(%rbp)
movsd %xmm6,16(%rbp)
jmp *-8(%r13)
.Lc1id:
movsd %xmm6,%xmm0
addsd .Ln1if(%rip),%xmm0
movsd %xmm5,%xmm7
addsd .Ln1ig(%rip),%xmm7
leaq 1(%rax),%r14
movsd %xmm7,%xmm5
movsd %xmm0,%xmm6
addq $-32,%r12
jmp Main_mainzuzdszdwfold_info
I managed to improve this:
Main_mainzuzdszdwfold_info:
.Lc1lP:
addq $32,%r12
cmpq 144(%r13),%r12
ja .Lc1lS
movq %r14,%rax
cmpq $1000000000,%rax
jne .Lc1lV
movq $ghczmprim_GHCziTypes_Dzh_con_info,-24(%r12)
movsd %xmm6,-16(%r12)
movq $ghczmprim_GHCziTypes_Dzh_con_info,-8(%r12)
movsd %xmm5,(%r12)
leaq -7(%r12),%rbx
leaq -23(%r12),%r14
jmp *(%rbp)
.Lc1lS:
movq $32,184(%r13)
movl $Main_mainzuzdszdwfold_closure,%ebx
addq $-24,%rbp
movsd %xmm5,(%rbp)
movsd %xmm6,8(%rbp)
movq %r14,16(%rbp)
jmp *-8(%r13)
.Lc1lV:
addsd .Ln1m2(%rip),%xmm5
addsd .Ln1m3(%rip),%xmm6
leaq 1(%rax),%r14
addq $-32,%r12
jmp Main_mainzuzdszdwfold_info
from 9 instructions in the last block down to 5 (one instruction fewer
than gcc). I haven't commoned up the two constant 1s though; that would
mean doing some CSE.
On my machine with GHC HEAD and gcc 4.3.0, the gcc version runs in 2.0s,
with the NCG at 2.3s. I put the difference down to a bit of instruction
scheduling done by gcc, and that extra constant load.
But let's face it, all of this code is crappy. It should be a tiny
little loop rather than a tail-call with argument passing, and that's
what we'll get with the new backend (eventually). LLVM probably won't
turn it into a loop on its own, that needs to be done before the code
gets passed to LLVM.
Have you looked at this example on x86? It's *far* worse and runs about
5 times slower.
Cheers,
Simon
_______________________________________________
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users