On 15/02/2010 18:29, Don Stewart wrote:
marlowsd:

Simon Marlow has recently fixed FP performance for modern x86 chips in
the native code generator in the HEAD. That was the last reason we know
of to prefer via-C to the native code generators. But before we start
the removal process, does anyone know of any other problems with the
native code generators that need to be fixed first?


Do we have the blessing of the DPH team, wrt. tight, numeric inner loops?

As recently as last year -fvia-C -optc-O3 was still useful for some
microbenchmarks -- what's changed in that time, or is expected to change?

If you have benchmarks that show a significant difference, I'd be
interested to see them.

I've attached an example where there's a 40% variation (and it's a
floating point benchmark). Roman would be seeing similar examples in the
vector code.

I'm all in favor of dropping the C backend, but I'm also wary that we
don't have benchmarks to know what difference it is making.

Here's a simple program testing a tight, floating point loop:

     import Data.Array.Vector
     import Data.Complex

     main = print . sumU $ replicateU (1000000000 :: Int) (1 :+ 1 :: Complex Double)

Compiled with GHC 6.12 and uvector-0.1.1.0 on a 64-bit Linux box.
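
Roughly the same loop can be written against the vector package, for anyone
reproducing this without uvector -- an untested sketch, assuming vector's
Unbox instance for Complex; stream fusion should again collapse the
replicate/sum pair into a single accumulating loop:

     -- Sketch: the benchmark above, ported to Data.Vector.Unboxed.
     import qualified Data.Vector.Unboxed as U
     import Data.Complex

     main :: IO ()
     main = print . U.sum $ U.replicate 1000000000 (1 :+ 1 :: Complex Double)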

The -fvia-C -optc-O3 version is about 40% faster than -fasm.
How does it fare with the new SSE patches?

I've attached the assembly below for each case.

-- Don


------------------------------------------------------------------------

Fastest. 2.17s. About 40% faster than -fasm

     $ time ./sum-complex
     1.0e9 :+ 1.0e9
     ./sum-complex  2.16s user 0.00s system 99% cpu 2.175 total

Main_mainzuzdszdwfold_info:
         leaq    32(%r12), %rax
         movq    %r12, %rdx
         cmpq    144(%r13), %rax
         movq    %rax, %r12
         ja      .L4
         cmpq    $1000000000, %r14
         je      .L9
.L5:                            # inner loop (6 instructions)
         movsd   .LC0(%rip), %xmm0
         leaq    1(%r14), %r14
         addsd   %xmm0, %xmm5
         addsd   %xmm0, %xmm6
         movq    %rdx, %r12
         jmp     Main_mainzuzdszdwfold_info

.L4:                            # heap check failed: save state and call the GC
         leaq    -24(%rbp), %rax
         movq    $32, 184(%r13)
         movq    %rax, %rbp
         movq    %r14, (%rax)
         movsd   %xmm5, 8(%rax)
         movsd   %xmm6, 16(%rax)
         movl    $Main_mainzuzdszdwfold_closure, %ebx
         jmp     *-8(%r13)
.L9:                            # loop done: box the two Doubles and return
         movq    $ghczmprim_GHCziTypes_Dzh_con_info, -24(%rax)
         movsd   %xmm5, -16(%rax)
         movq    $ghczmprim_GHCziTypes_Dzh_con_info, -8(%rax)
         leaq    25(%rdx), %rbx
         movsd   %xmm6, 32(%rdx)
         leaq    9(%rdx), %r14
         jmp     *(%rbp)

------------------------------------------------------------------------

Second, 2.34s

     $ ghc-core sum-complex.hs -O2 -fvia-C -optc-O3
     $ time ./sum-complex
     1.0e9 :+ 1.0e9
     ./sum-complex  2.33s user 0.01s system 99% cpu 2.347 total

Main_mainzuzdszdwfold_info:
         leaq    32(%r12), %rax
         cmpq    144(%r13), %rax
         movq    %r12, %rdx
         movq    %rax, %r12
         ja      .L4
         cmpq    $100000000, %r14
         je      .L9
.L5:                            # inner loop
         movsd   .LC0(%rip), %xmm0
         leaq    1(%r14), %r14
         movq    %rdx, %r12
         addsd   %xmm0, %xmm5
         addsd   %xmm0, %xmm6
         jmp     Main_mainzuzdszdwfold_info

.L4:
         leaq    -24(%rbp), %rax
         movq    $32, 184(%r13)
         movl    $Main_mainzuzdszdwfold_closure, %ebx
         movsd   %xmm5, 8(%rax)
         movq    %rax, %rbp
         movq    %r14, (%rax)
         movsd   %xmm6, 16(%rax)
         jmp     *-8(%r13)

.L9:
         movq    $ghczmprim_GHCziTypes_Dzh_con_info, -24(%rax)
         movsd   %xmm5, -16(%rax)
         movq    $ghczmprim_GHCziTypes_Dzh_con_info, -8(%rax)
         leaq    25(%rdx), %rbx
         movsd   %xmm6, 32(%rdx)
         leaq    9(%rdx), %r14
         jmp     *(%rbp)

------------------------------------------------------------------------

Native codegen, 3.57s

  ghc 6.12 -fasm -O2
  $ time ./sum-complex
  1.0e9 :+ 1.0e9
  ./sum-complex  3.57s user 0.01s system 99% cpu 3.574 total


Main_mainzuzdszdwfold_info:
.Lc1i7:
         addq $32,%r12
         cmpq 144(%r13),%r12
         ja .Lc1ia
         movq %r14,%rax
         cmpq $100000000,%rax
         jne .Lc1id
         movq $ghczmprim_GHCziTypes_Dzh_con_info,-24(%r12)
         movsd %xmm5,-16(%r12)
         movq $ghczmprim_GHCziTypes_Dzh_con_info,-8(%r12)
         movsd %xmm6,(%r12)
         leaq -7(%r12),%rbx
         leaq -23(%r12),%r14
         jmp *(%rbp)
.Lc1ia:
         movq $32,184(%r13)
         movl $Main_mainzuzdszdwfold_closure,%ebx
         addq $-24,%rbp
         movq %r14,(%rbp)
         movsd %xmm5,8(%rbp)
         movsd %xmm6,16(%rbp)
         jmp *-8(%r13)
.Lc1id:                         # inner loop (9 instructions)
         movsd %xmm6,%xmm0
         addsd .Ln1if(%rip),%xmm0
         movsd %xmm5,%xmm7
         addsd .Ln1ig(%rip),%xmm7
         leaq 1(%rax),%r14
         movsd %xmm7,%xmm5
         movsd %xmm0,%xmm6
         addq $-32,%r12
         jmp Main_mainzuzdszdwfold_info



I managed to improve this:

Main_mainzuzdszdwfold_info:
.Lc1lP:
        addq $32,%r12
        cmpq 144(%r13),%r12
        ja .Lc1lS
        movq %r14,%rax
        cmpq $1000000000,%rax
        jne .Lc1lV
        movq $ghczmprim_GHCziTypes_Dzh_con_info,-24(%r12)
        movsd %xmm6,-16(%r12)
        movq $ghczmprim_GHCziTypes_Dzh_con_info,-8(%r12)
        movsd %xmm5,(%r12)
        leaq -7(%r12),%rbx
        leaq -23(%r12),%r14
        jmp *(%rbp)
.Lc1lS:
        movq $32,184(%r13)
        movl $Main_mainzuzdszdwfold_closure,%ebx
        addq $-24,%rbp
        movsd %xmm5,(%rbp)
        movsd %xmm6,8(%rbp)
        movq %r14,16(%rbp)
        jmp *-8(%r13)
.Lc1lV:                         # inner loop (now 5 instructions)
        addsd .Ln1m2(%rip),%xmm5
        addsd .Ln1m3(%rip),%xmm6
        leaq 1(%rax),%r14
        addq $-32,%r12
        jmp Main_mainzuzdszdwfold_info


from 9 instructions in the last block down to 5 (one instruction fewer than gcc's 6). I haven't commoned up the two 1.0 constants, though; that would mean doing some CSE.

On my machine with GHC HEAD and gcc 4.3.0, the gcc version runs in 2.0s, with the NCG at 2.3s. I put the difference down to a bit of instruction scheduling done by gcc, and that extra constant load.

But let's face it, all of this code is crappy. It should be a tiny loop rather than a tail call with argument passing, and that's what we'll get with the new backend (eventually). LLVM probably won't turn it into a loop on its own; that needs to happen before the code gets passed to LLVM.
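
For concreteness, the worker all three backends are compiling is essentially
the following strict tail call (a sketch with made-up names; BangPatterns
assumed) -- an Int counter and two Double accumulators passed on every
iteration, i.e. the Int in %r14 and the values in %xmm5/%xmm6 above. The
ideal code would keep all three in registers around a tiny compare-and-add
loop:

     {-# LANGUAGE BangPatterns #-}
     import Data.Complex

     -- Sketch of the fused worker (hypothetical name sumLoop).
     sumLoop :: Int -> Double -> Double -> Complex Double
     sumLoop !i !re !im
       | i == 1000000000 = re :+ im                  -- box and return
       | otherwise       = sumLoop (i + 1) (re + 1) (im + 1)

     main :: IO ()
     main = print (sumLoop 0 0 0)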

Have you looked at this example on 32-bit x86? It's *far* worse and runs about 5 times slower.

Cheers,
        Simon