dons:
> marlowsd:
> >>>
> >>> Simon Marlow has recently fixed FP performance for modern x86 chips in
> >>> the native code generator in the HEAD. That was the last reason we know
> >>> of to prefer via-C to the native code generators. But before we start
> >>> the removal process, does anyone know of any other problems with the
> >>> native code generators that need to be fixed first?
> >>>
> >>
> >> Do we have the blessing of the DPH team, wrt. tight, numeric inner loops?
> >>
> >> As recently as last year -fvia-C -optc-O3 was still useful for some
> >> microbenchmarks -- what's changed in that time, or is expected to change?
> >
> > If you have benchmarks that show a significant difference, I'd be
> > interested to see them.
>
> I've attached an example where there's a 40% variation (and it's a
> floating point benchmark). Roman would be seeing similar examples in the
> vector code.

Here's an example that doesn't use floating point:

    import Data.Array.Vector
    import Data.Bits

    main = print . sumU $ zipWith3U (\x y z -> x * y * z)
                            (enumFromToU 1 (100000000 :: Int))
                            (enumFromToU 2 (100000001 :: Int))
                            (enumFromToU 7 (100000008 :: Int))

In core:

    main_$s$wfold :: Int# -> Int# -> Int# -> Int# -> Int#
    main_$s$wfold =
      \ (sc_s1l1 :: Int#)
        (sc1_s1l2 :: Int#)
        (sc2_s1l3 :: Int#)
        (sc3_s1l4 :: Int#) ->
        case ># sc2_s1l3 100000000 of _ {
          False ->
            case ># sc1_s1l2 100000001 of _ {
              False ->
                case ># sc_s1l1 100000008 of _ {
                  False ->
                    main_$s$wfold
                      (+# sc_s1l1 1)
                      (+# sc1_s1l2 1)
                      (+# sc2_s1l3 1)
                      (+# sc3_s1l4 (*# (*# sc2_s1l3 sc1_s1l2) sc_s1l1));
                  True -> sc3_s1l4
                };
              True -> sc3_s1l4
            };
          True -> sc3_s1l4
        }

Rather nice!

-fvia-C -optc-O3

    Main_mainzuzdszdwfold_info:
            cmpq    $100000000, %rdi
            jg      .L6
            cmpq    $100000001, %rsi
            jg      .L6
            cmpq    $100000008, %r14
            jle     .L10
    .L6:
            movq    %r8, %rbx
            movq    (%rbp), %rax
            jmp     *%rax
    .L10:
            movq    %rsi, %r10
            leaq    1(%rsi), %rsi
            imulq   %rdi, %r10
            leaq    1(%rdi), %rdi
            imulq   %r14, %r10
            leaq    1(%r14), %r14
            leaq    (%r10,%r8), %r8
            jmp     Main_mainzuzdszdwfold_info

Which looks ok.

    $ time ./zipwith3
    3541230156834269568
    ./zipwith3  0.33s user 0.00s system 99% cpu 0.337 total

And with -fasm we get very different code, and a bit of a slowdown:

    Main_mainzuzdszdwfold_info:
    .Lc1mo:
            cmpq    $100000000,%rdi
            jg      .Lc1mq
            cmpq    $100000001,%rsi
            jg      .Lc1ms
            cmpq    $100000008,%r14
            jg      .Lc1mv

            movq    %rsi,%rax
            imulq   %r14,%rax
            movq    %rdi,%rcx
            imulq   %rax,%rcx
            movq    %r8,%rax
            addq    %rcx,%rax
            leaq    1(%rdi),%rcx
            leaq    1(%rsi),%rdx
            incq    %r14
            movq    %rdx,%rsi
            movq    %rcx,%rdi
            movq    %rax,%r8
            jmp     Main_mainzuzdszdwfold_info

    .Lc1mq:
            movq    %r8,%rbx
            jmp     *(%rbp)
    .Lc1ms:
            movq    %r8,%rbx
            jmp     *(%rbp)
    .Lc1mv:
            movq    %r8,%rbx
            jmp     *(%rbp)

Slower:

    $ time ./zipwith3
    3541230156834269568
    ./zipwith3  0.38s user 0.00s system 98% cpu 0.384 total

Now maybe we need to wait on the new backend optimizations to get there?

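For anyone who wants to poke at the same loop without the vector library installed, the fused worker in the Core above can be approximated by hand. This is only a sketch: the names fusedLoop, go and hi are made up for illustration and do not come from the library or from GHC's output, and it assumes the Core shown above is the entire hot loop.

```haskell
{-# LANGUAGE BangPatterns #-}

-- Hand-written analogue of the worker main_$s$wfold from the Core dump.
-- x, y and z stand in for the three enumFromToU counters and acc is the
-- running sum. 'hi' is a hypothetical parameter replacing the literal
-- upper bound 100000000 so that small cases can be checked by hand.
fusedLoop :: Int -> Int
fusedLoop hi = go 1 2 7 0
  where
    go :: Int -> Int -> Int -> Int -> Int
    go !x !y !z !acc
      | x > hi     = acc   -- first enumeration exhausted
      | y > hi + 1 = acc   -- second enumeration exhausted
      | z > hi + 8 = acc   -- third enumeration exhausted
      | otherwise  = go (x + 1) (y + 1) (z + 1) (acc + x * y * z)

main :: IO ()
main = print (fusedLoop 100000000)
```

Building this once with -fvia-C -optc-O3 and once with -fasm would give a second data point for comparing the two backends on the same arithmetic. No timings are claimed for it here, and GHC may well compile the hand-written version differently from the fused one.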
-- Don