On 02/07/11 19:34, David Peixotto wrote:
I'm glad you caught my benchmarking error, because the new results
look quite different! Running the benchmarks with the -threaded
runtime shows that the actual slowdown is close to 30% for
GC-intensive programs.
In the fibon results, the average execution time was 12% slower with
llvm-gcc, and the average GC slowdown was 42%. In the nofib gc
benchmark results, the average execution time with llvm-gcc was 30%
longer.
Ok, that's bad. I'm not a Mac user, but I wouldn't put up with more
than 5% (and I'd be very unhappy about that).
While the results are disappointing, they seem reasonable after
looking at the code generated for accesses of the `gct` variable in
the GC. I had hoped that using pthread_getspecific would only require
a few inline assembly instructions, but the overhead is much higher.
Each access of `gct` in the GC calls `getThreadLocalVar`, the GHC
wrapper around pthread_getspecific. The call to pthread_getspecific
then goes through the dynamic linker, so we take an extra hit there.
The actual body of pthread_getspecific is just a mov followed by a
return.
The best we could hope for would be for an access of `gct` to turn
into something like this in the GC:
movq (%rdi),%rdi                     # deref the key, an index into TLS memory
movq %gs:0x00000060(,%rdi,8),%rax    # read the value indexed by the key
but it looks like we are getting something like this:
call getThreadLocalVar
movq (%rdi),%rdi                     # deref the key, an index into TLS memory
jmp <dynamic_linker_stub>
movq %gs:0x00000060(,%rdi,8),%rax    # pthread_getspecific body
ret
You don't need to go through getThreadLocalVar, right? Just call
pthread_getspecific directly. I don't know why it's going through the
dynamic linker stub; I thought it was supposed to be #defined to the
inline assembly.
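For concreteness, here is a rough sketch in C of the two access paths being
compared; `gct_key` and the wrapper name are illustrative, not the RTS's
actual identifiers:

    #include <pthread.h>

    /* Illustrative key; the real RTS key has its own declaration. */
    extern pthread_key_t gct_key;

    /* Out-of-line wrapper: every gct access pays a call here, plus (on this
       platform) a dynamic-linker stub for pthread_getspecific itself. */
    void *getThreadLocalVar_sketch (pthread_key_t *key)
    {
        return pthread_getspecific(*key);
    }

    /* Direct call, as suggested above: drops the extra call frame around
       pthread_getspecific, leaving only the library call itself. */
    #define GCT (pthread_getspecific(gct_key))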
Anyway, the last resort will be to pass gct as a parameter to the
critical functions in the GC - scavenge_block() and everything it calls
transitively, including evacuate(). This is likely to give quite good
performance, but not as good as a register variable, so unfortunately
we'll need some #ifdefery or macros, which will be quite ugly (which is
why I say this is a last resort).
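As a sketch of the #ifdef/macro shape this would take (the macro names
GCT_PARAM/GCT_ARG and the GCT_IN_REGISTER condition are invented for
illustration, and the types are stubbed so the fragment stands alone):

    /* Placeholder types so the sketch is self-contained. */
    typedef struct { int pad; } gc_thread;
    typedef struct { int pad; } bdescr;

    #if defined(THREADED_RTS) && !defined(GCT_IN_REGISTER)
      /* No register available for gct: thread it through the call chain. */
      #define GCT_PARAM , gc_thread *gct     /* extra formal parameter */
      #define GCT_ARG   , gct                /* extra actual argument  */
    #else
      /* gct is a global (ideally register) variable: no extra argument. */
      extern gc_thread *gct;
      #define GCT_PARAM
      #define GCT_ARG
    #endif

    static void evacuate (void *p GCT_PARAM)
    {
        (void)p;   /* the real evacuate copies *p using gct's workspace */
    }

    static void scavenge_block (bdescr *bd GCT_PARAM)
    {
        /* every transitive callee gets gct passed down explicitly */
        evacuate(bd GCT_ARG);
    }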
Cheers,
Simon
The call to getThreadLocalVar may be getting inlined in some places, but not at
the site I examined. I've included the detailed benchmark results below.
For the fibon results, a negative number indicates that llvm-gcc is slower.
Efficiency is the percent of the total execution time spent in the mutator
(i.e. outside the GC).
Fibon Results
------------------------------------------------------------------
Benchmark       MutCPUTime   GCCPUTime   TotalCPUTime   Efficiency
------------------------------------------------------------------
Agum                +0.17%     -53.48%        -11.49%       80.32%
BinaryTrees         +0.13%     -60.70%        -22.98%       68.96%
Blur                -0.26%     -15.27%         -0.43%       98.58%
Bzlib               -2.31%      -8.57%         -2.32%       99.89%
Chameneos          -15.65%     -37.53%        -15.74%       99.53%
Cpsa                -0.02%     -58.13%         -5.25%       91.32%
Crypto              -0.59%     -47.97%        -27.08%       52.18%
FFT2d               +2.09%     -33.66%         +0.30%       94.64%
FFT3d               -1.58%     -12.19%         -1.88%       96.58%
Fannkuch            -0.84%     -26.99%         -2.59%       92.64%
Fgl                 -0.40%     -50.78%        -21.16%       63.74%
Fst                 +0.32%     -66.43%        -13.21%       81.93%
Funsat              -1.36%     -44.08%        -18.94%       65.30%
Gf                  -5.38%     -44.43%        -17.56%       77.11%
HaLeX               +3.77%     -66.52%         +1.13%       96.30%
Happy               -0.98%     -59.06%        -25.67%       64.51%
Hgalib              -2.45%     -44.33%         -5.96%       91.67%
Laplace             +0.43%     -23.42%         -0.63%       95.09%
MMult               +1.04%     -13.62%         +0.48%       95.34%
Mandelbrot          +0.06%     -17.29%         +0.03%       99.78%
Nbody               -0.69%     -18.18%         -0.82%       98.99%
Palindromes         -2.83%     -82.54%        -52.78%       57.72%
Pappy               +0.17%     -44.32%        -38.64%       34.84%
Pidigits            +0.17%     -57.56%        -11.34%       81.62%
QuickCheck          +0.36%     -50.14%         -6.52%       87.62%
Regex               -1.14%     -35.26%         -2.78%       94.79%
Simgi               +1.39%     -41.70%        -10.15%       74.64%
SpectralNorm        +0.06%        ----         +0.06%      100.00%
TernaryTrees        +1.59%     -48.39%        -23.62%       58.03%
Xsact               -0.72%     -61.65%        -28.25%       63.44%
------------------------------------------------------------------
Min                -15.65%     -82.54%        -52.78%       34.84%
Mean                -0.85%     -42.21%        -12.19%       81.90%
Max                 +3.77%      -8.57%         +1.13%      100.00%
For the nofib results, a positive number means the llvm-gcc version was slower.
NoFib Results
------------------------------------------------------------
Program             Size  Allocs  Runtime  Elapsed  TotalMem
------------------------------------------------------------
circsim            +0.0%   +0.0%   +22.5%   +21.2%     -0.2%
constraints        +0.0%   +0.0%   +39.4%   +38.3%     +0.0%
fulsom             +0.0%   +0.0%   +23.7%   +22.2%     +7.1%
gc_bench           +0.1%   +0.0%   +68.7%   +67.8%     +0.3%
happy              +0.1%   +0.0%   +14.8%   +14.4%     +0.0%
lcss               +0.1%   +0.0%   +34.3%   +31.6%     +0.0%
mutstore1          +0.0%   +0.0%   +41.3%   +35.6%     +0.0%
mutstore2          +0.0%   +0.0%   +24.3%   +23.4%     +0.0%
power              +0.0%   +0.0%   +34.6%   +35.1%     +0.0%
spellcheck         +0.1%   +0.0%   +11.8%   +11.9%     +0.0%
------------------------------------------------------------
Min                +0.0%   +0.0%   +11.8%   +11.9%     -0.2%
Max                +0.1%   +0.0%   +68.7%   +67.8%     +7.1%
Geometric Mean     +0.0%   +0.0%   +30.7%   +29.3%     +0.7%
On Jul 1, 2011, at 2:45 PM, David Peixotto wrote:
On Jul 1, 2011, at 2:05 PM, Simon Marlow wrote:
On 30/06/11 17:43, David Peixotto wrote:
I have made the changes necessary to compile GHC with llvm-gcc. The
major change was to use the pthread API for thread-local storage to
access the gct variable during garbage collection. My measurements
indicate this causes an average slowdown of about 5% for GC-heavy
programs. The changes are available from the `clang` branch on my
github fork.
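For reference, the pthread TLS pattern in question looks roughly like the
following; the key and helper names are illustrative stand-ins, not the
identifiers used in the RTS:

    #include <pthread.h>
    #include <stdio.h>

    /* One process-wide key, created once at startup. */
    static pthread_key_t gct_key;

    static void  init_gct_key (void) { pthread_key_create(&gct_key, NULL); }
    static void  set_gct (void *gct) { pthread_setspecific(gct_key, gct); }
    static void *get_gct (void)      { return pthread_getspecific(gct_key); }

    int main (void)
    {
        int my_gc_state = 42;          /* stand-in for a per-thread GC state struct */
        init_gct_key();
        set_gct(&my_gc_state);         /* each GC thread stores its own pointer */
        printf("gct = %d\n", *(int *)get_gct());
        return 0;
    }

(Compile with -pthread. The point is that every read of gct becomes a call to
pthread_getspecific rather than a register or direct memory access.)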
Sounds good. One question: did you measure the GC performance with -threaded?
Because the thread-specific variable in the GC is only used with -threaded.
Oops, I totally forgot about that :\ Those numbers were actually for the
non-threaded runtime, so they don't measure the changes to the GC, just the
difference from compiling with llvm-gcc. I'll rerun the benchmarks with
-threaded. Sorry about that!
_______________________________________________
Cvs-ghc mailing list
[email protected]
http://www.haskell.org/mailman/listinfo/cvs-ghc