I'm glad you caught my benchmarking error because the new results look quite
different! Running the benchmarks with the -threaded runtime shows that the
actual slowdown is close to 30% for GC-intensive programs.
In the Fibon results, the average total execution time was 12% slower for
llvm-gcc, and the average GC slowdown was 42%. In the nofib GC benchmark
results, the average execution time for llvm-gcc was 30% longer.
While the results are disappointing, they seem reasonable after taking a look
at the code generated for accessing the `gct` variable in the GC. I had hoped
using pthread_getspecific would just require a few inlined assembly
instructions, but it looks like the overhead is much higher. Each access to the
`gct` variable in the GC calls `getThreadLocalVar`, which is the GHC wrapper for
pthread_getspecific. The call to pthread_getspecific itself then goes through
the dynamic linker, so we take an extra hit there. The actual body of
pthread_getspecific is just a mov followed by a return.
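For reference, the pthread path looks roughly like this in C (just a sketch of
the standard pthread API usage; the key name and the gc_thread cast are
illustrative, not the exact RTS source):

  #include <pthread.h>

  /* Sketch: the current GC worker's gc_thread pointer is stored under a
     pthread key at thread startup and re-fetched through a wrapper on
     every use of gct. */
  static pthread_key_t gctKey;    /* created once with pthread_key_create */

  void *getThreadLocalVar(pthread_key_t *key)
  {
      /* out-of-line call, resolved through the dynamic linker (PLT) */
      return pthread_getspecific(*key);
  }

  #define gct ((gc_thread *)getThreadLocalVar(&gctKey))

So every access to gct pays for two calls: the wrapper itself and the
PLT-indirected pthread_getspecific.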
The best we could hope for would be for an access to `gct` to turn into
something like this in the GC:
movq (%rdi),%rdi                     # deref the key, which is an index into the TLS area
movq %gs:0x00000060(,%rdi,8),%rax    # read the value indexed by the key
but it looks like we are getting something like this:
call getThreadLocalVar               # call the GHC wrapper around pthread_getspecific
movq (%rdi),%rdi                     # deref the key, which is an index into the TLS area
jmp <dynamic_linker_stub>            # indirect jump through the dynamic linker (PLT)
movq %gs:0x00000060(,%rdi,8),%rax    # pthread_getspecific body
ret
The call to getThreadLocalVar may be getting inlined in some places, but not at
the site I examined. A possible mitigation is sketched below, followed by the
detailed benchmark results.
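One obvious thing to try (just a sketch, assuming the wrapper can be moved into
a header that is visible at every call site) would be to make it static inline,
so the only remaining call is the PLT-indirected pthread_getspecific:

  /* Hypothetical mitigation: a static inline wrapper in a shared header
     lets the compiler drop the extra call at each gct access. */
  static inline void *getThreadLocalVar(pthread_key_t *key)
  {
      return pthread_getspecific(*key);
  }

This would still leave the dynamic-linker indirection on the
pthread_getspecific call itself, so it could only recover part of the overhead.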
For the Fibon results, a negative number indicates that llvm-gcc is slower.
Efficiency is the percentage of total execution time spent in the mutator
(i.e., outside the GC).
Fibon Results
-----------------------------------------------------------------
Program         MutCPUTime   GCCPUTime   TotalCPUTime  Efficiency
-----------------------------------------------------------------
Agum                +0.17%     -53.48%        -11.49%      80.32%
BinaryTrees         +0.13%     -60.70%        -22.98%      68.96%
Blur                -0.26%     -15.27%         -0.43%      98.58%
Bzlib               -2.31%      -8.57%         -2.32%      99.89%
Chameneos          -15.65%     -37.53%        -15.74%      99.53%
Cpsa                -0.02%     -58.13%         -5.25%      91.32%
Crypto              -0.59%     -47.97%        -27.08%      52.18%
FFT2d               +2.09%     -33.66%         +0.30%      94.64%
FFT3d               -1.58%     -12.19%         -1.88%      96.58%
Fannkuch            -0.84%     -26.99%         -2.59%      92.64%
Fgl                 -0.40%     -50.78%        -21.16%      63.74%
Fst                 +0.32%     -66.43%        -13.21%      81.93%
Funsat              -1.36%     -44.08%        -18.94%      65.30%
Gf                  -5.38%     -44.43%        -17.56%      77.11%
HaLeX               +3.77%     -66.52%         +1.13%      96.30%
Happy               -0.98%     -59.06%        -25.67%      64.51%
Hgalib              -2.45%     -44.33%         -5.96%      91.67%
Laplace             +0.43%     -23.42%         -0.63%      95.09%
MMult               +1.04%     -13.62%         +0.48%      95.34%
Mandelbrot          +0.06%     -17.29%         +0.03%      99.78%
Nbody               -0.69%     -18.18%         -0.82%      98.99%
Palindromes         -2.83%     -82.54%        -52.78%      57.72%
Pappy               +0.17%     -44.32%        -38.64%      34.84%
Pidigits            +0.17%     -57.56%        -11.34%      81.62%
QuickCheck          +0.36%     -50.14%         -6.52%      87.62%
Regex               -1.14%     -35.26%         -2.78%      94.79%
Simgi               +1.39%     -41.70%        -10.15%      74.64%
SpectralNorm        +0.06%        ----         +0.06%     100.00%
TernaryTrees        +1.59%     -48.39%        -23.62%      58.03%
Xsact               -0.72%     -61.65%        -28.25%      63.44%
-----------------------------------------------------------------
Min                -15.65%     -82.54%        -52.78%      34.84%
Mean                -0.85%     -42.21%        -12.19%      81.90%
Max                 +3.77%      -8.57%         +1.13%     100.00%
For the nofib results, a positive number means the llvm-gcc version was slower.
NoFib Results
----------------------------------------------------------------
Program             Size    Allocs   Runtime   Elapsed  TotalMem
----------------------------------------------------------------
circsim            +0.0%     +0.0%    +22.5%    +21.2%     -0.2%
constraints        +0.0%     +0.0%    +39.4%    +38.3%     +0.0%
fulsom             +0.0%     +0.0%    +23.7%    +22.2%     +7.1%
gc_bench           +0.1%     +0.0%    +68.7%    +67.8%     +0.3%
happy              +0.1%     +0.0%    +14.8%    +14.4%     +0.0%
lcss               +0.1%     +0.0%    +34.3%    +31.6%     +0.0%
mutstore1          +0.0%     +0.0%    +41.3%    +35.6%     +0.0%
mutstore2          +0.0%     +0.0%    +24.3%    +23.4%     +0.0%
power              +0.0%     +0.0%    +34.6%    +35.1%     +0.0%
spellcheck         +0.1%     +0.0%    +11.8%    +11.9%     +0.0%
----------------------------------------------------------------
Min                +0.0%     +0.0%    +11.8%    +11.9%     -0.2%
Max                +0.1%     +0.0%    +68.7%    +67.8%     +7.1%
Geometric Mean     +0.0%     +0.0%    +30.7%    +29.3%     +0.7%
On Jul 1, 2011, at 2:45 PM, David Peixotto wrote:
>
> On Jul 1, 2011, at 2:05 PM, Simon Marlow wrote:
>
>> On 30/06/11 17:43, David Peixotto wrote:
>>> I have made the changes necessary to compile GHC with llvm-gcc. The
>>> major change was to use the pthread api for thread level storage to
>>> access the gct variable during garbage collection. My measurements
>>> indicate this causes an average slowdown of about 5% for gc heavy
>>> programs. The changes are available from the `clang` branch on my
>>> github fork.
>>
>> Sounds good. One question: did you measure the GC performance with
>> -threaded? Because the thread-specific variable in the GC is only used with
>> -threaded.
>>
>
> Oops, I totally forgot about that :\ Those numbers were actually for the
> non-threaded runtime, so they don't measure the changes to the GC just the
> difference in compiling with llvm-gcc. I'll rerun the benchmarks with
> -threaded. Sorry about that!
>
_______________________________________________
Cvs-ghc mailing list
[email protected]
http://www.haskell.org/mailman/listinfo/cvs-ghc