On 02/07/11 19:34, David Peixotto wrote:

I'm glad you caught my benchmarking error, because the new results
look quite different! Running the benchmarks with the -threaded
runtime shows that the actual slowdown is close to 30% for GC-intensive
programs.

In the fibon results, the average execution time was 12% slower for
llvm-gcc, and the average GC slowdown was 42%. In the nofib gc
benchmark results, the average execution time for llvm-gcc was 30%
longer.

Ok, that's bad. I'm not a Mac user, but I wouldn't put up with more than 5% (and I'd be very unhappy about that).

While the results are disappointing, they seem reasonable after
taking a look at the code generated for accessing the `gct`
variable in the GC. I had hoped pthread_getspecific would boil down
to a few inline assembly instructions, but it looks like the
overhead is much higher. When the GC accesses the `gct` variable,
it calls `getThreadLocalVar`, the GHC wrapper for
pthread_getspecific. The call to pthread_getspecific then goes
through the dynamic linker, so we take an extra hit there. The body
of pthread_getspecific itself is just a mov followed by a return.

The best we could hope for would be for an access of `gct` to turn
into something like this in the GC:
     movq    (%rdi),%rdi                  # deref the key, an index into the TLS memory
     movq    %gs:0x00000060(,%rdi,8),%rax # read the value indexed by the key

but it looks like we are getting something like this:

     call    getThreadLocalVar
     movq    (%rdi),%rdi                  # deref the key, an index into the TLS memory
     jmp     <dynamic_linker_stub>
     movq    %gs:0x00000060(,%rdi,8),%rax # pthread_getspecific body
     ret

You don't need to go through getThreadLocalVar, right? Just call pthread_getspecific directly. I don't know why it's going through the dynamic linker stub; I thought it was supposed to be #defined to the inline assembly.

Anyway, the last resort will be to pass gct as a parameter to the critical functions in the GC - scavenge_block() and everything it calls transitively, including evacuate(). This is likely to give quite good performance, though not as good as a register variable, so unfortunately we'll need some #ifdefery or macros, which will be quite ugly (hence why I say this is a last resort).

Cheers,
        Simon


The call to getThreadLocalVar may be getting inlined in some places, but not at
the site I examined. I've included the detailed benchmark results below.

For the fibon results, a negative number indicates that llvm-gcc is slower.
Efficiency is the percentage of total execution time spent in the mutator,
outside the GC (so SpectralNorm, which does essentially no GC, shows 100%).

Fibon Results
-----------------------------------------------------------------
                 MutCPUTime    GCCPUTime TotalCPUTime   Efficiency
-----------------------------------------------------------------
Agum                +0.17%      -53.48%      -11.49%       80.32%
BinaryTrees         +0.13%      -60.70%      -22.98%       68.96%
Blur                -0.26%      -15.27%       -0.43%       98.58%
Bzlib               -2.31%       -8.57%       -2.32%       99.89%
Chameneos          -15.65%      -37.53%      -15.74%       99.53%
Cpsa                -0.02%      -58.13%       -5.25%       91.32%
Crypto              -0.59%      -47.97%      -27.08%       52.18%
FFT2d               +2.09%      -33.66%       +0.30%       94.64%
FFT3d               -1.58%      -12.19%       -1.88%       96.58%
Fannkuch            -0.84%      -26.99%       -2.59%       92.64%
Fgl                 -0.40%      -50.78%      -21.16%       63.74%
Fst                 +0.32%      -66.43%      -13.21%       81.93%
Funsat              -1.36%      -44.08%      -18.94%       65.30%
Gf                  -5.38%      -44.43%      -17.56%       77.11%
HaLeX               +3.77%      -66.52%       +1.13%       96.30%
Happy               -0.98%      -59.06%      -25.67%       64.51%
Hgalib              -2.45%      -44.33%       -5.96%       91.67%
Laplace             +0.43%      -23.42%       -0.63%       95.09%
MMult               +1.04%      -13.62%       +0.48%       95.34%
Mandelbrot          +0.06%      -17.29%       +0.03%       99.78%
Nbody               -0.69%      -18.18%       -0.82%       98.99%
Palindromes         -2.83%      -82.54%      -52.78%       57.72%
Pappy               +0.17%      -44.32%      -38.64%       34.84%
Pidigits            +0.17%      -57.56%      -11.34%       81.62%
QuickCheck          +0.36%      -50.14%       -6.52%       87.62%
Regex               -1.14%      -35.26%       -2.78%       94.79%
Simgi               +1.39%      -41.70%      -10.15%       74.64%
SpectralNorm        +0.06%         ----       +0.06%      100.00%
TernaryTrees        +1.59%      -48.39%      -23.62%       58.03%
Xsact               -0.72%      -61.65%      -28.25%       63.44%
-----------------------------------------------------------------
Min                -15.65%      -82.54%      -52.78%       34.84%
Mean                -0.85%      -42.21%      -12.19%       81.90%
Max                 +3.77%       -8.57%       +1.13%      100.00%


For the nofib results, a positive number means the llvm-gcc version was slower.

NoFib Results
------------------------------------------------------------------------------
         Program           Size    Allocs   Runtime   Elapsed  TotalMem
------------------------------------------------------------------------------
         circsim          +0.0%     +0.0%    +22.5%    +21.2%     -0.2%
     constraints          +0.0%     +0.0%    +39.4%    +38.3%     +0.0%
          fulsom          +0.0%     +0.0%    +23.7%    +22.2%     +7.1%
        gc_bench          +0.1%     +0.0%    +68.7%    +67.8%     +0.3%
           happy          +0.1%     +0.0%    +14.8%    +14.4%     +0.0%
            lcss          +0.1%     +0.0%    +34.3%    +31.6%     +0.0%
       mutstore1          +0.0%     +0.0%    +41.3%    +35.6%     +0.0%
       mutstore2          +0.0%     +0.0%    +24.3%    +23.4%     +0.0%
           power          +0.0%     +0.0%    +34.6%    +35.1%     +0.0%
      spellcheck          +0.1%     +0.0%    +11.8%    +11.9%     +0.0%
------------------------------------------------------------------------------
             Min          +0.0%     +0.0%    +11.8%    +11.9%     -0.2%
             Max          +0.1%     +0.0%    +68.7%    +67.8%     +7.1%
  Geometric Mean          +0.0%     +0.0%    +30.7%    +29.3%     +0.7%

On Jul 1, 2011, at 2:45 PM, David Peixotto wrote:


On Jul 1, 2011, at 2:05 PM, Simon Marlow wrote:

On 30/06/11 17:43, David Peixotto wrote:
I have made the changes necessary to compile GHC with llvm-gcc. The
major change was to use the pthread API for thread-local storage to
access the gct variable during garbage collection. My measurements
indicate this causes an average slowdown of about 5% for GC-heavy
programs. The changes are available from the `clang` branch on my
github fork.

Sounds good.  One question: did you measure the GC performance with -threaded?  
Because the thread-specific variable in the GC is only used with -threaded.


Oops, I totally forgot about that :\ Those numbers were actually for the
non-threaded runtime, so they don't measure the changes to the GC, just the
difference from compiling with llvm-gcc. I'll rerun the benchmarks with
-threaded. Sorry about that!


_______________________________________________
Cvs-ghc mailing list
[email protected]
http://www.haskell.org/mailman/listinfo/cvs-ghc



