I'm glad you caught my benchmarking error because the new results look quite
different! Running the benchmarks with the -threaded runtime shows that the
actual slowdown is close to 30% for GC-intensive programs.
In the Fibon results, the average total execution time was 12% slower for
llvm-gcc, and the average GC slowdown was 42%. In the nofib GC benchmark
results, the average execution time for llvm-gcc was 30% longer.
While the results are disappointing, they seem reasonable after taking a look
at the code generated for accessing the `gct` variable in the GC. I had hoped
using pthread_getspecific would just require a few inlined assembly
instructions, but it looks like the overhead is much higher. Each access to the
`gct` variable in the GC calls `getThreadLocalVar`, which is the GHC wrapper for
pthread_getspecific. The call to pthread_getspecific itself then goes through
the dynamic linker, so we take an extra hit there. The actual body of
pthread_getspecific is just a mov followed by a return.
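For reference, the pthread path looks roughly like this in C (just a sketch of
the standard pthread API usage; the key name and the gc_thread cast are
illustrative, not the exact RTS source):

  #include <pthread.h>

  /* Sketch: the current GC worker's gc_thread pointer is stored under a
     pthread key at thread startup and re-fetched through a wrapper on
     every use of gct. */
  static pthread_key_t gctKey;    /* created once with pthread_key_create */

  void *getThreadLocalVar(pthread_key_t *key)
  {
      /* out-of-line call, resolved through the dynamic linker (PLT) */
      return pthread_getspecific(*key);
  }

  #define gct ((gc_thread *)getThreadLocalVar(&gctKey))

So every access to gct pays for two calls: the wrapper itself and the
PLT-indirected pthread_getspecific.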
The best we could hope for would be for an access to `gct` to turn into
something like this in the GC:
movq (%rdi),%rdi                     # deref the key, which is an index into the TLS area
movq %gs:0x00000060(,%rdi,8),%rax    # read the value indexed by the key
but it looks like we are getting something like this:
call getThreadLocalVar               # call the GHC wrapper around pthread_getspecific
movq (%rdi),%rdi                     # deref the key, which is an index into the TLS area
jmp <dynamic_linker_stub>            # indirect jump through the dynamic linker (PLT)
movq %gs:0x00000060(,%rdi,8),%rax    # pthread_getspecific body
ret
The call to getThreadLocalVar may be getting inlined in some places, but not at
the site I examined. A possible mitigation is sketched below, followed by the
detailed benchmark results.
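One obvious thing to try (just a sketch, assuming the wrapper can be moved into
a header that is visible at every call site) would be to make it static inline,
so the only remaining call is the PLT-indirected pthread_getspecific:

  /* Hypothetical mitigation: a static inline wrapper in a shared header
     lets the compiler drop the extra call at each gct access. */
  static inline void *getThreadLocalVar(pthread_key_t *key)
  {
      return pthread_getspecific(*key);
  }

This would still leave the dynamic-linker indirection on the
pthread_getspecific call itself, so it could only recover part of the overhead.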
For the Fibon results, a negative number indicates that llvm-gcc is slower.
Efficiency is the percentage of total execution time spent in the mutator
(i.e., outside the GC).
Fibon Results
-----------------------------------------------------------------
Program         MutCPUTime   GCCPUTime   TotalCPUTime  Efficiency
-----------------------------------------------------------------
Agum                +0.17%     -53.48%        -11.49%      80.32%
BinaryTrees         +0.13%     -60.70%        -22.98%      68.96%
Blur                -0.26%     -15.27%         -0.43%      98.58%
Bzlib               -2.31%      -8.57%         -2.32%      99.89%
Chameneos          -15.65%     -37.53%        -15.74%      99.53%
Cpsa                -0.02%     -58.13%         -5.25%      91.32%
Crypto              -0.59%     -47.97%        -27.08%      52.18%
FFT2d               +2.09%     -33.66%         +0.30%      94.64%
FFT3d               -1.58%     -12.19%         -1.88%      96.58%
Fannkuch            -0.84%     -26.99%         -2.59%      92.64%
Fgl                 -0.40%     -50.78%        -21.16%      63.74%
Fst                 +0.32%     -66.43%        -13.21%      81.93%
Funsat              -1.36%     -44.08%        -18.94%      65.30%
Gf                  -5.38%     -44.43%        -17.56%      77.11%
HaLeX               +3.77%     -66.52%         +1.13%      96.30%
Happy               -0.98%     -59.06%        -25.67%      64.51%
Hgalib              -2.45%     -44.33%         -5.96%      91.67%
Laplace             +0.43%     -23.42%         -0.63%      95.09%
MMult               +1.04%     -13.62%         +0.48%      95.34%
Mandelbrot          +0.06%     -17.29%         +0.03%      99.78%
Nbody               -0.69%     -18.18%         -0.82%      98.99%
Palindromes         -2.83%     -82.54%        -52.78%      57.72%
Pappy               +0.17%     -44.32%        -38.64%      34.84%
Pidigits            +0.17%     -57.56%        -11.34%      81.62%
QuickCheck          +0.36%     -50.14%         -6.52%      87.62%
Regex               -1.14%     -35.26%         -2.78%      94.79%
Simgi               +1.39%     -41.70%        -10.15%      74.64%
SpectralNorm        +0.06%        ----         +0.06%     100.00%
TernaryTrees        +1.59%     -48.39%        -23.62%      58.03%
Xsact               -0.72%     -61.65%        -28.25%      63.44%
-----------------------------------------------------------------
Min                -15.65%     -82.54%        -52.78%      34.84%
Mean                -0.85%     -42.21%        -12.19%      81.90%
Max                 +3.77%      -8.57%         +1.13%     100.00%
For the nofib results, a positive number means the llvm-gcc version was slower.
NoFib Results
----------------------------------------------------------------
Program             Size    Allocs   Runtime   Elapsed  TotalMem
----------------------------------------------------------------
circsim            +0.0%     +0.0%    +22.5%    +21.2%     -0.2%
constraints        +0.0%     +0.0%    +39.4%    +38.3%     +0.0%
fulsom             +0.0%     +0.0%    +23.7%    +22.2%     +7.1%
gc_bench           +0.1%     +0.0%    +68.7%    +67.8%     +0.3%
happy              +0.1%     +0.0%    +14.8%    +14.4%     +0.0%
lcss               +0.1%     +0.0%    +34.3%    +31.6%     +0.0%
mutstore1          +0.0%     +0.0%    +41.3%    +35.6%     +0.0%
mutstore2          +0.0%     +0.0%    +24.3%    +23.4%     +0.0%
power              +0.0%     +0.0%    +34.6%    +35.1%     +0.0%
spellcheck         +0.1%     +0.0%    +11.8%    +11.9%     +0.0%
----------------------------------------------------------------
Min                +0.0%     +0.0%    +11.8%    +11.9%     -0.2%
Max                +0.1%     +0.0%    +68.7%    +67.8%     +7.1%
Geometric Mean     +0.0%     +0.0%    +30.7%    +29.3%     +0.7%
On Jul 1, 2011, at 2:45 PM, David Peixotto wrote:
>
> On Jul 1, 2011, at 2:05 PM, Simon Marlow wrote:
>
>> On 30/06/11 17:43, David Peixotto wrote:
>>> I have made the changes necessary to compile GHC with llvm-gcc. The
>>> major change was to use the pthread api for thread level storage to
>>> access the gct variable during garbage collection. My measurements
>>> indicate this causes an average slowdown of about 5% for gc heavy
>>> programs. The changes are available from the `clang` branch on my
>>> github fork.
>>
>> Sounds good. One question: did you measure the GC performance with
>> -threaded? Because the thread-specific variable in the GC is only used with
>> -threaded.
>>
>
> Oops, I totally forgot about that :\ Those numbers were actually for the
> non-threaded runtime, so they don't measure the changes to the GC just the
> difference in compiling with llvm-gcc. I'll rerun the benchmarks with
> -threaded. Sorry about that!
>
_______________________________________________
Cvs-ghc mailing list
[email protected]
http://www.haskell.org/mailman/listinfo/cvs-ghc