I tried a few alternative implementations and found that passing the gct 
variable as a function parameter in the garbage collector performed best, 
with an average execution-time increase of 6% for GC-intensive programs 
compared to a gcc-compiled version.

The patches for passing the gct variable as a parameter are available in the 
clang-param-pass branch here:

    git://github.com/dmpots/ghc.git clang-param-pass

Because passing gct as a parameter is an invasive change, I tried a few other 
techniques first.

First, I tried changing the gct definition to call pthread_getspecific directly 
(instead of going through getThreadLocalVar), but that only improved 
performance by a few percent. The call to pthread_getspecific was still 
incurring the dynamic linking overhead.
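
For reference, that first attempt amounts to roughly the following. This is a 
minimal sketch, not the code from the branch: the one-field gc_thread struct 
and the init_gct and current_thread_index helpers are made up for 
illustration, and gctKey is assumed to be an ordinary pthread_key_t.

```c
#include <pthread.h>
#include <stddef.h>

/* Stand-in for the RTS's per-thread GC state (the real struct is much larger). */
typedef struct gc_thread_ { int thread_index; } gc_thread;

/* Assumed to be a plain pthread key holding this thread's gc_thread pointer. */
static pthread_key_t gctKey;

/* Every use of gct becomes a direct call to pthread_getspecific. */
#define gct ((gc_thread *)pthread_getspecific(gctKey))

/* Hypothetical helper: create the key and bind the calling thread's state.
   Returns 0 on success, or the pthread error code. */
static int init_gct(gc_thread *mine) {
  int rc = pthread_key_create(&gctKey, NULL);
  if (rc != 0) return rc;
  return pthread_setspecific(gctKey, mine);
}

/* Hypothetical reader: each access to gct costs one library call. */
static int current_thread_index(void) {
  return gct->thread_index;
}
```

The call itself cannot be inlined because pthread_getspecific lives in the 
system library, which is why the dynamic linking overhead remains.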

I then tried creating my "own" pthread_getspecific function that just contains 
the inline assembly to read the value as done in the pthread function. With 
this change I see the overhead drop to around 9% over the non-llvm-gcc version. 
The code for accessing the gct looked something like this:

    static inline gc_thread* __gct(void) {
      gc_thread *gct_tls;
      __asm__("movq %%gs:0x60(,%[key],8),%[gct_tls]" 
                : [gct_tls] "=r" (gct_tls) 
                : [key]     "r"  (gctKey));
      return gct_tls;
    }
    #define gct (__gct())
    #define DECLARE_GCT /* nothing */
    
In all the cases I saw, the __gct function was getting inlined correctly. 
Because I'm directly inlining assembly for the pthread function, I'm not sure 
how portable it is across Mac OS X versions, and I'm pretty sure it wouldn't 
work on Linux.
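
One way to contain the portability risk would be to keep the inline assembly 
behind an explicit opt-in and fall back to pthread_getspecific everywhere 
else. The sketch below assumes a made-up GCT_INLINE_TLS_ASM macro and a 
stand-in gc_thread struct; neither is in the patches.

```c
#include <pthread.h>
#include <stddef.h>

/* Stand-in for the RTS's per-thread GC state. */
typedef struct gc_thread_ { int id; } gc_thread;

static pthread_key_t gctKey;

#if defined(GCT_INLINE_TLS_ASM) && defined(__APPLE__) && defined(__x86_64__)
/* Opt-in fast path: the same instruction as above. The 0x60 offset is an
   implementation detail of one Libc version, so it is never enabled by default. */
static inline gc_thread *__gct(void) {
  gc_thread *gct_tls;
  __asm__("movq %%gs:0x60(,%[key],8),%[gct_tls]"
            : [gct_tls] "=r" (gct_tls)
            : [key]     "r"  (gctKey));
  return gct_tls;
}
#else
/* Portable fallback: slower, but works on any pthreads platform. */
static inline gc_thread *__gct(void) {
  return (gc_thread *)pthread_getspecific(gctKey);
}
#endif

#define gct (__gct())
```

With this arrangement the rest of the GC keeps using gct unchanged, and only 
builds that define the opt-in macro depend on the Libc layout.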

Finally, I tried changing the GC to pass the gct variable as a parameter, which 
further reduces the performance difference: the llvm-gcc version is about 6% 
slower than the gcc version on the nofib gc benchmarks and 3.5% slower on the 
fibon benchmarks. My initial (accidental) measurements with the non-threaded 
runtime showed similar numbers, so part of this overhead is just the 
difference in code generation between llvm and gcc.

To support both passing gct as a parameter and accessing it as a global 
variable, I added some macros that can be used with GC functions that access 
(or call functions that access) the gct. These macros add the gct as an extra 
parameter to the function if the PASS_GCT_AS_PARAM macro is defined. They are 
used like this:

    // declaration
    void someGcFunc(DECLARE_GCT_PARAM(orig_param_list))
    
    // call site
    someGcFunc(GCT_PARAM(orig_params))
    
It's a bit ugly to look at, but I couldn't think of a nice way to support both 
ways of accessing the gct.
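
To make the idea concrete, here is one way such macros could be defined. This 
is a sketch under assumptions, not the definitions from the branch: it uses 
C99 variadic macros, a stand-in gc_thread struct, and made-up functions 
(count_copied, scavenge_two, run_example).

```c
#include <stddef.h>

/* Stand-in for the RTS's per-thread GC state. */
typedef struct gc_thread_ { long copied; } gc_thread;

#define PASS_GCT_AS_PARAM 1   /* remove to fall back to the global variable */

#ifdef PASS_GCT_AS_PARAM
/* gct travels as an explicit first argument. */
#define DECLARE_GCT_PARAM(...) gc_thread *gct, __VA_ARGS__
#define GCT_PARAM(...)         gct, __VA_ARGS__
#else
/* A plain global for brevity; the real runtime uses a register variable or TLS. */
static gc_thread *gct;
#define DECLARE_GCT_PARAM(...) __VA_ARGS__
#define GCT_PARAM(...)         __VA_ARGS__
#endif

/* The function body is identical in both modes: it just refers to gct. */
static void count_copied(DECLARE_GCT_PARAM(long words)) {
  gct->copied += words;
}

/* A caller forwards gct to everything it calls that needs it. */
static void scavenge_two(DECLARE_GCT_PARAM(long a, long b)) {
  count_copied(GCT_PARAM(a));
  count_copied(GCT_PARAM(b));
}

/* Entry point: establishes gct once, either as a local or as the global. */
static long run_example(gc_thread *self) {
#ifdef PASS_GCT_AS_PARAM
  gc_thread *gct = self;   /* local named gct so GCT_PARAM picks it up */
#else
  gct = self;
#endif
  scavenge_two(GCT_PARAM(3, 4));
  return self->copied;
}
```

Note that this simple form needs at least one original parameter; a function 
that otherwise takes no arguments would need a separate macro pair to avoid a 
trailing comma.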


On Jul 3, 2011, at 11:23 AM, David Peixotto wrote:

> 
> On Jul 2, 2011, at 2:42 PM, Simon Marlow wrote:
>> On 02/07/11 19:34, David Peixotto wrote:
>> 
>>> The best we could hope for would be for an access of `gct` to turn
>>> into something like this in the GC:
>>> 
>>>    movq    (%rdi),%rdi #deref the key which is an index into the tls memory
>>>    movq    %gs:0x00000060(,%rdi,8),%rax # read the value indexed by the key
>>> 
>>> but it looks like we are getting something like this:
>>> 
>>>    call getThreadLocalVar
>>>    movq    (%rdi),%rdi #deref the key which is an index into the tls memory
>>>    jmp <dynamic_linker_stub>
>>>    movq    %gs:0x00000060(,%rdi,8),%rax #pthread_getspecific body
>>>    ret
>> 
>> you don't need to go through getThreadLocalVar, right?   Just call 
>> pthread_getspecific directly.  
> 
> Yeah, I can change it to be a direct call to pthread_getspecific. I was just 
> trying to reuse the existing GHC api for thread local storage, and I thought 
> the call would be inlined away.
> 
>> I don't know why it's going through the dynamic linker stub, I thought it 
>> was supposed to be #defined to the inline assembly.
> 
> I can't see any obvious definition in the header files on my machine. The 
> definition I found is an assembly file that is part of Apple's libc 
> implementation:
> 
> http://www.opensource.apple.com/source/Libc/Libc-594.9.5/x86_64/pthreads/pthread_getspecific.s
> 
> This definition seems to match what I see when I debug an executable in gdb.
> 
>> Anyway, the last resort will be to pass gct as a parameter to the critical 
>> functions in the GC - scavenge_block() and everything it calls transitively, 
>> including evacuate().  This is likely to give quite good performance, but 
>> not as good as a register variable, so unfortunately we'll need some 
>> #ifdefery or macros which will be quite ugly (hence why I say this is a last 
>> resort).
> 
> Ok, hopefully we won't have to resort to that, but I'm not too optimistic at 
> this point. If we are actually stuck dealing with the dynamic linker for 
> pthread_getspecific then the overhead is going to probably be too high.
> 
> 
> 
> _______________________________________________
> Cvs-ghc mailing list
> [email protected]
> http://www.haskell.org/mailman/listinfo/cvs-ghc
> 

