Hi Nick,

> > It is easy to construct testcases for which this works, but sadly I was
> > unable to get the optimization to occur even once in a trampoline heavy
> > real-world program without jacking up the inlining limit hugely (to
> > 100000; 10000 wasn't enough). Still, I'm hoping that it may sometimes
> > be useful.
>
> Why does changing the inliner threshold help? Are you talking about
> inlining the trampoline itself or the llvm.init.trampoline intrinsic?
> Could we adjust the weights in the inliner so that it does it more
> often, seeing as a trampoline call is much more expensive than a regular
> call?
>
> Would you mind sending a small before and after example for the
> transformation implemented by your instcombine patch?
Consider this example:

	int process(int (*func)(int)) {
	  return func(1);
	}

	int nest(int n) {
	  int f(int m) { return m == n; }
	  return process(&f);
	}

A pointer to the nested function "f" is passed to "process". Note that f
accesses "n", an argument to the parent function "nest". When process
calls f, somehow f has to be able to get the value of n, even though n
wasn't passed to process. This is done using black magic: process is not
in fact passed a pointer to f. Instead, a small piece of code is
constructed on the stack of nest, and a pointer to this code (the
trampoline) is passed to process instead. The trampoline code calls f,
passing it any extra information it needs, such as n. To see exactly how
this works, consider the corresponding bitcode:

	%struct.FRAME.nest = type { i32, i32 (i32)* }
	%struct.__builtin_trampoline = type { [10 x i8] }

	declare i8* @llvm.init.trampoline(i8*, i8*, i8*)

	define i32 @process(i32 (i32)* %func) {
	entry:
		%tmp2 = tail call i32 %func( i32 1 )		; <i32> [#uses=1]
		ret i32 %tmp2
	}

	define internal i32 @f.934(%struct.FRAME.nest* nest %CHAIN.1, i32 %m) {
	entry:
		%tmp3 = getelementptr %struct.FRAME.nest* %CHAIN.1, i32 0, i32 0		; <i32*> [#uses=1]
		%tmp4 = load i32* %tmp3, align 4		; <i32> [#uses=1]
		%tmp7 = icmp eq i32 %tmp4, %m		; <i1> [#uses=1]
		%tmp78 = zext i1 %tmp7 to i32		; <i32> [#uses=1]
		ret i32 %tmp78
	}

	define i32 @nest(i32 %n) {
	entry:
		%FRAME.0 = alloca %struct.FRAME.nest, align 8		; <%struct.FRAME.nest*> [#uses=3]
		%TRAMP.216 = alloca [10 x i8], align 16		; <[10 x i8]*> [#uses=1]
		%TRAMP.216.sub = getelementptr [10 x i8]* %TRAMP.216, i32 0, i32 0		; <i8*> [#uses=1]
		%tmp3 = getelementptr %struct.FRAME.nest* %FRAME.0, i32 0, i32 0		; <i32*> [#uses=1]
		store i32 %n, i32* %tmp3, align 8
		%FRAME.06 = bitcast %struct.FRAME.nest* %FRAME.0 to i8*		; <i8*> [#uses=1]
		%tramp = call i8* @llvm.init.trampoline( i8* %TRAMP.216.sub, i8* bitcast (i32 (%struct.FRAME.nest* nest , i32)* @f.934 to i8*), i8* %FRAME.06 )		; <i8*> [#uses=1]
		%tmp7 = getelementptr %struct.FRAME.nest* %FRAME.0, i32 0, i32 1		; <i32 (i32)**> [#uses=1]
		%tmp89 = bitcast i8* %tramp to i32 (i32)*		; <i32 (i32)*> [#uses=2]
		store i32 (i32)* %tmp89, i32 (i32)** %tmp7, align 8
		%tmp13 = call i32 @process( i32 (i32)* %tmp89 )		; <i32> [#uses=1]
		ret i32 %tmp13
	}

First note that the nested function f has been turned into @f.934 and
all of a sudden has an extra parameter, %CHAIN.1. This parameter is used
to pass in the arguments and variables (such as n) of the parent
function nest. If you look closely you will see that it is a pointer to
a struct that holds both n and a function pointer suitable for calling f
(in more complicated situations this function pointer needs to live in
the frame struct, but here it could just as well have been another
variable on nest's stack). This function pointer is %tramp, the result
of the @llvm.init.trampoline call. It is %tramp that is passed to
process, after being bitcast to the right type.

Now take a look at this part of nest after inlining @process:

	%tramp = call i8* @llvm.init.trampoline( i8* %TRAMP.216.sub, i8* bitcast (i32 (%struct.FRAME.nest* nest , i32)* @f.934 to i8*), i8* %FRAME.06 )		; <i8*> [#uses=1]
	%tmp7 = getelementptr %struct.FRAME.nest* %FRAME.0, i32 0, i32 1		; <i32 (i32)**> [#uses=1]
	%tmp89 = bitcast i8* %tramp to i32 (i32)*		; <i32 (i32)*> [#uses=2]
	store i32 (i32)* %tmp89, i32 (i32)** %tmp7, align 8
	%tmp2.i = call i32 %tmp89( i32 1 )		; <i32> [#uses=1]

The use of a trampoline is now pointless: since we are inside nest, the
extra argument to f is directly available. So we can rewrite the
"call i32 %tmp89( i32 1 )" as a direct call to f. This is what the new
optimization does.
Here is the result:

	%tramp = call i8* @llvm.init.trampoline( i8* %TRAMP.216.sub, i8* bitcast (i32 (%struct.FRAME.nest* nest , i32)* @f.934 to i8*), i8* %FRAME.06 )		; <i8*> [#uses=1]
	%tmp7 = getelementptr %struct.FRAME.nest* %FRAME.0, i32 0, i32 1		; <i32 (i32)**> [#uses=1]
	%tmp89 = bitcast i8* %tramp to i32 (i32)*		; <i32 (i32)*> [#uses=1]
	store i32 (i32)* %tmp89, i32 (i32)** %tmp7, align 8
	%nest = bitcast i8* %FRAME.06 to %struct.FRAME.nest*		; <%struct.FRAME.nest*> [#uses=1]
	%tmp2.i1 = call i32 @f.934( %struct.FRAME.nest* %nest nest , i32 1 )		; <i32> [#uses=1]

The call through the trampoline has been replaced by a direct call to
@f.934, with the extra argument it requires spliced in. Note that the
call to @llvm.init.trampoline has not been removed, even though it is
now useless. I hope to do something about this later.

Let me now explain the inlining problem. First note that in practice the
only reason to take a pointer to a nested function is to pass it to some
outside routine like process - otherwise you could just call the nested
function directly. Second, in order to simplify a trampoline call, the
extra arguments ("n") to the nested function ("f") have to be available
at the point of the call. This is only going to happen if process (or
whatever) has been inlined into the parent function ("nest"). But
inlining is only going to happen if the outside function is small.

The Ada containers library expects users to pass in function pointers,
and in many situations you are obliged to pass in a pointer to a nested
function. Since the gcc Ada front-end implements these pointers using
trampolines, using the container library results in a lot of trampoline
calls. (The front-end could have made use of "fat pointers" instead,
avoiding trampolines; I hear that they plan to do this at some point
precisely because of the cost of trampolines, but for the moment it's
trampolines.) My hope was that some of these trampoline calls could be
eliminated using this optimization.
But for that to happen, the routines in the container library have to be
small enough to be inlined. They are not: they are in fact rather small
in terms of lines of code, but exception handling etc. really bulks up
the LLVM bitcode, making the inliner pass them over.

Ciao,

Duncan.
_______________________________________________
llvm-commits mailing list
llvm-commits@cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits