Hi Nick,

> > It is easy to construct testcases for which this works, but sadly I was
> > unable to get the optimization to occur even once in a trampoline heavy
> > real-world program without jacking up the inlining limit hugely (to
> > 100000; 10000 wasn't enough). Still, I'm hoping that it may sometimes
> > be useful.
>
> Why does changing the inliner threshold help? Are you talking about
> inlining the trampoline itself or the llvm.init.trampoline intrinsic?
> Could we adjust the weights in the inliner so that it does it more
> often, seeing as a trampoline call is much more expensive than a regular
> call?
>
> Would you mind sending a small before and after example for the
> transformation implemented by your instcombine patch?
Consider this example:

	int process(int (*func)(int)) {
	  return func(1);
	}

	int nest(int n) {
	  int f(int m) { return m == n; }
	  return process(&f);
	}

A pointer to the nested function "f" is passed to "process". Note that f
accesses "n", an argument to the parent function "nest". When process
calls f, somehow f has to be able to get the value of n, even though n
wasn't passed to process. This is done using black magic: process is not
in fact passed a pointer to f. Instead, a small piece of code is
constructed on the stack of nest, and a pointer to this code (the
trampoline) is passed to process instead. The trampoline code calls f,
passing it any extra information it needs, such as n. To see exactly how
this works, consider the corresponding bitcode:

	%struct.FRAME.nest = type { i32, i32 (i32)* }
	%struct.__builtin_trampoline = type { [10 x i8] }

	declare i8* @llvm.init.trampoline(i8*, i8*, i8*)

	define i32 @process(i32 (i32)* %func) {
	entry:
		%tmp2 = tail call i32 %func( i32 1 )		; <i32> [#uses=1]
		ret i32 %tmp2
	}

	define internal i32 @f.934(%struct.FRAME.nest* nest %CHAIN.1, i32 %m) {
	entry:
		%tmp3 = getelementptr %struct.FRAME.nest* %CHAIN.1, i32 0, i32 0		; <i32*> [#uses=1]
		%tmp4 = load i32* %tmp3, align 4		; <i32> [#uses=1]
		%tmp7 = icmp eq i32 %tmp4, %m		; <i1> [#uses=1]
		%tmp78 = zext i1 %tmp7 to i32		; <i32> [#uses=1]
		ret i32 %tmp78
	}

	define i32 @nest(i32 %n) {
	entry:
		%FRAME.0 = alloca %struct.FRAME.nest, align 8		; <%struct.FRAME.nest*> [#uses=3]
		%TRAMP.216 = alloca [10 x i8], align 16		; <[10 x i8]*> [#uses=1]
		%TRAMP.216.sub = getelementptr [10 x i8]* %TRAMP.216, i32 0, i32 0		; <i8*> [#uses=1]
		%tmp3 = getelementptr %struct.FRAME.nest* %FRAME.0, i32 0, i32 0		; <i32*> [#uses=1]
		store i32 %n, i32* %tmp3, align 8
		%FRAME.06 = bitcast %struct.FRAME.nest* %FRAME.0 to i8*		; <i8*> [#uses=1]
		%tramp = call i8* @llvm.init.trampoline( i8* %TRAMP.216.sub, i8* bitcast (i32 (%struct.FRAME.nest* nest , i32)* @f.934 to i8*), i8* %FRAME.06 )		; <i8*> [#uses=1]
		%tmp7 = getelementptr %struct.FRAME.nest* %FRAME.0, i32 0, i32 1		; <i32 (i32)**> [#uses=1]
		%tmp89 = bitcast i8* %tramp to i32 (i32)*		; <i32 (i32)*> [#uses=2]
		store i32 (i32)* %tmp89, i32 (i32)** %tmp7, align 8
		%tmp13 = call i32 @process( i32 (i32)* %tmp89 )		; <i32> [#uses=1]
		ret i32 %tmp13
	}

First note that the nested function f has been turned into @f.934 and
all of a sudden has an extra parameter, %CHAIN.1. This parameter is used
to pass in the arguments and variables (such as n) of the parent
function nest. If you look closely you will see that it is a pointer to
a struct that holds both n and a function pointer suitable for calling f
(in more complicated situations this function pointer needs to live in
the frame struct, but here it could just as well have been another
variable on nest's stack). This function pointer is %tramp, the result
of the @llvm.init.trampoline call. It is %tramp that is passed to
process, after being bitcast to the right type.

Now take a look at this part of nest after inlining @process:

	%tramp = call i8* @llvm.init.trampoline( i8* %TRAMP.216.sub, i8* bitcast (i32 (%struct.FRAME.nest* nest , i32)* @f.934 to i8*), i8* %FRAME.06 )		; <i8*> [#uses=1]
	%tmp7 = getelementptr %struct.FRAME.nest* %FRAME.0, i32 0, i32 1		; <i32 (i32)**> [#uses=1]
	%tmp89 = bitcast i8* %tramp to i32 (i32)*		; <i32 (i32)*> [#uses=2]
	store i32 (i32)* %tmp89, i32 (i32)** %tmp7, align 8
	%tmp2.i = call i32 %tmp89( i32 1 )		; <i32> [#uses=1]

The use of a trampoline is now pointless: since we are inside nest, the
extra argument to f is directly available. So we can rewrite the
"call i32 %tmp89( i32 1 )" as a direct call to f. This is what the new
optimization does.
Here is the result:

	%tramp = call i8* @llvm.init.trampoline( i8* %TRAMP.216.sub, i8* bitcast (i32 (%struct.FRAME.nest* nest , i32)* @f.934 to i8*), i8* %FRAME.06 )		; <i8*> [#uses=1]
	%tmp7 = getelementptr %struct.FRAME.nest* %FRAME.0, i32 0, i32 1		; <i32 (i32)**> [#uses=1]
	%tmp89 = bitcast i8* %tramp to i32 (i32)*		; <i32 (i32)*> [#uses=1]
	store i32 (i32)* %tmp89, i32 (i32)** %tmp7, align 8
	%nest = bitcast i8* %FRAME.06 to %struct.FRAME.nest*		; <%struct.FRAME.nest*> [#uses=1]
	%tmp2.i1 = call i32 @f.934( %struct.FRAME.nest* %nest nest , i32 1 )		; <i32> [#uses=1]

The call through the trampoline has been replaced by a direct call to
@f.934, with the extra argument it requires spliced in. Note that the
call to @llvm.init.trampoline has not been removed, even though it is
now useless. I hope to do something about this later.

Let me now explain the inlining problem. First note that in practice the
only reason to take a pointer to a nested function is to pass it to some
outside routine like process - otherwise you could just call the nested
function directly. Second, in order to simplify a trampoline call, the
extra arguments ("n") to the nested function ("f") have to be available
at the point of the call. This is only going to happen if process (or
whatever) has been inlined into the parent function ("nest"). But
inlining is only going to happen if the outside function is small.

The Ada containers library expects users to pass in function pointers,
and in many situations you are obliged to pass in a pointer to a nested
function. Since the gcc Ada front-end implements these pointers using
trampolines, using the container library results in a lot of trampoline
calls. (The front-end could have made use of "fat pointers" instead,
avoiding trampolines; I hear that they plan to do this at some point
precisely because of the cost of trampolines, but for the moment it's
trampolines.) My hope was that some of these trampoline calls could be
eliminated using this optimization.
But for that to happen, the routines in the container library have to be
small enough to be inlined. They are not: they are in fact rather small
in terms of lines of code, but exception handling etc. really bulks up
the LLVM bitcode, making the inliner pass them over.

Ciao,

Duncan.
_______________________________________________
llvm-commits mailing list
llvm-commits@cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits