On Tue, May 28, 2013 at 11:38 AM, Marvin Humphrey <[email protected]> wrote:

> I've been researching ways to improve Clownfish dynamic method dispatch, and
> I wanted to bring an idea to lucy-dev.
Hi Guys --

I've been working on some similar things lately, so this might be useful.
I've been thinking in terms of x64 Linux, though, so in addition to simply
being wrong, it's quite possible I'll presume things are the same elsewhere
when they are not.

> Here's an alternative approach:
>
> First, in the headers, declare method invocation wrappers as real functions
> rather than define them as static inline functions.
>
>     int64_t
>     Foo_Add_One(Foo *self, int64_t arg1);

OK. And the part of "real function" we care about here is that the function
exists as an entry in the symbol table? Likely global, but the exact type
will be determined by how we can make things work with each object file
format.

> Next, create a handful of "thunk" wrapper functions -- one per offset --
> which do nothing but extract a vtable method at a compile-time offset and
> jump to it for a tail call.

OK. Each thunk takes a single argument, "self", and then tail-calls a
function at a set offset from it. This saves us one level of indirection.

On Mac and Linux, the first (non-XMM) argument arrives in register %rdi. On
Windows, it will be in %rcx. If we leave it and all the other
argument-passing registers untouched, the function we call will find its
arguments just as if it had been called directly.

This might be a helpful page when it comes to making this multiplatform:

http://www.agner.org/optimize/calling_conventions.pdf

> Last, find a way to persuade the dynamic loader to resolve all method
> invocation functions at a given offset to a single compiled thunk,
> **regardless of the method signature**.

Yes, we "just" need to convince the loader to fill in the address of each
class-specific symbol with our fixed-offset thunk.

On the bright side, the "regardless" part should be easy. I don't think the
dynamic loader pays any attention to signatures. In C++ the signature would
be encoded into the method name as part of the name mangling, but for C and
assembly a symbol is just a location.
> If I've thought this through correctly, the technique should work because
> all those method invocation wrappers would have compiled down to exactly the
> same tail-call assembly anyway. The work of setting up the argument list is
> already done at the invocation site and it depends on the signature, not the
> wrapper code.

Exactly, although as Nick points out below, you would get better speed with
an individual thunk per method.

> I also don't know what impact on speed, if any, ...

Smaller code is a definite win on modern processors. It's not just the size
of the L1 instruction cache: loops can be executed out of an even smaller
micro-op cache, and if you can fit a tight loop within that, it can be a
large win.

Skipping the indirection should help as well. A correctly predicted jump
should only add 1 cycle, whereas a read from L1 takes 4.

On Thu, Jun 13, 2013 at 9:13 PM, Marvin Humphrey <[email protected]> wrote:
> On Tue, May 28, 2013 at 12:39 PM, Nick Wellnhofer <[email protected]> wrote:

> Hmm... And what we would really need to do may be even harder: we need to
> assign aliases *at runtime*, when the DSO loads.

Does it _need_ to do that, or would it be useful to start by aiming for just
compile time?

> Maybe it's impossible. But it sure is fun to mess around with such wacky
> low-level hacks!

I think it's definitely possible; it's just a question of how ugly and
low-level it will have to be. My guess is that it can be quite clean, but
will require something different for each platform. Worst case, it involves
dynamically creating, writing, and loading an object file.

>> The program below works for me on OS X when compiled with
>>
>>     cc -O2 -Wl,-alias,_thunk1,_Obj_Add,-alias,_thunk2,_Obj_Sub alias.c -o alias

Sadly, it took me several read-throughs to notice the difference in
capitalization, but I think I understand the example now. I have a short
memory.
I presume it was unintentional, but the implicit 'int' return type for
Obj_add() shows Marvin's point about signatures being irrelevant.

> Thank you for the example. It was enough to get past the hump and create a
> proof of concept branch which works on OS X 10.8 (and probably only that
> specific OS version). I've committed it as `LUCY-256-thunk-hack1`.

I resorted to cutting-and-pasting URL parts, but I think this is the diff
from master:

https://git-wip-us.apache.org/repos/asf?p=lucy.git;a=commitdiff;h=a665f5aec756e3973e3633770897b10b26cad875;hp=5890545507e185f3201966dfac733210ae88343a

Is there an easier way to diff arbitrary commits via the web interface?

> Here's what gdb prints out for the assembler. I don't understand the `push`
> and `pop` instructions.
>
> Dump of assembler code for function cfish_thunk112:
> 0x0000000100075120 <cfish_thunk112+0>:  push   %rbp
> 0x0000000100075121 <cfish_thunk112+1>:  mov    %rsp,%rbp
> 0x0000000100075124 <cfish_thunk112+4>:  mov    0x8(%rdi),%rax
> 0x0000000100075128 <cfish_thunk112+8>:  pop    %rbp
> 0x0000000100075129 <cfish_thunk112+9>:  jmpq   *0x70(%rax)
> 0x000000010007512c <cfish_thunk112+12>: nopl   0x0(%rax)
> End of assembler dump.

I've been looking at quite a lot of output from GCC (more on why later), and
I would propose that the 'push' and 'pop' are just another example of the
many minor missed optimizations in GCC. It's rare that even a simple
function doesn't have some junk code in it.

%rbp is a "callee-save" register, which means our function has to leave it
the way we found it. GCC has decided to set up a frame pointer here --
pushing the old %rbp and copying the stack pointer %rsp into it -- probably
due to something that was optimized away, like the function call that got
replaced by the jump. But I think your intuition is correct, and it is not
needed.

The more I see of GCC's output, the more impressed I am that it ever works.
But it almost always does, just not optimally. I'm finding that Intel's
compiler usually produces less junk, although it too has quirks.

Quick syntax overview, in case it helps anyone.
GCC and the GNU tools use "AT&T" syntax instead of the alternative "Intel"
syntax. The most confusing difference between them is that the order of the
operands is reversed: AT&T is "instruction src, dest".

When referring to memory locations, there are several built-in "addressing
modes". In this case "mov 0x8(%rdi),%rax" means: look at the spot in memory
0x8 bytes after the value of %rdi and put what's there into %rax, i.e.
%rax = *(%rdi + 0x8).

There is a very similar instruction "lea" (load effective address) that
loads the address rather than the value: "lea 0x8(%rdi),%rcx" is equivalent
to %rcx = %rdi + 0x8 (no dereferencing). 'jmp' can go either way, hence the
dereferencing '*' in this case.

You can also add a second register to the mix to express an additional
offset: 0x4(%rax,%rdi) would be the address (%rax + %rdi + 0x4). And you can
use a "scale factor" of {1,2,4,8} that this index register is multiplied by.
The combination of these can often save a separate add and multiply.

These aren't just notations -- the processor supports them internally as
part of the mov, lea, call, and jmp instructions. Thus, unlike Marvin's top
example, there is no explicit 'add'.

>> A single thunk per offset might be bad for branch prediction. But this could
>> be worked around by providing separate thunks for each method.

I think this is correct. On current Intel, you lose ~15 cycles if your
branch is predicted incorrectly, and the prediction is based on the recent
history of that particular location in memory. Separate thunks would solve
this.

> I think we could evaluate that using cachegrind on c/t/test_lucy.

I don't think you can get reliable branch prediction information from
cachegrind. It uses an ultra-simplistic simulated prediction algorithm that
is not representative. But 'perf' on Linux makes getting real branch
prediction statistics very simple.

Long message. Sorry if the "lesson" aspect wasn't needed. Partly it's to
help me get it clear in my head.
Marvin, could you tell me more about the goal? In particular, can one assume
that all the DSOs are loaded at the same time, or would you like to handle
the case of a late load changing the hierarchy of earlier-loaded modules?

--nate
