On Tue, May 28, 2013 at 11:38 AM, Marvin Humphrey <[email protected]> wrote:

> I've been researching ways to improve Clownfish dynamic method dispatch, and
> I wanted to bring an idea to lucy-dev.
Hi Guys --

I've been working on some similar things lately, so this might be useful.
I've been thinking in terms of x64 Linux, though, so in addition to simply
being wrong, it's quite possible I'll presume things are the same elsewhere
when they are not.

> Here's an alternative approach:
>
> First, in the headers, declare method invocation wrappers as real functions
> rather than define them as static inline functions.
>
>     int64_t
>     Foo_Add_One(Foo *self, int64_t arg1);

OK. And the part of "real function" we care about here is that the function
exists as an entry in the symbol table? Likely global, but the exact type
will be determined by how we can make things work with each object file
format.

> Next, create a handful of "thunk" wrapper functions -- one per offset --
> which do nothing but extract a vtable method at a compile-time offset and
> jump to it for a tail call.

OK. Each thunk takes a single argument, "self", and then tail-calls a
function at a set offset from it. This saves us one level of indirection.

On Mac and Linux, the first (non-XMM) argument arrives in register %rdi. On
Windows, it will be in %rcx. If we leave it and all the other
argument-passing registers untouched, the function we call will find its
arguments just as if it had been called directly.

This might be a helpful page when it comes to making this multiplatform:

http://www.agner.org/optimize/calling_conventions.pdf

> Last, find a way to persuade the dynamic loader to resolve all method
> invocation functions at a given offset to a single compiled thunk,
> **regardless of the method signature**.

Yes, we "just" need to convince the loader to fill in the address of each
class-specific symbol with our fixed-offset thunk.

On the bright side, the "regardless" part should be easy. I don't think the
dynamic loader pays any attention to signatures. In C++ the signature would
be encoded into the method name as part of the name mangling, but for C and
assembly a symbol is just a location.
> If I've thought this through correctly, the technique should work because
> all those method invocation wrappers would have compiled down to exactly the
> same tail-call assembly anyway. The work of setting up the argument list is
> already done at the invocation site and it depends on the signature, not the
> wrapper code.

Exactly, although as Nick points out below, you would get better speed with
an individual thunk per method.

> I also don't know what impact on speed, if any, ...

Smaller code is a definite win on modern processors. It's not just the size
of the L1 instruction cache: loops can be executed out of an even smaller
micro-op cache, and if you can fit a tight loop within that, it can be a
large win.

Skipping the indirection should help as well. A correctly predicted jump
should only add 1 cycle, whereas a read from L1 takes 4.

On Thu, Jun 13, 2013 at 9:13 PM, Marvin Humphrey <[email protected]> wrote:
> On Tue, May 28, 2013 at 12:39 PM, Nick Wellnhofer <[email protected]> wrote:

> Hmm... And what we would really need to do may be even harder: we need to
> assign aliases *at runtime*, when the DSO loads.

Does it _need_ to do that, or would it be useful to start by aiming for just
compile time?

> Maybe it's impossible. But it sure is fun to mess around with such wacky
> low-level hacks!

I think it's definitely possible; it's just a question of how ugly and
low-level it will have to be. My guess is that it can be quite clean, but
will require something different for each platform. Worst case, it involves
dynamically creating, writing, and loading an object file.

>> The program below works for me on OS X when compiled with
>>
>>     cc -O2 -Wl,-alias,_thunk1,_Obj_Add,-alias,_thunk2,_Obj_Sub alias.c -o alias

Sadly, it took me several read-throughs to notice the difference in
capitalization, but I think I understand the example now. I have a short
memory.
I presume it was unintentional, but the implicit 'int' return type for
Obj_add() shows Marvin's point about signatures being irrelevant.

> Thank you for the example. It was enough to get past the hump and create a
> proof of concept branch which works on OS X 10.8 (and probably only that
> specific OS version). I've committed it as `LUCY-256-thunk-hack1`.

I resorted to cutting-and-pasting URL parts, but I think this is the diff
from master:

https://git-wip-us.apache.org/repos/asf?p=lucy.git;a=commitdiff;h=a665f5aec756e3973e3633770897b10b26cad875;hp=5890545507e185f3201966dfac733210ae88343a

Is there an easier way to diff arbitrary commits via the web interface?

> Here's what gdb prints out for the assembler. I don't understand the `push`
> and `pop` instructions.
>
> Dump of assembler code for function cfish_thunk112:
> 0x0000000100075120 <cfish_thunk112+0>:  push   %rbp
> 0x0000000100075121 <cfish_thunk112+1>:  mov    %rsp,%rbp
> 0x0000000100075124 <cfish_thunk112+4>:  mov    0x8(%rdi),%rax
> 0x0000000100075128 <cfish_thunk112+8>:  pop    %rbp
> 0x0000000100075129 <cfish_thunk112+9>:  jmpq   *0x70(%rax)
> 0x000000010007512c <cfish_thunk112+12>: nopl   0x0(%rax)
> End of assembler dump.

I've been looking at quite a lot of output from GCC (more on why later), and
I would propose that the 'push' and 'pop' are just another example of the
many minor missed optimizations in GCC. It's rare that even a simple
function doesn't have some junk code in it.

%rbp is a "callee-save" register, which means our function has to leave it
the way we found it. GCC has decided to set up a frame pointer here --
pushing the old %rbp and copying the stack pointer %rsp into it -- probably
due to something that was optimized away, like the function call that got
replaced by the jump. But I think your intuition is correct, and it is not
needed.

The more I see of GCC's output, the more impressed I am that it ever works.
But it almost always does, just not optimally. I'm finding that Intel's
compiler usually produces less junk, although it too has quirks.

Quick syntax overview, in case it helps anyone.
GCC and the GNU tools use "AT&T" syntax instead of the alternative "Intel"
syntax. The most confusing difference between them is that the order of the
operands is reversed: AT&T is "instruction src, dest".

When referring to memory locations, there are several built-in "addressing
modes". In this case "mov 0x8(%rdi),%rax" means: look at the spot in memory
0x8 bytes after the value of %rdi and put what's there into %rax, i.e.
%rax = *(%rdi + 0x8).

There is a very similar instruction "lea" (load effective address) that
loads the address rather than the value: "lea 0x8(%rdi),%rcx" is equivalent
to %rcx = %rdi + 0x8 (no dereferencing). 'jmp' can go either way, hence the
dereferencing '*' in this case.

You can also add a second register to the mix to express an additional
offset: 0x4(%rax,%rdi) would be the address (%rax + %rdi + 0x4). And you can
use a "scale factor" of {1,2,4,8} that this index register is multiplied by.
The combination of these can often save a separate add and multiply.

These aren't just notations -- the processor supports them internally as
part of the mov, lea, call, and jmp instructions. Thus, unlike Marvin's top
example, there is no explicit 'add'.

>> A single thunk per offset might be bad for branch prediction. But this could
>> be worked around by providing separate thunks for each method.

I think this is correct. On current Intel, you lose ~15 cycles if your
branch is predicted incorrectly, and the prediction is based on the recent
history of that particular location in memory. Separate thunks would solve
this.

> I think we could evaluate that using cachegrind on c/t/test_lucy.

I don't think you can get reliable branch prediction information from
cachegrind. It uses an ultra-simplistic simulated prediction algorithm that
is not representative. But 'perf' on Linux makes getting real branch
prediction statistics very simple.

Long message. Sorry if the "lesson" aspect wasn't needed. Partly it's to
help me get it clear in my head.
Marvin, could you tell me more about the goal? In particular, can one assume
that all the DSOs are loaded at the same time, or would you like to handle
the case of a late load changing the hierarchy of earlier-loaded modules?

--nate
