Greets,

I've been researching ways to improve Clownfish dynamic method dispatch, and
I wanted to bring an idea to lucy-dev.

Right now, a typical autogenerated method invocation wrapper looks something
like this:

    extern size_t Foo_Add_One_OFFSET;
    static inline int64_t
    Foo_Add_One(Foo *self, int64_t arg1) {
        char *address = (char*)self->vtable + Foo_Add_One_OFFSET;
        Foo_Add_One_t method = *((Foo_Add_One_t*)address);
        return method(self, arg1);
    }

The technique has served us well, but it's not without its drawbacks.

*   The wrapper source code takes up a lot of space in our headers.
*   There are a fair number of global offset variables.
*   The global offset variables require multiple levels of indirection when
    accessed in position-independent code.
*   A small amount of space is required at each method invocation site for
    the inlined vtable lookup.

Here's an alternative approach:

First, in the headers, declare method invocation wrappers as real functions
rather than define them as static inline functions.

    int64_t
    Foo_Add_One(Foo *self, int64_t arg1);

Next, create a handful of "thunk" wrapper functions -- one per offset --
which do nothing but extract a vtable method at a compile-time offset and
jump to it for a tail call.

    // C

    void
    cfish_thunk_32(Obj *self) {
        char *address = (char*)self->vtable;
        address += 32;
        cfish_method_t method = *((cfish_method_t*)address);
        method(self);
    }

    # x86-64 assembly

    cfish_thunk_32:
            movq    8(%rdi), %rax   # rax = self->vtable
            movq    32(%rax), %rax  # rax = method at offset 32
            jmp     *%rax           # tail call
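
To make "one per offset" a little more concrete, here is a minimal sketch in
plain C.  It is an illustration under stated assumptions, not the actual
Clownfish layout: the `VTable` and `Obj` structs, the `counter` field, the
`CFISH_DEFINE_THUNK` macro, and the `Foo_Bump_IMP`/`Bar_Bump_IMP` methods are
all hypothetical, and the byte offsets assume an LP64 target.  It only covers
methods whose signature really is `void (*)(Obj*)`; sharing a thunk across
differing signatures depends on the assembly-level tail jump and cannot be
expressed in standard C.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Obj Obj;
typedef void (*cfish_method_t)(Obj *self);

/* Hypothetical vtable layout: two words of metadata, then method slots. */
typedef struct VTable {
    struct VTable  *parent;
    size_t          obj_alloc_size;
    cfish_method_t  methods[4];
} VTable;

/* Hypothetical object layout: the vtable pointer sits in the second word,
 * matching the 8(%rdi) load in the assembly above on an LP64 target. */
struct Obj {
    size_t  refcount;
    VTable *vtable;
    int     counter;
};

/* The byte offsets used below assume LP64 (8-byte pointers and size_t). */
_Static_assert(offsetof(VTable, methods) == 16, "sketch assumes LP64");

/* Stamp out one thunk per byte offset.  Each thunk loads the method
 * pointer stored in the slot at OFFSET and calls it. */
#define CFISH_DEFINE_THUNK(OFFSET)                                   \
    void                                                             \
    cfish_thunk_##OFFSET(Obj *self) {                                \
        char *address = (char*)self->vtable + (OFFSET);              \
        cfish_method_t method = *(cfish_method_t*)address;           \
        method(self);                                                \
    }

CFISH_DEFINE_THUNK(16)
CFISH_DEFINE_THUNK(24)
CFISH_DEFINE_THUNK(32)
CFISH_DEFINE_THUNK(40)

/* Two hypothetical classes that each put a different method in the slot
 * at offset 32 (methods[2]). */
static void Foo_Bump_IMP(Obj *self) { self->counter += 1; }
static void Bar_Bump_IMP(Obj *self) { self->counter += 10; }

static VTable FOO = { NULL, sizeof(Obj), { NULL, NULL, Foo_Bump_IMP, NULL } };
static VTable BAR = { &FOO, sizeof(Obj), { NULL, NULL, Bar_Bump_IMP, NULL } };

/* Dispatch through the same thunk on one object of each class and return
 * the sum of the two counters. */
static int
thunk_demo(void) {
    Obj foo = { 1, &FOO, 0 };
    Obj bar = { 1, &BAR, 0 };
    cfish_thunk_32(&foo);   /* reaches Foo_Bump_IMP */
    cfish_thunk_32(&bar);   /* reaches Bar_Bump_IMP */
    assert(foo.counter == 1);
    assert(bar.counter == 10);
    return foo.counter + bar.counter;
}
```

Four thunks cover every method in this toy vtable, however many classes fill
those slots.  A C compiler is not obliged to turn `method(self)` into a jump,
which is one reason the real thunks would probably be written in assembly.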

Last, find a way to persuade the dynamic loader to resolve all method
invocation functions at a given offset to a single compiled thunk,
**regardless of the method signature**.

If I've thought this through correctly, the technique should work because
all those method invocation wrappers would have compiled down to exactly the
same tail-call assembly anyway.  The work of setting up the argument list is
already done at the invocation site and it depends on the signature, not the
wrapper code; the thunk only loads a function pointer into a scratch register
and jumps, leaving the argument registers and the stack untouched.
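
As a rough illustration of that point, the sketch below defines two typed
wrappers with unrelated signatures whose methods happen to share a vtable
offset.  Everything in it is hypothetical (the `VTable`, `Foo`, and `Bar`
layouts, `Bar_Scale`, `SHARED_OFFSET`, and the `lookup` helper); it does not
prove that the generated machine code is identical, it only shows that the
lookup half of each wrapper is the same and that the signature shows up
solely in the cast and in the arguments the caller supplies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Generic slot type: every method pointer is stored as this and cast back
 * to its real type before being called. */
typedef void (*cfish_method_t)(void);

typedef struct VTable {
    cfish_method_t methods[4];
} VTable;

typedef struct Foo { VTable *vtable; } Foo;
typedef struct Bar { VTable *vtable; double value; } Bar;

typedef int64_t (*Foo_Add_One_t)(Foo *self, int64_t arg1);
typedef double  (*Bar_Scale_t)(Bar *self, double factor, double bias);

/* Both methods live in slot 2, so they share one byte offset. */
#define SHARED_OFFSET (offsetof(VTable, methods) + 2 * sizeof(cfish_method_t))

/* The signature-independent part: load the pointer stored at the offset. */
static cfish_method_t
lookup(VTable *vtable, size_t offset) {
    return *(cfish_method_t*)((char*)vtable + offset);
}

static inline int64_t
Foo_Add_One(Foo *self, int64_t arg1) {
    Foo_Add_One_t method = (Foo_Add_One_t)lookup(self->vtable, SHARED_OFFSET);
    return method(self, arg1);
}

static inline double
Bar_Scale(Bar *self, double factor, double bias) {
    Bar_Scale_t method = (Bar_Scale_t)lookup(self->vtable, SHARED_OFFSET);
    return method(self, factor, bias);
}

static int64_t
Foo_Add_One_IMP(Foo *self, int64_t arg1) {
    (void)self;
    return arg1 + 1;
}

static double
Bar_Scale_IMP(Bar *self, double factor, double bias) {
    return self->value * factor + bias;
}

static VTable FOO = { { NULL, NULL, (cfish_method_t)Foo_Add_One_IMP, NULL } };
static VTable BAR = { { NULL, NULL, (cfish_method_t)Bar_Scale_IMP,   NULL } };

/* Returns 1 if both wrappers reach the right implementation. */
static int
same_offset_demo(void) {
    Foo foo = { &FOO };
    Bar bar = { &BAR, 2.0 };
    int64_t sum    = Foo_Add_One(&foo, 41);      /* 41 + 1 */
    double  scaled = Bar_Scale(&bar, 2.0, 0.5);  /* 2.0 * 2.0 + 0.5 */
    return sum == 42 && scaled == 4.5;
}
```

In the proposed scheme, the two wrapper bodies would collapse into a single
thunk for that offset, and only the declarations would stay per-method.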

The part of this scheme I haven't fully worked out yet is the aliasing and
dynamic loading.  Exactly which thunk each method invocation wrapper symbol
needs to resolve to can't be determined until shared object load time.  I
imagine we'll need to generate platform-specific code for ELF, Mach-O, and
PE -- if it's even possible to make things work everywhere.

I also don't know what impact on speed, if any, such a change would have.
The thunks should occupy less memory in total than the offset variables, and
they can be compiled together in the same shared object so they should
exhibit better locality -- and thus perhaps cache performance will improve
marginally.  If my assessment of the assembly is correct, we save one level
of indirection compared to the current scheme but we add a jump; I don't
know how that might impact pipelining.  My guess is that it's a wash, but it
would be good to run some benchmarks.

In any case, I think the idea may be worth pursuing if only for the sake of
header size and code size.

Marvin Humphrey
