On Mon, Jul 15, 2013 at 4:50 AM, Nick Wellnhofer <[email protected]> wrote:
> Micro-benchmarks are always a bit dangerous, but it seems that on modern
> CPUs our current implementation of method dispatch is really fast and
> probably hard to beat. I think a loop with a single method call took about
> 6-7 cycles per iteration (Ivy Bridge, x64). This is surprising since
> computing the method address alone requires three memory loads with a
> latency of 4-5 cycles each (the third load depending on the first two). But
> because of branch prediction, the CPU can take a speculative branch
> immediately without having to wait for the memory loads. The branch target
> is validated later, so the loads can be pipelined.
Here's an Intel Xeon E5430 from a few years ago running 32-bit CentOS 5.0:
http://en.wikipedia.org/wiki/Xeon#5400-series_.22Harpertown.22
$ make -f Makefile.linux
LD_LIBRARY_PATH=. ./exe
cycles/call with method ptr loop: 11.226522
cycles/call with wrapper loop: 14.781987
cycles/call with fixed offset wrapper loop: 10.538704
cycles/call with wrapper: 19.622476
cycles/call with simulated inline: 7.702859
Here's a more recent Intel Xeon E5620 running 64-bit CentOS 5.5:
http://en.wikipedia.org/wiki/Xeon#3600.2F5600-series_.22Gulftown.22_.26_.22Westmere-EP.22
$ make -f Makefile.linux
LD_LIBRARY_PATH=. ./exe
cycles/call with method ptr loop: 7.014678
cycles/call with wrapper loop: 7.016887
cycles/call with fixed offset wrapper loop: 7.014423
cycles/call with wrapper: 10.520168
cycles/call with simulated inline: 2.339327
What's interesting about those results is that on the more recent CPU the
micro-benchmark yields essentially identical results (about 7.0 cycles/call)
for C++ style fixed offset vtable dispatch, Clownfish-style variable offset
vtable dispatch, and a saved raw function pointer, while on the older CPU the
Clownfish-style variable offset dispatch performs somewhat worse (about 14.8
cycles/call versus 10.5 to 11.2).
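To make the three strategies concrete, here is a minimal sketch of how each
one locates the method. It is not the benchmark source and not the Clownfish
API: every name in it (Class, Obj, Get_Value_OFFSET, bootstrap_class, and so
on) is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* All names here are hypothetical; this is not the Clownfish API. */
typedef struct Obj Obj;
typedef int (*get_value_t)(Obj *self);

/* A class: some header data followed by a table of method pointers. */
typedef struct Class {
    const char *name;
    get_value_t methods[4];
} Class;

struct Obj {
    Class *klass;
    int    value;
};

/* Fixed offset: the slot index is a compile-time constant. */
enum { GET_VALUE_SLOT = 2 };

/* Variable offset: the byte offset lives in a global set at load time. */
static size_t Get_Value_OFFSET;

static int get_value_impl(Obj *self) { return self->value; }

/* Stand-in for load-time bootstrapping of the class. */
static void bootstrap_class(Class *klass) {
    klass->methods[GET_VALUE_SLOT] = get_value_impl;
    Get_Value_OFFSET = offsetof(Class, methods)
                       + GET_VALUE_SLOT * sizeof(get_value_t);
}

/* C++ style: load the class pointer, then the slot at a constant offset. */
static int get_value_fixed(Obj *self) {
    return self->klass->methods[GET_VALUE_SLOT](self);
}

/* Variable offset: load the class pointer and the offset variable, then
 * the method pointer, which depends on both of the earlier loads. */
static int get_value_variable(Obj *self) {
    assert(Get_Value_OFFSET != 0);  /* bootstrap_class must have run */
    char *base = (char *)self->klass;
    get_value_t method = *(get_value_t *)(base + Get_Value_OFFSET);
    return method(self);
}

/* Saved raw pointer: resolve once, then call through the pointer. */
static get_value_t resolve_get_value(Obj *self) {
    return self->klass->methods[GET_VALUE_SLOT];
}
```

A caller would run bootstrap_class() once, then either call
get_value_fixed() or get_value_variable() per iteration, or hoist
resolve_get_value() out of the loop and call the returned pointer directly.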
For what it's worth, the Clownfish "inside-out vtable" design is similar to
the techniques described by Dachuan Yu et al. in 2002 -- benchmarks here:
https://www.usenix.org/legacy/events/javavm02/yu/yu_html/node29.html
> My guess is that most of the time this happens, there will be another
> non-compatible API change anyway.
I would argue that there is still a large benefit to user interface simplicity
in making it impossible to break the ABI without also breaking the API.
Eliminating this last quirk is a big deal because it substantially reduces the
knowledge and mental effort required to write ABI-compatible code.
> OK, but this will mean to add MethodSpec structs for every method of a
> class. I think it's best to use separate structs for novel, overridden, and
> inherited methods then.
I realize that may be a lot of code, but until load-time latency becomes a
problem, I think it's an acceptable implementation strategy.
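For the sake of discussion, here is a rough sketch of what separate spec
structs for novel, overridden, and inherited methods might look like, along
with the load-time pass that would consume them. The struct names, fields,
and the assign_offsets() helper are all assumptions on my part, not existing
code.

```c
#include <stddef.h>

/* Hypothetical layout; names and fields are illustrative only. */
typedef void (*method_t)(void);  /* generic function pointer */

/* Novel: first declared in this class, so it needs a fresh slot. */
typedef struct NovelMethSpec {
    size_t     *offset;  /* offset variable to fill in at load time */
    const char *name;    /* method name */
    method_t    func;    /* implementation */
} NovelMethSpec;

/* Overridden: new implementation, but it reuses the parent's slot. */
typedef struct OverriddenMethSpec {
    size_t       *offset;
    const size_t *parent_offset;
    method_t      func;
} OverriddenMethSpec;

/* Inherited: nothing new, only the parent's offset is copied. */
typedef struct InheritedMethSpec {
    size_t       *offset;
    const size_t *parent_offset;
} InheritedMethSpec;

/* Fill in every offset variable for one class. Inherited and overridden
 * methods take the parent's offset; each novel method takes the next free
 * slot. Returns the first offset past the novel methods. */
static size_t assign_offsets(size_t next_free,
                             const NovelMethSpec *novel, size_t n_novel,
                             const OverriddenMethSpec *over, size_t n_over,
                             const InheritedMethSpec *inh, size_t n_inh) {
    for (size_t i = 0; i < n_inh; i++) {
        *inh[i].offset = *inh[i].parent_offset;
    }
    for (size_t i = 0; i < n_over; i++) {
        *over[i].offset = *over[i].parent_offset;
    }
    for (size_t i = 0; i < n_novel; i++) {
        *novel[i].offset = next_free;
        next_free += sizeof(method_t);
    }
    return next_free;
}
```

The cost is one struct per method per class plus one loop iteration each at
load time, which is the latency trade-off mentioned above.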
Would you like to work on this, or would you like me to take it on?
Marvin Humphrey