On 7/20/2014 3:10 PM, Dmitry Olshansky wrote:
The computed goto is faster for two reasons, according to the article:
1. The switch does a bit more work per iteration because of bounds checking.
Now let's consider a proper implementation of a threaded-code interpreter, where the code pointer points into an array of addresses. We've been through this before, and it turns out the switch is slower because of an extra load.
a) Switch does 1 load for the opcode, 1 load from the jump table, and 1 indirect jump to advance (not even counting the switch's bounds check).
b) Threaded code via (say) computed goto does 1 load of the opcode and 1 indirect jump, because the opcode is already an address (so there is no separate jump table).
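To make the comparison concrete, here is a minimal sketch in GNU C (computed goto and label addresses are a GCC/Clang extension); the opcode set and the run_switch/run_threaded names are invented for illustration:

enum { OP_INC, OP_DEC, OP_HALT };

/* (a) switch dispatch: load the opcode, bounds-check it, load the
   jump-table slot, and take an indirect jump, every iteration. */
long run_switch(const unsigned char *code)
{
    long acc = 0;
    for (;;) {
        switch (*code++) {
        case OP_INC:  ++acc; break;
        case OP_DEC:  --acc; break;
        case OP_HALT: return acc;
        default:      return acc;  /* out-of-range opcodes: the bounds check */
        }
    }
}

/* (b) threaded code via computed goto: the "opcodes" are already handler
   addresses, so dispatch is just 1 load plus 1 indirect jump. */
long run_threaded(void)
{
    static void *program[] = { &&inc, &&inc, &&dec, &&halt };
    void **ip = program;
    long acc = 0;
    goto *(*ip++);
inc:  ++acc; goto *(*ip++);
dec:  --acc; goto *(*ip++);
halt: return acc;
}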
True, but I'd like to find a way that this can be done as an optimization.
I'm certain that a forced tail call would work just fine instead of computed goto for this scenario. In fact, I've measured this with LDC and the results are promising, but only with -O2/-O3 (where tail calls are optimized). I'd gladly dig them up if you are interested.
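For reference, a rough sketch of that tail-call dispatch (plain C here rather than D, and the names op_inc/DISPATCH are made up; the LDC measurements above were of the D equivalent). Each handler ends by calling the handler of the next opcode; at -O2/-O3 GCC and Clang compile that tail call into a jump, and a forced tail call (e.g. Clang's __attribute__((musttail))) would guarantee it regardless of optimization level:

typedef long (*handler_t)(const unsigned char *ip, long acc);

static long op_inc (const unsigned char *ip, long acc);
static long op_dec (const unsigned char *ip, long acc);
static long op_halt(const unsigned char *ip, long acc);

static const handler_t handlers[] = { op_inc, op_dec, op_halt };

/* Dispatch to the next instruction as the last thing each handler does;
   with the call in tail position the optimizer emits a jump, not a call. */
#define DISPATCH(ip, acc) return handlers[*(ip)]((ip) + 1, (acc))

static long op_inc (const unsigned char *ip, long acc) { DISPATCH(ip, acc + 1); }
static long op_dec (const unsigned char *ip, long acc) { DISPATCH(ip, acc - 1); }
static long op_halt(const unsigned char *ip, long acc) { (void)ip; return acc; }

long run(const unsigned char *code) { return handlers[*code](code + 1, 0); }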
I'm pretty reluctant to add language features that can be done as optimizations.