The `assoc' example helped focus my attention on a long-unsolved issue
with JIT-generated code, where non-tail calls from JIT-generated code
to other JIT-generated code seemed more expensive than they should be.
This effect showed up in `assq' and `assoc' through a high relative
cost for calling `assq' or `assoc' on a short list (compared to calling
the C implementation).
This time, I finally saw what I've been missing: It's crucial to pair
`call' and `ret' instructions on x86. That won't be news to compiler
writers; it's a basic fact that I missed along the way.
When the JIT generates a non-tail call from to other code that it
generates, it sets up the called procedure's frame directly (because
various computed values are more readily available before jumping to
the called procedure). After setting up the frame --- including a
return address --- the target code was reached using `jmp'. Later, the
`ret' to return from the non-tail call would confuse the processor and
caused stalls, because the `ret' it wasn't matched with its `call'.
It's easy enough to put the return address in place using `call' when
setting up a frame, which exposes the right nesting to the processor.
The enclosed table shows the effect on traditional Scheme
microbenchmarks. Improvements of 20% are common, and several improve by
50% or more. It's difficult to say which real code will benefit, but I
think the improvement is likely to be useful.
| fastest | new | old |
cpstack | 329 | 6596 ms | 1 | 1 | 1.09 | 1.04 |
ctak | 332 | 9487 ms | 1.15 | 1.00 | 1 | 1 |
dderiv | 316 | 6080 ms | 1 | 1 | 1.08 | 1.13 |
deriv | 612 | 5667 ms | 1 | 1 | 1.04 | 1.25 |
div | 393 | 6780 ms | 1.02 | 1 | 1 | 1.01 |
dynamic2 | 738 | 857 ms | 1.01 | 1 | 1 | 1.21 |
earley | 470 | 335 ms | 1.11 | 1 | 1 | 1.02 |
fft | 399 | 5287 ms | 1 | 1 | 1.13 | 1.02 |
graphs | 375 | 5570 ms | 1 | 1 | 1.21 | 1.17 |
lattice2 | 303 | 6201 ms | 1 | 1 | 1.15 | 1.49 |
maze2 | 471 | 4662 ms | 1 | 1 | 1.81 | 1.28 |
mazefun | 379 | 9228 ms | 1.20 | 1 | 1 | 1.25 |
nboyer | 427 | 3177 ms | 1 | 1 | 1.77 | 1.08 |
nestedloop | 403 | 7383 ms | 1.24 | 1 | 1 | 1.08 |
nfa | 551 | 7062 ms | 1 | 1 | 1.90 | 1.30 |
nothing | 331 | 0 ms | 1 | 1 | 2.06 | 1 |
nqueens | 315 | 5586 ms | 1 | 1 | 1.45 | 1.18 |
nucleic2 | 1568 | 13194 ms | 1 | 1 | 1.00 | 1.02 |
nucleic3 | 886 | 13791 ms | 1 | 1 | 1.10 | 1.02 |
paraffins | 314 | 5979 ms | 1 | 1 | 2.06 | 1.04 |
puzzle | 394 | 6848 ms | 1.33 | 1 | 1 | 1.05 |
ray | 394 | 12668 ms | 1.04 | 1 | 1 | 1.07 |
sboyer | 382 | 4430 ms | 1 | 1 | 2.42 | 1.21 |
scheme2 | 481 | 244 ms | 1.22 | 1 | 1 | 1.24 |
tak | 301 | 6626 ms | 1.08 | 1 | 1 | 1.86 |
takl | 372 | 7708 ms | 1 | 1 | 1.72 | 1.65 |
takr | 945 | 3529 ms | 1.27 | 1 | 1 | 1.10 |
takr2 | 521 | 3670 ms | 1.08 | 1 | 1 | 1.13 |
triangle | 309 | 6868 ms | 1 | 1 | 1.15 | 1.51 |
_________________________________________________
For list-related administrative tasks:
http://lists.racket-lang.org/listinfo/dev