[racket-dev] better x86 performance

Matthew Flatt Sun, 24 Apr 2011 16:11:53 -0700

The `assoc' example helped focus my attention on a long-unsolved issue
with JIT-generated code, where non-tail calls from JIT-generated code
to other JIT-generated code seemed more expensive than they should be.
This effect showed up in `assq' and `assoc' through a high relative
cost for calling `assq' or `assoc' on a short list (compared to calling
the C implementation).


This time, I finally saw what I've been missing: It's crucial to pair
`call' and `ret' instructions on x86. That won't be news to compiler
writers; it's a basic fact that I missed along the way.

When the JIT generates a non-tail call from to other code that it
generates, it sets up the called procedure's frame directly (because
various computed values are more readily available before jumping to
the called procedure). After setting up the frame --- including a
return address --- the target code was reached using `jmp'. Later, the
`ret' to return from the non-tail call would confuse the processor and
caused stalls, because the `ret' it wasn't matched with its `call'.
It's easy enough to put the return address in place using `call' when
setting up a frame, which exposes the right nesting to the processor.

The enclosed table shows the effect on traditional Scheme
microbenchmarks. Improvements of 20% are common, and several improve by
50% or more. It's difficult to say which real code will benefit, but I
think the improvement is likely to be useful.

	fastest		new		old
cpstack	329	6596 ms	1	1	1.09	1.04
ctak	332	9487 ms	1.15	1.00	1	1
dderiv	316	6080 ms	1	1	1.08	1.13
deriv	612	5667 ms	1	1	1.04	1.25
div	393	6780 ms	1.02	1	1	1.01
dynamic2	738	857 ms	1.01	1	1	1.21
earley	470	335 ms	1.11	1	1	1.02
fft	399	5287 ms	1	1	1.13	1.02
graphs	375	5570 ms	1	1	1.21	1.17
lattice2	303	6201 ms	1	1	1.15	1.49
maze2	471	4662 ms	1	1	1.81	1.28
mazefun	379	9228 ms	1.20	1	1	1.25
nboyer	427	3177 ms	1	1	1.77	1.08
nestedloop	403	7383 ms	1.24	1	1	1.08
nfa	551	7062 ms	1	1	1.90	1.30
nothing	331	0 ms	1	1	2.06	1
nqueens	315	5586 ms	1	1	1.45	1.18
nucleic2	1568	13194 ms	1	1	1.00	1.02
nucleic3	886	13791 ms	1	1	1.10	1.02
paraffins	314	5979 ms	1	1	2.06	1.04
puzzle	394	6848 ms	1.33	1	1	1.05
ray	394	12668 ms	1.04	1	1	1.07
sboyer	382	4430 ms	1	1	2.42	1.21
scheme2	481	244 ms	1.22	1	1	1.24
tak	301	6626 ms	1.08	1	1	1.86
takl	372	7708 ms	1	1	1.72	1.65
takr	945	3529 ms	1.27	1	1	1.10
takr2	521	3670 ms	1.08	1	1	1.13
triangle	309	6868 ms	1	1	1.15	1.51

_________________________________________________
  For list-related administrative tasks:
  http://lists.racket-lang.org/listinfo/dev

[racket-dev] better x86 performance

Reply via email to