Register allocation/spilling and register volitility

Jeff Clites Mon, 08 Nov 2004 01:04:28 -0800

From other threads:

Now we are placing arguments or return values in registers according to PDD03 and the other end has immediately access to the placed values, because the register file is in the interpreter.

With the indirect addressing of the register frame, this argument passing is probably the most costly part of function calls and returns, because we have to copy the values.

and

Anyway, if you can pop both register frames -and- context structures, you won't run GC too often, and everything will nicely fit into the cache.
I thought about that too, but I'd rather have registers adjacent, so that the caller can place function arguments directly into the callers in arguments.

OTOH it doesn't really matter, if the context structure is in the frame too. We'd just need to skip that gap. REG_INT(64) or I64 is as valid as I0 or I4, as long as it's assured, that it's exactly addressing the incoming argument area of the called function.

A problem with this is that you can't be sure that you can actually have the "next" frame of registers adjacent to the current frame--they might already be taken. Imagine A calls B, then B creates a continuation and stores it in a global, then returns. If A makes another function call, it can't just use the adjacent register frame--it's still "taken". (I'm referring here to the "sliding register window" idea of course.) But this is reminding me of another idea I've had.

I've been thinking for a while about another idea to decrease some of the copying which we are now doing for function calls, as well as the allocation of register sets during subroutine calls, and which would as a side-effect allow us to reduce the need for register spilling.

First, a code snippet. Consider the following C code:

int a(int x) { return b(x); }
int b(int x) { return c(x + 1); }
int c(int x) { return x + 7; }

As compiled on the PPC, this code is very compact--each function is implemented by just one or two instructions, and only one register is used. There are two key factors here: (1) no register preservation is needed, and (2) the call from a() to b() is just a branch, since the call to a() has already loaded the registers with the correct values for the call to b().

By my thinking, register usage falls into three basic categories:

1) Registers used for parameter passing and returns values for function calls.

2) Registers used to hold values which need to be preserved across function calls.

3) Registers used to hold values which do not need to be preserved across function calls.

Really, (1) and (3) are similar--in either case, you have registers whose values are allowed to change across function calls. (For register which hold return values that's obvious, and for those used to pass parameters, it seems fair that these be expected to change across a function call.) So we really have two cases--volatile and non-volatile register usage.

In the PPC ABI, half the registers are treated as volatile, and the other half as non-volatile. For the PPC, this corresponds to caller-preserved v. callee-preserved. Because of continuations[1], parrot can't have callee-preserved registers, but it could still have volatile v. non-volatile registers. Here's what I'm thinking:

Keep the old-scheme registers inside the interpreter structure, *as well as* the new indirect registers. Call the registers in the interpreter I0 to I31, and the indirect registers I32 to whatever.[2] I'll call these the "direct" and "indirect" registers. By their very nature, the direct registers are volatile, and the indirect registers are non-volatile. What does this buy us? The following:

1) Parameter passing and return occur via the established calling conventions, in the volatile registers. No extra copying is needed upon function call or return--the volatile registers are physically the same for the caller and the callee.

2) "Temporary" calculations can use the volatile registers. Values which need preserving can use the non-volatile registers directly.

3) In functions which need no non-volatile registers, there's no need to allocate the indirect registers at all.

4) Cases like my example code above can compile just as compactly as the PPC case. With the volatile Parrot registers mapped to volatile CPU registers (in the PPC case at least), this would be highly efficient: you'd end up with asm similar to what you have in the C case above, and you'd not have a need to save the CPU registers back to Parrot registers (other than those used in the calling conventions) for Parrot function calls out of JIT code.

5) Because this opens things up to more than 32 registers, code could use as many registers as needed. You'd never have register spilling per-se in order to accommodate local variables, though you'd still want a smart register allocator which minimized the number of registers needed (for memory locality as well as minimizing allocation). But a very simple one-non-volatile-register-per-local would function correctly, as a baseline implementation (correct but slow).

The main value here is in the ability to avoid allocation of indirect registers in many cases, minimize copying, and simplify the register allocation policy.

Further notes:

1) For the JIT case, correct code can be emitted to handle direct v. indirect registers (relative to the interpreter pointer, or relative to the new base pointers, depending on the register). For the other cores, we'd either need to generate additional op variants to handle direct v. indirect registers (probably lots more op variants), or we'd need to have register-accessing code be conditional on the register number (slow--lots of branches). The latter approach doesn't bother me for the non-prederef cores (they're slow anyway, and mostly good for testing), but I don't know enough about the prederef cores to see if we could do something efficient, without an op explosion.

2) Since the number of indirect registers could vary from function-to-function, we'd need a syntax to specify how may are needed for a given function--probably an op to allocate the right number of registers, *or* it could somehow be metadata kept as part of the sub itself. With the op approach, for safety upon function call we could swap into the indirect register pointer, a pointer to a zero-filled read-only chunk of memory, to catch attempts to use indirect registers before allocating any. But we'd still have to deal with the issue of how to catch trying to access more registers than you actually allocated. A compiler could ensure that this would never happen, but it would feel dangerous if the VM didn't have a way to catch this, for hand-generated assembly. (On the other hand, a bytecode verifier could catch it.)

3) Really, only the direct registers are "real" registers, philosophically--they're the only ones which live at fixed memory locations. The indirect registers are fulfilling a purpose that other VMs (or calling conventions) would use "the stack" for. By calling them registers, we can use them directly for calculations, but still pretend that we are RISC.

4) This may not be directly relevant to the "slow fib benchmark" problem, *except* that for tail calls this does speed things up because, by definition, you don't need any register preservation for tail calls. This is true (and an advantage) even without actual tail-call optimizations (ie, even if you have the usual frame overhead, at least you don't need to allocate indirect registers). And I believe there's a theorem that all recursive operation can be re-written to be tail-recursive, so it's sort of relevant, though I guess what's important about the fib benchmark isn't specifically that it's recursive, but rather just that it creates a lot of frames (lots of function calls, deep call depth).

Comments of course welcome, and encouraged. Thanks if you read all of that--it ended up longer than I was expecting.

JEff

[1] Callee-preserved means "callee saved and then restored just before return". With continuations, you can jump past the restoration code, and ruin this strategy.

[2] These could be numbered starting from 32, or they could be referred to via another notation indexed from 0--NVI0 for "non-volatile I register 0", or something.

Register allocation/spilling and register volitility

Reply via email to