> Currently, instead of pushing the contents of fixed registers onto the
> stack, the registers themselves float on top of the stack.  That doesn't
> preserve register contents on a push_x, as the registers are now in a new
> memory location.  After toying with fixed registers per interpreter, I
> decided that it probably wasn't worth it in the long haul.  (Since it's
> doubtful, given the size of our register sets, that this would need to be
> done often.)  Instead, I'm proposing a parallel set of data-preserving
> push_x ops, save_x.  Patch after .sig.
>

The biggest problem with save_x is how to retrieve values when you do a
pop_x.  Almost any nested block or subroutine call needs to return at
least one value.  One possibility is a consolidating op-code which works
just like pop but accepts a single register to preserve:

AUTO_OP preserve_i {
  int tmp = INT_REG(P1);   /* save the one register we want to keep */
  Parrot_pop_i(interp);    /* pop the frame as usual */
  INT_REG(P1) = tmp;       /* restore the saved value into the new frame */
}

But that only works when you have exactly one preserved register, which
isn't always the case, especially with nested blocks of code that might
want to modify several of their enclosing scope's variables.

My point is that by itself save_i is useless except for possible
side-effects (such as output).

I've suggested before a SPARC-like register window.  The relevant part
here is that you can do a partial push / pop of half the register set.
This can be thought of as dividing the register set into Input / Output
registers.  A given scope uses all registers.  But when it comes time to
enter a new scope they have to make sure that all output parameters are in
the upper half, then perform a half-push.  When leaving a scope, they have
to make sure that all returnable parameters are in the lower half, then
perform a half-pop.  One advantage is that this is faster than a save_i
instruction since only half the regs are saved.  Additionally we're not
duplicating ALL the registers, just those that we're not using in the nested
scope.  Saves (num_regs/2)*sizeof(reg) memory on each new scope (including
function calls).

The only complex change this would require in the existing code-base is that
register sets couldn't just be linked-lists.  They'd have to exist as arenas
of multiple contiguous reg-sets.  The current reg set would consist of an
arena pointer and an offset into that arena (cached by the reg-set pointer).
If you're at the end of a memory arena, for example, a half-push would
require allocating a new arena with num_regs*num_frames_in_arena then
copying num_regs/2 registers from the top of the old arena to the bottom of
the new arena since a register set has to be contiguous.  Full pushes
shouldn't be affected; if there is only a half-frame left in the current
arena, then it can just skip it and allocate a new arena.

Half-pushes for subroutines are great since you can optimize small
subroutines to work within the limited free "out-regs".  So the following
code:
sub foo($x : int, $y : int ) : int # equiv to C "int foo(int,int)"
{
  return  $x + $y
}
$::z = foo($::x, $::y);

can be:
foo:
  add I16, I16, I17
  return

main:
 ld I16, [x]
 ld I17, [y]
 call foo
 st [z], I16

Additionally:
sub bar($x : int ) : int
{
  return foo( $x, 5 );
}
$::z = bar($::x)

becomes:
bar:
  half_push_i # I16 is now I1, I17 is now I2
  mov I16, I1 # I1 -> I16
  set I17, 5
  call foo
  mov I1, I16 # save output of foo to our output reg
  return
main:
  ld I16, [x]
 call bar
  st [z], I16

Note that in this case we could have additionally optimized bar to:

bar:
  set I17, 5 # append to parameter stack
  call foo
  return

Note this only works when the parameter counts are known.  We'd still
have to resort to something else for dynamic parameters.  I'm partial to
an external general-purpose stack.  For anonymous invocation:

%dispatcher = ( foo => \&foo );

$val = <>;
@params = <>;
$dispatcher{ $val }->( @params );

You'd have to have a generic wrapper function that performed bounds-checking
for the prototype.

foo_generic:
  GetStackFrameSize I16 # SP-FP -> I16
  ne I16, 2, foo_bad_num # invalid num of parameters
  POPStack I16 # throws exception if not type int
  POPStack I17


  call foo # hopefully storing on separate return stack since it's not
           # useful as a register??? Else we need to save reg-stack
  PushStack I16
  return

foo_bad_num:
  exception "invalid num params"
  end
  return

main:
  push_p
  push_s
  PushStackFrame

  # %dispatcher = ( foo => \&foo );
  newHash P1
  ld P2, ["foo_generic"]  # CV
  SetHash P1, "foo", P2

  # $val =<>;
  readln S1, 0 # read stdin

  # @params = <>;
  newArray P3
  readArray P3, 0 # gulp rest into @P3 (@params)

HashFetch  P4, P1, S1  # P4 is presumably the CV

PushStackFlatten P3 # takes each el from P3 and puts on stack
call P4 # invoke the CV fetched from the hash; assumes X16..X32 are volatile
PopStackFrame

##################

I had originally wanted a complete emulation of the SPARC reg-stack,
since it was very elegant.  You'd have two separate register stacks:
what I've described above (the IO stack), and a global stack for
module-specific globals that weren't exported, such as "my int $foo".
This could also accommodate variables that are global across all nested
scopes of a given function.  You would use the IO stack to pass
parameters to and from the nested scopes.  Ex:

package Foo;
my int $x;
sub bar(@a : int) {
  my int $state;
  my int $tmp;
  ... # uses $tmp
  for my int $el ( @a ) {
    my int $k;
    ... # uses $state. Doesn't use $tmp
  }
}

Here the PMC IO stack would contain: P16 = @a.
The int IO stack would contain $el, $k, $tmp, etc. (all temporaries).
The int global stack would contain $state, since the IO stack might grow
and shrink often but will always need this value.

Note that if a subroutine needs any "global" values, then it pushes the
global reg-stack (i.e. there is no half-push for it).  If more than
num_regs is required, then the IO stack can be used as an intermediary
for transferring values between stack frames.

Note also that "my int $x;" is in an odd position.  If it were a
constant, then the op-codes would relegate it to Parrot_int_constant[x].
If it were a global "our int $x", then it would require stash lookups
(which I've previously denoted as "ld In, [x]", where x is probably the
string-constant index).  My guess is that all such references to
local-package variables can be "cached" locally into a register.  To do
this requires that a read be performed at the beginning of a block, and
a write be performed either after each change or, more efficiently, just
prior to any function-calls / end-of-scope, unless it's known that the
value hasn't been modified.  If $x were used in the outer-most scope,
then it might be a pain to continuously push its cached value up the
scope-chain.  Hence, it could be allocated a spot in the global
register-set.  The benefit is exemplified below:

# non-optimized compilation

blockA:
  # compiler assigned I16 to $x
  ld I16, [x]
  # use I16
  st I16, [x]
  # do more stuff like call a function which modifies $x
  ld I16, [x]
  # use I16
  st I16, [x]
blockAA:
  half_push_i # I16 is now I1.  Wanted more registers for nested scope
  # do some stuff
blockAAA:
  half_push_i # needed still more registers
  # compiler assigned I18 to $x
  ld I18, [x]
  # use I18
  st I18, [x]
  # do more stuff like call a function that modifies $x
  half_pop_i
blockAA:
  # more stuff
  half_pop_i
blockA:
  ld I16, [x] # using originally assigned location
  ...
  st I16, [x]

Note that for fast compilation, each access to "x" is wrapped in a
ld/st pair.  Subsequent optimization passes consolidate this into a
minimal number of st..ld segments (ideally having a single ld..st pair
for the entire function).

Note here that we had to waste two register addresses on $x.  A global
register set could have just permanently mapped $x to some value.
Nested scopes wouldn't have needed as many registers (due to variable
duplication) and possibly could have avoided some of the half-pushes.
Note that without a half-push operation each nested scope would have
been horrendous, since the compiler would have to remap all values
non-trivially (not just by subtracting some a-priori value from existing
register-mappings).  Then some complicated method would be required in
order to pass values back.  In the case of $x, the nested scope could
have just saved the value prior to popping, then reloaded it later on.
But this prevents optimizations where a nested scope doesn't call
external code, which could otherwise have reduced the number of "st/ld"
pairs; thus the external copying method would still be required.

Unfortunately, for a dual register set to work (IO and global), one of
two methods is required: either provide a ld/st operation for global
values, OR map the globals and local IO into one address space
accessible by all register-based op-codes.  The ld/st approach reduces
the usefulness of such globals, since you're merely replacing a (ld
int-reg, string-const) with a (ld int-reg, global-reg).  At this point
the usefulness of a function-specific global-reg-set is minimal, and the
global-reg just becomes the resultant calculation of the stash-lookup.
That could have been an optimization for the lookup anyway (since the
string-id could also be an index into the package-local stash-table).
The overhead of the extra op-code could mask any benefits.  Plus you
still have to reserve temporary local registers (first-pass compilers
don't consolidate register mappings; they permanently assign each
variable instance to the next higher register-id).

The more interesting approach of merging the register sets into a common
address space has several implications.  First, no matter how you do it,
it will require some sort of operation in the INT_REG() macro that
performs the mapping.  This will slow down the common code path.
However, I submit that eventually we'll need a bounds-checker on
parameters.  The JIT can map parameters into an n-bit parameter such
that it's impossible to provide an out-of-bounds number.  (E.g. for
add_r_r_r, map 15 bits to the parameters; 5 bits maps to 0..31.)  The
bounds checking prevents hacked code from core-dumping you.  I'd
recommend that parrot never core-dump, but instead throw happy
exceptions which can be caught (say, by an eval).  Thus at a minimum
we'd need:

#define INT_REG(x) ((unsigned)(x) < NUM_REGS ? interp->int_regs[x] : \
                    INTERNAL_ERROR("invalid-register-range"))

Note that INTERNAL_ERROR would have to return 0 to avoid core-dumping if
it didn't immediately farjump (which it doesn't look like it's going to
do).  Unfortunately even that isn't safe, because it allows us to garble
the value of r0 on a write, which might be needed by the exception
handler.  From this we could define:

#define READ_REG_INT(p) ((unsigned)(p) < NUM_IO_REGS ? interp->io_int[p] : \
                         (unsigned)(p) < NUM_IO_REGS + NUM_GLOBAL_REGS ? \
                           interp->global_int[(p) - NUM_IO_REGS] : \
                         (p) == -1 ? 0 : \
                         (p) == -2 ? 1 : \
                         (p) == -3 ? -1 : \
                         INTERNAL_EXCEPTION("register-out-of-bounds"))

#define ASSIGN_REG_INT(p) ((unsigned)(p) < NUM_IO_REGS ? interp->io_int[p] : \
                           (unsigned)(p) < NUM_IO_REGS + NUM_GLOBAL_REGS ? \
                             interp->global_int[(p) - NUM_IO_REGS] : \
                           INTERNAL_EXCEPTION( \
                             "write-register-out-of-bounds"))



This renames the accessor macros but allows us to use constants as register
values.  Thus I-1 really equals zero, I-2 == 1, I-3 == -1.  These three
primary constants allow
r1 += r2
to become
r1 = zero + r2

r1 *= r2
to become
r1 = one * r2

r1 = -r2
to become
r1 = neg_one * r2

Now the above have _r_r_i variants, so 0, 1, -1 could easily have been
the constant values of r3.  But firstly, the floating-point values have
non-trivial constant-accesses.  Secondly, it prevents us from
arbitrarily modifying a register access (either via debug or for faster
compiles).  We don't have any r_i_i ops, thus if r3 were already a
constant, we couldn't modify the operation to make r2 a simple [primary]
constant.

You could carry this further and build in several constants (-4 => pi,
-5 => log2(e), etc.).  But unless there's justification, we're making
the code-base larger (thereby hurting cache performance).

The only transparent way of preventing garbled register values on a
bounds-exception is to insert a dummy register into the register-stack.
We'd essentially subtract one from all logical register values (assuming
I0..I31).  A read-parameter of zero is mapped to the constant 0, 1 is
mapped to reg[1], ..., 100 is mapped to a bounds-exception and reg[0].
A write-parameter of zero is mapped to reg[0] (which can never be read
from).  A write-bounds-exception is also mapped to reg[0].  Thus the
code is probably something like
' p == constX ? constXVal : \
  reg[ INTERNAL_EXCEPTION(".."), 0 ]'


