Hi David, hi Basile,

On 10 July 2015 at 03:53, David Malcolm <dmalc...@redhat.com> wrote:
> FWIW PyPy (an implementation of Python) defaults to using true GC, and
> could benefit from GC support in GCC; currently PyPy has a nasty hack
> for locating on-stack GC roots, by compiling to assembler, then carving
> up the assembler with regexes to build GC metadata.

A first note: write barriers, stack walking, and so on can all be
implemented manually.  The only thing that cannot be implemented
easily is stack maps.

Here is how the PyPy hack works in more detail, in case there is
interest.  It might be possible to do it cleanly with minimal changes
in GCC (hopefully?).

The goal: when a garbage collection occurs, we need to locate and
possibly change the GC pointers in the stack.  (They may have been
originally in callee-saved registers, saved by some callee.)  So this
is about writing some "stack map" that describes where the values are
around all calls in the stack.  To do that, we put in the C sources "v
= pypy_asm_gcroot(v);" for all GC-pointer variables after each call
(at least each call that can recursively end up collecting):


/* The following pseudo-instruction is used by --gcrootfinder=asmgcc
   just after a call to tell gcc to put a GCROOT mark on each gc-pointer
   local variable.  All such local variables need to go through a "v =
   pypy_asm_gcroot(v)".  The old value should not be used any more by
   the C code; this prevents the following case from occurring: gcc
   could make two copies of the local variable (e.g. one in the stack
   and one in a register), pass one to GCROOT, and later use the other
   one.  In practice the pypy_asm_gcroot() is often a no-op in the final
   machine code and doesn't prevent most optimizations. */

/* With gcc, getting the asm() right was tricky, though.  The asm() is
   not volatile so that gcc is free to delete it if the output variable
   is not used at all.  We need to prevent gcc from moving the asm()
   *before* the call that could cause a collection; this is the purpose
   of the (unused) __gcnoreorderhack input argument.  Any memory input
   argument would have this effect: as far as gcc knows the call
   instruction can modify arbitrary memory, thus creating the order
   dependency that we want. */

#define pypy_asm_gcroot(p) ({void*_r; \
        asm ("/* GCROOT %0 */" : "=g" (_r) :       \
         "0" (p), "m" (__gcnoreorderhack));    \
        _r; })


This puts a comment in the .s file, which we post-process.  The goal
of this post-processing is to find the GCROOT comments, see what value
they mention, and track where this value comes from at the preceding
call.  This is the messy part, because the value can often move
around, sometimes across jumps.
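
To make the first step concrete, here is a minimal sketch in C of just
the "find the GCROOT comments and see what value they mention" part.
It is not taken from our actual post-processing tool: the function
name parse_gcroot() is made up for this example, and I assume the
marker appears in the .s file exactly as the macro above emits it.
Tracking the value back to the preceding call, which is the messy
part, is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: look for a GCROOT marker in one line of
   assembler text.  On success, copy the operand text (for example a
   register name or a stack slot) into `out` and return 1.  Return 0
   if the line has no marker or the operand does not fit in `out`. */
static int parse_gcroot(const char *line, char *out, size_t outsize)
{
    static const char tag[] = "/* GCROOT ";
    const char *start, *end;
    size_t len;

    assert(out != NULL && outsize > 0);

    start = strstr(line, tag);
    if (start == NULL)
        return 0;                      /* no marker on this line */
    start += sizeof(tag) - 1;          /* skip past the tag itself */

    end = strstr(start, " */");
    if (end == NULL)
        return 0;                      /* marker is not terminated */

    len = (size_t)(end - start);
    if (len == 0 || len >= outsize)
        return 0;                      /* empty or too long for `out` */

    memcpy(out, start, len);
    out[len] = '\0';
    return 1;
}
```

For a line such as "/* GCROOT -8(%rbp) */" this yields the operand
text "-8(%rbp)"; lines without the marker are skipped.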

We also track if and where the callee-saved registers end up being saved.

At the end we generate some static data: a map from every CALL
location to a list of GC pointers which are live across this call,
written out as a list of callee-saved registers and stack locations.
This static data is read by custom platform-specific code in the stack
walker.
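
As an illustration of the shape of this static data, here is a minimal
sketch in C.  The layout is an assumption made for this example, not
the format we really emit: I assume each entry is keyed by the return
address of the CALL, that the table is sorted by that address, and
that each location is either a callee-saved register index or a byte
offset in the frame.  All names (gc_loc, callsite_entry,
find_callsite) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Where one live GC pointer sits at a call site (assumed encoding). */
enum loc_kind { LOC_CALLEE_SAVED_REG, LOC_STACK_SLOT };

struct gc_loc {
    enum loc_kind kind;
    int32_t value;   /* register index, or byte offset in the frame */
};

/* One entry per CALL: the GC pointers that are live across it. */
struct callsite_entry {
    uintptr_t return_addr;        /* address right after the CALL */
    const struct gc_loc *locs;    /* live GC pointer locations */
    size_t nlocs;
};

/* Binary search in a table sorted by return_addr.  Returns the entry
   for `ret`, or NULL if this address is not a known call site. */
static const struct callsite_entry *
find_callsite(const struct callsite_entry *table, size_t n, uintptr_t ret)
{
    size_t lo = 0, hi = n;

    assert(table != NULL || n == 0);

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (table[mid].return_addr < ret)
            lo = mid + 1;
        else
            hi = mid;
    }
    if (lo < n && table[lo].return_addr == ret)
        return &table[lo];
    return NULL;
}
```

A stack walker would call something like find_callsite() with the
return address found in each frame, then visit every location listed
in the entry.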

This works well enough because, from gcc's point of view, all GC
pointers after a CALL are only used as arguments to "v2 =
pypy_asm_gcroot(v)".  GCC is not allowed to do things like precompute
offsets inside GC objects, because as far as it knows v2 may differ
from v (which is the case if the GC moved the object), and v2 is only
created by the pypy_asm_gcroot() after the call.

The drawback of this "asm" statement (besides being detached from the
CALL) is that, even though we say "=g", a pointer living on the stack
will often be loaded into a register just before the "asm" and spilled
again to a (likely different) stack location afterwards.  This creates
some pointless data movements.  This seems to degrade performance by
at most a few percent, so it's fine for us.

So what would a GCC-supported solution look like?  Maybe a single
builtin that does a call and at the same time "marks" some local
variables (for read/write).  It would be enough if a CALL emitted from
this builtin is immediately followed by an assembler
pseudo-instruction that describes the location of all the local
variables listed (plus context information: the current stack frame's
depth, and where callee-saved registers have been saved).  This would
mean the user of this builtin still needs to come up with custom tools
to post-process the assembler, but it is probably the simplest and
most flexible solution.  I may be wrong in thinking any of this
would be easy, though...


A bientôt,

Armin.
