OK. This is another design write-through, this time concerning options for object forwarding in a concurrent collector.
I've been thinking a lot about forwarding, and it's really bugging me. Object relocation (for whatever reason) is really the only thing making it hard to run mutators and collectors concurrently. If we can work out how to do that with tolerable overhead, a lot of things become easier. I keep thinking that there must be a trick that I'm not seeing, but I don't think there is. Benjamin Zorn's dissertation is pretty comprehensive, though the lazy read barrier idea has some appeal. Heh. I'd forgotten about his work (approaching 50 - the brain rots), and I should probably just give Benjamin a call.

The options are pretty clear. We can add a read barrier to most object reference loads, or we can use a page fault and require that every object reference load instruction be a safe point, or we can use dynamic translation. Or we can, of course, use some combination. But there is some flexibility about *when* to add the read barriers, and I haven't seen that part discussed in any of the literature. Let me run through this.

I've been considering the problem of background evacuation in old space. The techniques I'm about to describe will certainly *work* for the nursery, but their overhead seems pretty high.

*Issues*

The issues here are:

1. How frequently are we relocating? That is: what percentage of the time are mutators *potentially* hazarded by object relocation "faults"? It would be a shame to carry around the code size and cost of a barrier mechanism if we are only using it 0.1% of the time. (Term: *barriered execution ratio*.) Is relocation a separate collector phase, or does it happen continuously?

2. What is the dynamic *likelihood* of actually referencing an object that is being relocated? Assuming we can come up with a workable mechanism to allow the mutator to proceed after a read barrier "fault", we become progressively more willing to shift the cost of the read barrier to a higher-cost mechanism (e.g. a page fault) that is dynamically rare. If we can manage it, we'd like to avoid evacuating hot objects. (Term: *hazarded reference ratio*.)

3. I suspect that we *really* want to be clearing a known address range whose size is some power of two, where *all* objects in that region are being evacuated. It gives a broader range of read barrier encodings. Also, if we have to check the target of the reference for a "forwarded" bit, we end up polluting the cache with a line from an object that isn't actually there. The power-of-two requirement comes from a desire to reduce the number of instructions in the read barrier, and also from the ability to use the paging mechanism to assist us (at least on some machines).

Have I missed any big issues?

*Read Barriers*

If the read barrier is implemented using code that is permanently emitted in the mutator, the simplest read barrier implementations turn object reference loads into three instructions:

    load            (object reference value from memory)
    test            (check for forwarding)
    jump-not-taken  (recovery path)

The barrier fence value almost certainly ends up living in a register, because object reference loads are frequent. This means that there are two costs to the mechanism:

- A code bloat cost of about 16%. I'm assuming that 8% of instructions are object reference loads. Could be a little more or a little less.

- Effective loss of a hardware register. The impact of this on modern machines isn't that big. On older Pentiums it would suck.

The register loss can be eliminated if we are prepared to rewrite the code on the fly (e.g. by patching a barrier value), but if barrier hazards are frequently in effect we'll end up doing an awful lot of rewriting. This would be an example of a problem for which dynamic binary translation is well suited.
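To make that concrete, here is a minimal C sketch of the kind of test I mean, assuming the power-of-two region layout from issue (3). The names hazard_base, hazard_mask, and resolve_forward are placeholders; emitted code would keep the fence value in a reserved register rather than in memory:

    #include <stdint.h>

    /* Placeholder globals: base and mask of the power-of-two region
       currently being evacuated. In emitted code the fence value would
       live in a dedicated register. */
    extern uintptr_t hazard_base;
    extern uintptr_t hazard_mask;

    /* Placeholder recovery path: chase the forwarding pointer and
       repair the slot so we do not trip on it again. */
    extern void *resolve_forward(void **slot, void *ref);

    static inline void *load_ref(void **slot)
    {
        void *ref = *slot;                                  /* load */
        if (((uintptr_t)ref & hazard_mask) == hazard_base)  /* test */
            ref = resolve_forward(slot, ref);               /* rare recovery path */
        return ref;
    }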
Note that on out-of-order machines the cost of this is *just* code bloat. The run-time cost of the extra instructions will happily disappear into the out-of-order retirement queues. On single-issue machines (notably older ARM machines and other current embedded cores), the cost of this sequence is high.

Note also that read barriers do not need to occur on every object reference load. If we know (statically) that we are going to read a reference from the next object down, we can (at static compile time) direct the mutator to a recovery path that does a forwarding sweep on the target object. This requires some rendezvous assumptions between mutator and collector, but nothing dramatic and nothing that requires a global pause for synchronization. The point is that some read barriers serve as guards that make other read barriers unnecessary, which reduces the total code and performance impact. This optimization can be combined with the dynamic translation optimization below.

*Page Faults*

The attractions of the page fault approach are:

- Runtime cost is incurred only by hazarded references that are dynamically visited.
- Multiple blocks can be evacuated simultaneously.
- No code bloat.

The disadvantage is that on most operating systems the cost of the page fault transition is *so* high as to overwhelm these attractions. Still, if the hazarded reference ratio is low enough, a barrier using page faults is attractive.

Note that I'm actually *not* concerned about the cost of mapping operations in the OS here, but this is because I am thinking in terms of low-and-slow background evacuation. I'm also not concerned about mapping costs in the nursery, because the nursery is thread-local, and I tend to assume that we will make the mutator scan its own nursery. In that case we end up with a per-thread pause time, but the use of a cooperative scanning method means that we don't need to page fault.

*Naive Strategy*

The page fault is incurred on the first reference into the object being relocated. The object reference must therefore be in a register. Update the register to point to the new object location and let the mutator continue. As a special case, we *require* that at least one read occur between the load of a reference field and a copy of that reference back to the heap. This ensures that forwarded pointers only get copied in translated form. The performance of this strategy, to put matters delicately, *sucks*.

*Better Strategy*

Use the page fault as a detection mechanism only. Do absolutely nothing in the page fault handler itself, but cause the mutator to resume into a dynamic binary translator. The translator inserts read barriers on the fly until the mutator reaches a suitable point, after which it resumes normal-mode execution. We can optionally discontinue translation whenever the reference count on a forwarding object goes to zero. It is correct to resume normal execution early; the worst we will do is fault again.

The simplest way to implement this is to keep a side table of marker bits for target instructions. Any hardware load instruction that loads an object reference has a bit set in the type tag table. In effect, we are using the side table to identify object reference loads vs. data loads. This lets the dynamic translator know which loads require read barriers and which do not.
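A minimal sketch of that side table, assuming one marker bit per code byte (the granularity is an assumption, and code_base and ref_load_bits are placeholder names). The translator consults this when it encounters a load:

    #include <stdint.h>

    /* Placeholder side table: one bit per code byte, set when the
       instruction starting at that byte is an object reference load. */
    extern const uint8_t ref_load_bits[];
    extern const uintptr_t code_base;

    /* Consulted by the translator: does the load at 'pc' need a read
       barrier inserted, or is it a plain data load to pass through? */
    static inline int is_ref_load(uintptr_t pc)
    {
        uintptr_t off = pc - code_base;
        return (ref_load_bits[off >> 3] >> (off & 7)) & 1;
    }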
Note that if the read barrier handler does "tier one forwarding", we can significantly reduce the number of barriers inserted into the mutator, similar to the proposal above. That is: whenever we update a forwarded pointer, we scan the reference fields of the *target* object and update those proactively and cooperatively. As I have noted elsewhere, this allows us to exploit the "known live" relationships to reduce the number of read barriers that must be emitted in the mutator.

*Dynamic Binary Translation*

This is perhaps the simplest option to explain, and in light of the page fault latency it may well be the best approach: just do dynamic binary translation all the time. Please note that I do *not* mean JIT here, so let me explain.

A JIT system generally translates from a byte code form to native instructions. That's a fine solution, and if you can do it fast enough then you don't need dynamic binary translation at all. The problem with a JIT system, from our perspective, is that it is relatively expensive. In particular, the stack-oriented nature of the byte code instruction sets means that the JIT system is obliged to do register allocation, and *that* is expensive.

When I refer here to dynamic binary translation, I'm referring to something more along the lines of the HDTrans DBT, which is a very low-level scheme. In HDTrans, most input instructions are "translated" by transparent pass-through, and the main problem is avoiding anything that might damage the condition codes. In this case, we can do an exceptionally efficient job, because only the object reference load instructions need to be translated. Our experience with HDTrans suggests that the trace linearization side effect of translation generally offsets the cost of performing the translation. This doesn't remove the overhead of the read barrier, but it does mean that using dynamic translation in order to *insert* the read barrier doesn't add a net runtime cost in practice.

So the net of this option is that the collector says "I'm planning to start forwarding objects". As the mutators arrive at safe points, they flush their DBT cache and start doing DBT with read barriers. Or alternatively they maintain two DBT caches. Each mutator in turn signals back that it has switched modes. When all mutators have switched modes the collectors can start forwarding objects on the fly.
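Roughly, that handshake might look like the following in C11 atomics. This is a sketch only, assuming a simple counter-based rendezvous; flush_dbt_cache and num_mutators are placeholders:

    #include <stdatomic.h>
    #include <stdbool.h>

    extern int num_mutators;
    extern void flush_dbt_cache(void);       /* placeholder */

    static atomic_bool barriers_requested;   /* set by the collector */
    static atomic_int  mutators_switched;    /* counts acknowledgments */
    static _Thread_local bool switched;      /* per-mutator mode flag */

    /* Mutator side: polled at each safe point. */
    void safepoint_poll(void)
    {
        if (atomic_load(&barriers_requested) && !switched) {
            flush_dbt_cache();  /* or: swap to the second, barriered cache */
            switched = true;
            atomic_fetch_add(&mutators_switched, 1);
        }
    }

    /* Collector side: announce, then wait for every mutator to switch. */
    void begin_forwarding(void)
    {
        atomic_store(&barriers_requested, true);
        while (atomic_load(&mutators_switched) < num_mutators)
            ;  /* spin; a real implementation would block */
        /* now safe to start forwarding objects on the fly */
    }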
*Optimizations and Choices*

All of the fast strategies here rely on low-level dynamic translation. The main purpose of the page fault mechanism, if you *have* a binary translator, is to tell you when the translator needs to be triggered. Whether that's a good tool from a performance perspective is something that needs measurement, but my expectation is that use of page faults *is* a good plan. Here is why:

Old-space objects basically don't die. When they do, they tend to die in well-localized clusters (i.e. a bunch of objects in a single block die all at once). In the typical case, the resulting recycled block remains 90% utilized, and the resulting free space is likely to consist of a small number of contiguous blocks. Unless we are under extreme pressure for free blocks, such a block is more useful as a *target* of coalescing than as a block to be coalesced. Blocks that are good candidates for evacuation have a smaller number of live lines in them and/or a higher degree of fragmentation.

It isn't just liveness patterns that follow allocation order. Reference patterns also tend to follow allocation order. Basically, if a block is old enough to have a lot of free lines, it is probably not a "hot" block. If it is not hot, then forcing the mutator to incur a single page fault in order to force a DBT mode switch on first access may not be excessively expensive.

On systems that provide hardware virtualization support, it is often possible to use the two-layer translation scheme to gain direct access to page reference information, and also to the hardware mapping structures. Both of these facilitate a page fault handling and remapping mechanism that can be much faster than (say) Linux or Windows.

There are a number of ways to do the mutator end of the work more incrementally, most notably by patching return addresses and doing the work in the mutator one frame at a time. With DBT I'm not even sure that's necessary, though.

One thing I like about all of this is that there is a range of viable options here that do not rely on particular hardware features.

OK. Done with my dump on forwarding. Need to review Zorn's dissertation to see what I missed on this.

shap
