Oh, wait. This isn't about improving branch prediction. This is about
eliminating the read barriers when conditions do not require them. The
inlined portion of the read barrier goes away *entirely*. This reduces
dynamic instruction counts by 28% to 36% compared to code that contains
read barrier checks, and eliminates a large number of hard-to-schedule
data-dependent loads. The overhead of the translation itself is offset by
the fact that the translation process performs trace linearization as a
side effect. With a little bit of care, the translator can also keep track
of which registers currently hold object references.
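To make that concrete: here is a minimal sketch, in C, of the two code
sequences such a translator would swap between. The object layout and
names are hypothetical, not from any particular implementation; the
barrier shown is a conditional forwarding check of the Brooks/Baker
flavor.

#include <stdint.h>

/* Hypothetical object layout: the low bit of the forwarding word
 * means "this object has been evacuated; follow the pointer". */
typedef struct Obj {
    uintptr_t fwd;      /* forwarding word, tagged in its low bit */
    struct Obj *field;  /* some reference-valued field */
} Obj;

#define FORWARDED ((uintptr_t)1)

/* Barrier-enabled sequence: every reference load is guarded by a
 * data-dependent check of the source object's forwarding word. This
 * is the hard-to-schedule load chain the translation eliminates. */
static inline Obj *load_with_barrier(Obj *o) {
    if (o->fwd & FORWARDED)
        o = (Obj *)(o->fwd & ~FORWARDED);
    return o->field;
}

/* Barrier-free sequence: what the translator emits when no concurrent
 * recoloring/evacuation is in progress. The check is simply gone. */
static inline Obj *load_without_barrier(Obj *o) {
    return o->field;
}

The switch happens at translation granularity rather than via a per-load
branch, which is why the savings show up as eliminated instructions
rather than as better-predicted ones.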
shap

On Sun, Oct 20, 2013 at 6:04 PM, Sandro Magi <[email protected]> wrote:

> Re: dynamic translation for read barriers, my initial impression is
> skepticism that the improved branch prediction of a dynamically swapped
> read barrier would outweigh the cost of i-cache flushes. Has this sort
> of design actually been measured?
>
> Sandro
>
>
> "Jonathan S. Shapiro" <[email protected]> wrote:
>
> OK. It's time for me to step back from worrying about GC design and get
> back to actual work. I've internalized a bunch of stuff, I've managed to
> catch up on the last 15 years or so of the literature (almost), and I've
> gotten a lot from all of you in the process. I'm going to send Tony and
> Eliot a note arising from this discussion, and try to encourage a more
> focused discussion of read barriers, forwarding, and concurrent design
> tradeoffs in the next edition of *The Garbage Collection Handbook*.
> For now, let me try to run down where I am.
>
> I'm very impressed by the current state of RC collectors, and I've
> talked elsewhere about how this approach opens up a number of design and
> expressiveness options. The key to variable-weight and/or
> performance-isolated collection is to introduce [possibly dynamic] heap
> membranes and keep track of the object references that cross membrane
> boundaries. Other techniques can do this, but hybrid-RC designs seem to
> do it without multiplying mechanism, especially when some variation of
> the immix heap organization is adopted.
>
> We've talked a lot about GC performance, and I've come to feel that we
> are using the wrong metrics. Safe-language throughput seems to be a
> mostly solved problem at this point, by which I mean both that (a) the
> difference relative to explicit storage management is now very small,
> and (b) any inherent cost of safety is now low enough that I'm willing
> to pay it, and I think a lot of other people are too. The question of
> pause times is a different matter, and I've become convinced that by
> focusing on pause times we have been framing the question the wrong
> way. I think we should be focusing on three variables:
>
>    - A time interval (T). Moving closer to hard real-time applications
>    may require shorter intervals.
>    - Mutator *progress* during each time interval (P). Not just "what
>    percentage of time is spent in the mutator" but also "how much
>    real-world computational progress is it making?" Spending 10% more
>    time in the mutator while adding 12% overhead to the mutator isn't
>    winning.
>    - *Variance* of that progress over that time interval (V).
>
> We would like flexibility in choosing T, possibly at some cost in the
> rate of progress P, the variance of progress V, or both. Unfortunately
> our collection needs are also bounded by allocation rate and mutation
> rate (e.g. through logging), so this isn't a complete characterization.
> If we can arrive at a solution that makes these things somewhat
> tunable, I don't think we should be worrying about the last 2% of
> throughput performance.
>
> On parallel (as opposed to concurrent) collectors we are now looking at
> <50ms pause times on small-scale multiprocessors, with 10% or less
> (typically) of total time spent in storage management [*]. While we
> incur both code bloat and execution overhead in barrier code, that is
> made up (relative to conventional allocators) in reduced L1 cache miss
> rates. The code bloat is *also* made up (usually) by decreases in the
> overhead information that must be recorded on each allocation.
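Picking up the P and V variables from the list above: one concrete way
to read them is as a measurement loop. A minimal sketch, assuming the
application exposes its own work counter (frames rendered, requests
served); all names here are hypothetical:

#include <stddef.h>

/* Accumulates per-interval progress samples: each sample is the
 * application-level work completed during one interval of length T. */
typedef struct {
    double sum, sum_sq;
    size_t n;
} ProgressStats;

static void record_interval(ProgressStats *s, double work_done) {
    s->sum    += work_done;
    s->sum_sq += work_done * work_done;
    s->n++;
}

/* P: mean progress per interval. Assumes n > 0. */
static double mean_P(const ProgressStats *s) {
    return s->sum / (double)s->n;
}

/* V: population variance of progress across intervals. */
static double var_V(const ProgressStats *s) {
    double m = mean_P(s);
    return s->sum_sq / (double)s->n - m * m;
}

The point of phrasing it this way is that a collector which adds 12%
overhead to the mutator while giving back 10% of its time shows up
immediately as a drop in P, even though "percent of time spent in the
mutator" looks fine.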
> I've been talking, in the background, to people I know who build
> production video games and codecs. Faced with the prospect of a 50ms
> pause, they mostly didn't care. Several of them noted that current
> context switch and page fault times often exceed this already, though
> the "all threads" aspect of the pause may have implications they
> weren't considering. *It is not obvious that on-the-fly collection is
> better for real-time applications, nor that it is required for them.*
>
> [*] The exception is applications that bulk-allocate and
> bulk-deallocate durable structures, Xalan being a notable example. I
> believe we all agree that there are cases where the programmer knows
> what their needs are, and the allocator could cooperate with those
> needs if a means existed to express them. This won't solve all cases,
> but it will solve a lot of them.
>
> Almost all of the mutator performance in these discussions hinges on
> read barriers: do we need them, and if so, do they need to be
> fine-grained (i.e. inlined with the load instructions) or can they be
> done with page protections? Here are the answers I have come to (so
> far):
>
>    1. Read barriers of some form are required on any sub-heap where
>    collectors and mutators may operate concurrently and
>    evacuation/relocation may be performed. This means that it is
>    *extremely* desirable to use thread-local nursery allocation.
>    2. Fine-grain read barriers are *sometimes* required if the mutator
>    participates cooperatively in the maintenance of the tri-color (or
>    K-color) invariant. It depends on the type of mutator.
>    3. Fine-grain read barriers are required if heap regions that are
>    subject to recoloring and/or evacuation are "reference hot",
>    because the page-fault approach isn't cost-effective there.
>    4. Fine-grain read barriers are required in practice if nurseries
>    are not thread-local.
>
> Taken in combination, these conditions mean that read barriers are
> needed for any on-the-fly collection scheme, and also for any fully
> concurrent (as opposed to *mostly* concurrent) scheme.
>
> I've noted in discussion that there are ways to emit code in which
> read barriers can be turned on and off dynamically through low-level
> dynamic translation. This prospect gives us a late-binding option on
> the choice of collector integration strategy, which strikes me as
> elegant. It also means that we can separate many of the GC integration
> issues from the compiler entirely. We can instead make our choices
> late, on the basis of how hot the forwarded objects are, what
> proportion of mutator execution time is overlapped with collector
> execution, the dynamic frequency and locality with which the
> conditions that *trigger* read barriers arise, and our target
> objectives for P, V, and T.
>
> I have been focusing a lot on relocating collectors, mainly because
> evacuation of some form seems to be a necessary activity for heap
> defragmentation. It isn't clear that this is really required. In the
> immix design, for example, we're already prepared to do chunked bump
> allocation. It is conceivable that a block could be considered "free"
> if it is "free enough" without being totally cleared. If that is
> possible, then the need for fine-grain read barriers goes away. It is
> also possible that we could size-segregate the blocks in the tenured
> heap to control fragmentation.
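The "free enough" idea in that last quoted paragraph is easiest to see
in code. A minimal sketch of chunked bump allocation over a partially
occupied block, assuming immix-style line marks; the sizes and names
are illustrative, not taken from the immix paper:

#include <stddef.h>
#include <stdint.h>

#define LINE_SIZE   128
#define BLOCK_LINES 256   /* a 32 KiB block divided into lines */

typedef struct Block {
    uint8_t line_marked[BLOCK_LINES];  /* set by the marker */
    char    data[BLOCK_LINES * LINE_SIZE];
} Block;

/* Bump-allocate into the next contiguous run ("chunk") of unmarked
 * lines. Real immix bump-allocates bytes within a hole; this rounds
 * to whole lines for brevity. */
static void *chunked_bump_alloc(Block *b, size_t size, size_t *cursor) {
    size_t need = (size + LINE_SIZE - 1) / LINE_SIZE;
    for (size_t i = *cursor; i + need <= BLOCK_LINES; i++) {
        size_t run = 0;
        while (run < need && !b->line_marked[i + run])
            run++;
        if (run == need) {            /* found a big-enough hole */
            *cursor = i + need;
            return &b->data[i * LINE_SIZE];
        }
        i += run;                     /* skip past the marked line */
    }
    return NULL;  /* no hole left; caller moves to another block */
}

The point of the sketch: a block never has to be completely empty to be
recycled, it only needs a big-enough run of unmarked lines, which is
exactly what would relieve the pressure to evacuate for
defragmentation.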
> It is also possible that inter-operation with C creates enough dynamic
> restrictions on relocation (through pinning) that relocation isn't
> that helpful. The pressure to relocate is a function of total heap
> size pressure, which is inversely related to available system DRAM. An
> unfortunate implication of this is that small platforms are more
> likely to use read barriers (with their associated code bloat) than
> large platforms.
>
> The practical benefit of on-the-fly collection (as opposed to
> concurrent collection) seems uncertain. When all is said and done,
> there are only so many hardware threads of execution available, and
> the activity of those hardware threads is going to go back and forth
> between mutating and collecting. The "flip" can be implemented
> voluntarily in the mutator or implicitly in the operating system. The
> cost of doing it implicitly is (a) the need for a fine-grain read
> barrier, (b) loss of mutator control over when collection happens, and
> (c) lost ability to drive collection in finer-grain increments. The
> primary *advantage* of on-the-fly is that it allows collection to
> proceed during long I/O waits.
>
> Once a fine-grain read barrier is required, it is worth considering
> how many of the checks previously performed by write barriers can
> instead be moved to read barriers. It is also worth considering the
> notion of a "read-write barrier": combining the read and write
> barriers when we are reading a reference to an object that we know we
> intend to write. I haven't spent much time on either question; the
> high-order concern was to determine whether a fine-grain read barrier
> was required at all.
>
> Once a fine-grain read barrier is required, you have in some respects
> "paid the piper", and you might as well get all of the advantage out
> of it that you can.
>
> Last, but not least, I note that one of the challenges in deferred
> reference counting is that it makes on-the-fly collection a bit
> tricky. The essence of it is that something analogous to a tri-color
> invariant needs to be maintained on the deferred decrement log, so
> that mutators can go through a "phase agreement" with the collector.
> It could be that this is done using the tri-color invariant itself.
>
> shap
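On the deferred-RC point at the end of the quoted message: a minimal
sketch of what a "phase agreement" on the decrement log might look
like, using epoch-stamped per-mutator buffers. The structure is
entirely hypothetical; it is meant only to show where the
tri-color-like invariant on the log bites:

#include <stdatomic.h>
#include <stddef.h>

typedef struct Obj Obj;

#define LOG_CAP 1024

/* Hypothetical per-mutator deferred-decrement log. Decrements are
 * buffered here instead of being applied eagerly. */
typedef struct {
    Obj     *buf[LOG_CAP];
    size_t   n;
    unsigned epoch;          /* collector phase this buffer belongs to */
} DecLog;

extern atomic_uint collector_epoch;        /* advanced at each phase flip */
void flush_log_to_collector(DecLog *log);  /* hands the buffer over */

/* The mutator side of the agreement: a mutator may not keep appending
 * to a buffer the collector already treats as closed, or decrements
 * would be processed in the wrong phase: the analogue of a tri-color
 * violation, applied to the log rather than the heap. */
void defer_decrement(DecLog *log, Obj *o) {
    unsigned e = atomic_load(&collector_epoch);
    if (log->n == LOG_CAP || log->epoch != e) {
        flush_log_to_collector(log);       /* close out the old phase */
        log->n = 0;
        log->epoch = e;
    }
    log->buf[log->n++] = o;
}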
