Oh, wait. This isn't about improving branch prediction. This is about
eliminating the read barriers when conditions do not require them. The
inlined portion of the read barrier goes away *entirely*. This reduces
dynamic instruction counts by 28% to 36% compared to code that contains
read barrier checks, and eliminates a large number of hard-to-schedule
data-dependent loads. The overhead of the translation itself is offset by
the fact that the translation process performs trace linearization as a
side effect. With a little bit of care, the translator can also keep track
of which registers currently hold object references.
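To make that concrete: here is a minimal sketch, in C, of the two code
sequences such a translator would swap between. The object layout and
names are hypothetical, not from any particular implementation; the
barrier shown is a conditional forwarding check of the Brooks/Baker
flavor.

#include <stdint.h>

/* Hypothetical object layout: the low bit of the forwarding word
 * means "this object has been evacuated; follow the pointer". */
typedef struct Obj {
    uintptr_t fwd;      /* forwarding word, tagged in its low bit */
    struct Obj *field;  /* some reference-valued field */
} Obj;

#define FORWARDED ((uintptr_t)1)

/* Barrier-enabled sequence: every reference load is guarded by a
 * data-dependent check of the source object's forwarding word. This
 * is the hard-to-schedule load chain the translation eliminates. */
static inline Obj *load_with_barrier(Obj *o) {
    if (o->fwd & FORWARDED)
        o = (Obj *)(o->fwd & ~FORWARDED);
    return o->field;
}

/* Barrier-free sequence: what the translator emits when no concurrent
 * recoloring/evacuation is in progress. The check is simply gone. */
static inline Obj *load_without_barrier(Obj *o) {
    return o->field;
}

The switch happens at translation granularity rather than via a per-load
branch, which is why the savings show up as eliminated instructions
rather than as better-predicted ones.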
shap

On Sun, Oct 20, 2013 at 6:04 PM, Sandro Magi <[email protected]> wrote:

> Re: dynamic translation for read barriers, my initial impression is
> skepticism that the improved branch prediction of a dynamically swapped
> read barrier would outweigh the cost of i-cache flushes. Has this sort
> of design actually been measured?
>
> Sandro
>
>
> "Jonathan S. Shapiro" <[email protected]> wrote:
>
> OK. It's time for me to step back from worrying about GC design and get
> back to actual work. I've internalized a bunch of stuff, I've managed to
> catch up on the last 15 years or so of the literature (almost), and I've
> gotten a lot from all of you in the process. I'm going to send Tony and
> Eliot a note arising from this discussion, and try to encourage a more
> focused discussion of read barriers, forwarding, and concurrent design
> tradeoffs in the next edition of *The Garbage Collection Handbook*.
> For now, let me try to run down where I am.
>
> I'm very impressed by the current state of RC collectors, and I've
> talked elsewhere about how this approach opens up a number of design and
> expressiveness options. The key to variable-weight and/or
> performance-isolated collection is to introduce [possibly dynamic] heap
> membranes and keep track of the object references that cross membrane
> boundaries. Other techniques can do this, but hybrid-RC designs seem to
> do it without multiplying mechanism, especially when some variation of
> the immix heap organization is adopted.
>
> We've talked a lot about GC performance, and I've come to feel that we
> are using the wrong metrics. Safe-language throughput seems to be a
> mostly solved problem at this point, by which I mean both that (a) the
> difference relative to explicit storage management is now very small,
> and (b) any inherent cost of safety is now low enough that I'm willing
> to pay it, and I think a lot of other people are too. The question of
> pause times is a different matter, and I've become convinced that by
> focusing on pause times we have been framing the question the wrong
> way. I think we should be focusing on three variables:
>
>    - A time interval (T). Moving closer to hard real-time applications
>    may require shorter intervals.
>    - Mutator *progress* during each time interval (P). Not just "what
>    percentage of time is spent in the mutator" but also "how much
>    real-world computational progress is it making?" Spending 10% more
>    time in the mutator while adding 12% overhead to the mutator isn't
>    winning.
>    - *Variance* of that progress over that time interval (V).
>
> We would like flexibility in choosing T, possibly at some cost in the
> rate of progress P, the variance of progress V, or both. Unfortunately
> our collection needs are also bounded by allocation rate and mutation
> rate (e.g. through logging), so this isn't a complete characterization.
> If we can arrive at a solution that makes these things somewhat
> tunable, I don't think we should be worrying about the last 2% of
> throughput performance.
>
> On parallel (as opposed to concurrent) collectors we are now looking at
> <50ms pause times on small-scale multiprocessors, with 10% or less
> (typically) of total time spent in storage management [*]. While we
> incur both code bloat and execution overhead in barrier code, that is
> made up (relative to conventional allocators) in reduced L1 cache miss
> rates. The code bloat is *also* made up (usually) by decreases in the
> overhead information that must be recorded on each allocation.
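Picking up the P and V variables from the list above: one concrete way
to read them is as a measurement loop. A minimal sketch, assuming the
application exposes its own work counter (frames rendered, requests
served); all names here are hypothetical:

#include <stddef.h>

/* Accumulates per-interval progress samples: each sample is the
 * application-level work completed during one interval of length T. */
typedef struct {
    double sum, sum_sq;
    size_t n;
} ProgressStats;

static void record_interval(ProgressStats *s, double work_done) {
    s->sum    += work_done;
    s->sum_sq += work_done * work_done;
    s->n++;
}

/* P: mean progress per interval. Assumes n > 0. */
static double mean_P(const ProgressStats *s) {
    return s->sum / (double)s->n;
}

/* V: population variance of progress across intervals. */
static double var_V(const ProgressStats *s) {
    double m = mean_P(s);
    return s->sum_sq / (double)s->n - m * m;
}

The point of phrasing it this way is that a collector which adds 12%
overhead to the mutator while giving back 10% of its time shows up
immediately as a drop in P, even though "percent of time spent in the
mutator" looks fine.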
> I've been talking, in the background, to people I know who build
> production video games and codecs. Faced with the prospect of a 50ms
> pause, they mostly didn't care. Several of them noted that current
> context switch and page fault times often exceed this already, though
> the "all threads" aspect of the pause may have implications they
> weren't considering. *It is not obvious that on-the-fly collection is
> better for real-time applications, nor that it is required for them.*
>
> [*] The exception is applications that bulk-allocate and
> bulk-deallocate durable structures, Xalan being a notable example. I
> believe we all agree that there are cases where the programmer knows
> what their needs are, and the allocator could cooperate with those
> needs if a means existed to express them. This won't solve all cases,
> but it will solve a lot of them.
>
> Almost all of the mutator performance in these discussions hinges on
> read barriers: do we need them, and if so, do they need to be
> fine-grained (i.e. inlined with the load instructions) or can they be
> done with page protections? Here are the answers I have come to (so
> far):
>
>    1. Read barriers of some form are required on any sub-heap where
>    collectors and mutators may operate concurrently and
>    evacuation/relocation may be performed. This means that it is
>    *extremely* desirable to use thread-local nursery allocation.
>    2. Fine-grain read barriers are *sometimes* required if the mutator
>    participates cooperatively in the maintenance of the tri-color (or
>    K-color) invariant. It depends on the type of mutator.
>    3. Fine-grain read barriers are required if heap regions that are
>    subject to recoloring and/or evacuation are "reference hot",
>    because the page-fault approach isn't cost-effective there.
>    4. Fine-grain read barriers are required in practice if nurseries
>    are not thread-local.
>
> Taken in combination, these conditions mean that read barriers are
> needed for any on-the-fly collection scheme, and also for any fully
> concurrent (as opposed to *mostly* concurrent) scheme.
>
> I've noted in discussion that there are ways to emit code in which
> read barriers can be turned on and off dynamically through low-level
> dynamic translation. This prospect gives us a late-binding option on
> the choice of collector integration strategy, which strikes me as
> elegant. It also means that we can separate many of the GC integration
> issues from the compiler entirely. We can instead make our choices
> late, on the basis of how hot the forwarded objects are, what
> proportion of mutator execution time is overlapped with collector
> execution, the dynamic frequency and locality with which the
> conditions that *trigger* read barriers arise, and our target
> objectives for P, V, and T.
>
> I have been focusing a lot on relocating collectors, mainly because
> evacuation of some form seems to be a necessary activity for heap
> defragmentation. It isn't clear that this is really required. In the
> immix design, for example, we're already prepared to do chunked bump
> allocation. It is conceivable that a block could be considered "free"
> if it is "free enough" without being totally cleared. If that is
> possible, then the need for fine-grain read barriers goes away. It is
> also possible that we could size-segregate the blocks in the tenured
> heap to control fragmentation.
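The "free enough" idea in that last quoted paragraph is easiest to see
in code. A minimal sketch of chunked bump allocation over a partially
occupied block, assuming immix-style line marks; the sizes and names
are illustrative, not taken from the immix paper:

#include <stddef.h>
#include <stdint.h>

#define LINE_SIZE   128
#define BLOCK_LINES 256   /* a 32 KiB block divided into lines */

typedef struct Block {
    uint8_t line_marked[BLOCK_LINES];  /* set by the marker */
    char    data[BLOCK_LINES * LINE_SIZE];
} Block;

/* Bump-allocate into the next contiguous run ("chunk") of unmarked
 * lines. Real immix bump-allocates bytes within a hole; this rounds
 * to whole lines for brevity. */
static void *chunked_bump_alloc(Block *b, size_t size, size_t *cursor) {
    size_t need = (size + LINE_SIZE - 1) / LINE_SIZE;
    for (size_t i = *cursor; i + need <= BLOCK_LINES; i++) {
        size_t run = 0;
        while (run < need && !b->line_marked[i + run])
            run++;
        if (run == need) {            /* found a big-enough hole */
            *cursor = i + need;
            return &b->data[i * LINE_SIZE];
        }
        i += run;                     /* skip past the marked line */
    }
    return NULL;  /* no hole left; caller moves to another block */
}

The point of the sketch: a block never has to be completely empty to be
recycled, it only needs a big-enough run of unmarked lines, which is
exactly what would relieve the pressure to evacuate for
defragmentation.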
> It is also possible that inter-operation with C creates enough dynamic
> restrictions on relocation (through pinning) that relocation isn't
> that helpful. The pressure to relocate is a function of total heap
> size pressure, which is inversely related to available system DRAM. An
> unfortunate implication of this is that small platforms are more
> likely to use read barriers (with their associated code bloat) than
> large platforms.
>
> The practical benefit of on-the-fly collection (as opposed to
> concurrent collection) seems uncertain. When all is said and done,
> there are only so many hardware threads of execution available, and
> the activity of those hardware threads is going to go back and forth
> between mutating and collecting. The "flip" can be implemented
> voluntarily in the mutator or implicitly in the operating system. The
> cost of doing it implicitly is (a) the need for a fine-grain read
> barrier, (b) loss of mutator control over when collection happens, and
> (c) lost ability to drive collection in finer-grain increments. The
> primary *advantage* of on-the-fly is that it allows collection to
> proceed during long I/O waits.
>
> Once a fine-grain read barrier is required, it is worth considering
> how many of the checks previously performed by write barriers can
> instead be moved to read barriers. It is also worth considering the
> notion of a "read-write barrier": combining the read and write
> barriers when we are reading a reference to an object that we know we
> intend to write. I haven't spent much time on either question; the
> high-order concern was to determine whether a fine-grain read barrier
> was required at all.
>
> Once a fine-grain read barrier is required, you have in some respects
> "paid the piper", and you might as well get all of the advantage out
> of it that you can.
>
> Last, but not least, I note that one of the challenges in deferred
> reference counting is that it makes on-the-fly collection a bit
> tricky. The essence of it is that something analogous to a tri-color
> invariant needs to be maintained on the deferred decrement log, so
> that mutators can go through a "phase agreement" with the collector.
> It could be that this is done using the tri-color invariant itself.
>
> shap
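On the deferred-RC point at the end of the quoted message: a minimal
sketch of what a "phase agreement" on the decrement log might look
like, using epoch-stamped per-mutator buffers. The structure is
entirely hypothetical; it is meant only to show where the
tri-color-like invariant on the log bites:

#include <stdatomic.h>
#include <stddef.h>

typedef struct Obj Obj;

#define LOG_CAP 1024

/* Hypothetical per-mutator deferred-decrement log. Decrements are
 * buffered here instead of being applied eagerly. */
typedef struct {
    Obj     *buf[LOG_CAP];
    size_t   n;
    unsigned epoch;          /* collector phase this buffer belongs to */
} DecLog;

extern atomic_uint collector_epoch;        /* advanced at each phase flip */
void flush_log_to_collector(DecLog *log);  /* hands the buffer over */

/* The mutator side of the agreement: a mutator may not keep appending
 * to a buffer the collector already treats as closed, or decrements
 * would be processed in the wrong phase: the analogue of a tri-color
 * violation, applied to the log rather than the heap. */
void defer_decrement(DecLog *log, Obj *o) {
    unsigned e = atomic_load(&collector_epoch);
    if (log->n == LOG_CAP || log->epoch != e) {
        flush_log_to_collector(log);       /* close out the old phase */
        log->n = 0;
        log->epoch = e;
    }
    log->buf[log->n++] = o;
}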
