OK. It's time for me to step back from worrying about GC design and get back to actual work. I've internalized a bunch of stuff and I've managed to catch up on the last 15 years or so of the literature (almost), and I've gotten a lot from all of you in the process. I'm going to send Tony and Eliot a note arising from this discussion, and try to encourage a more focused discussion of read barriers, forwarding, and concurrent design tradeoffs in the next edition of *The Garbage Collection Handbook*. For now, let me try to run down where I am.
I'm very impressed by the current state of RC-collectors, and I've talked elsewhere about how this approach opens up a bunch of design options and expressiveness options. The key to variable-weight and/or performance-isolated collection is to introduce [possibly dynamic] heap membranes, and keep track of the object references that cross membrane boundaries. Other techniques can do it, but hybrid-RC designs seem to do it without introducing multiplication of mechanism. Especially so when some variation of the immix heap organization is adopted.

We've talked a lot about GC performance, and I've come to feel that we are using the wrong metrics. Safe-language throughput seems to be a mostly solved problem at this point, by which I mean both that (a) the difference relative to explicit storage management is now very small, and (b) any inherent cost of safety is now low enough that I'm willing to pay it, and I think a lot of other people are too.

The question of pause times is a different matter, and I've become convinced that by focusing on pause times we have been framing the question the wrong way. I think we should be focusing on three variables:

- A time interval (T). Moving closer to hard real-time applications may require shorter intervals.
- Mutator *progress* during each time interval (P). Not just "what percentage of time is spent in the mutator" but also "how much real-world computational progress is it making?" Spending 10% more time in the mutator while adding 12% overhead to the mutator isn't winning.
- *Variance* of that progress over that time interval (V).

We would like flexibility in choosing T, possibly at some cost in the rate of progress P, the variance of progress V, or both. Unfortunately our collection needs are also bounded by allocation rate and mutation rate (e.g. through logging), so this isn't a complete characterization.
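To make the T/P/V distinction concrete, here is a minimal sketch of how one might summarize progress measurements: divide execution into intervals of length T, record an application-defined work counter per interval, and report the mean (P) and variance (V). The type and function names are mine, purely illustrative:

```c
/* Sketch: summarizing mutator progress over fixed intervals of length T.
 * "Progress" is an application-defined work counter (e.g. requests
 * served), deliberately NOT "time spent in the mutator", since barrier
 * overhead inflates mutator time without adding progress. */
#include <stddef.h>

typedef struct {
    double P;   /* mean progress per interval */
    double V;   /* variance of progress across intervals */
} progress_summary;

/* samples[i] = work completed during interval i (each of length T). */
progress_summary summarize_progress(const double *samples, size_t n)
{
    progress_summary s = {0.0, 0.0};
    if (n == 0)
        return s;

    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += samples[i];
    s.P = sum / (double)n;

    double ss = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = samples[i] - s.P;
        ss += d * d;
    }
    s.V = ss / (double)n;   /* population variance */
    return s;
}
```

A tunable collector would then trade these off: shrinking T typically costs some P, and a stop-the-world phase shows up as a spike in V even when P looks healthy.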
If we can arrive at a solution that makes these things somewhat tunable, I don't think we should be worrying about the last 2% of throughput performance.

On parallel (as opposed to concurrent) collectors we are now looking at <50ms pause times on small-scale multiprocessors, with 10% or less (typ) of total time spent in storage management. [*] While we incur both code bloat and execution overhead in barrier code, that is made up (relative to conventional allocators) in reduced L1 cache miss rates. The code bloat is *also* made up (usually) by decreases in the overhead information that must be recorded on each allocation.

I've been talking, in the background, to people I know who build production video games and codecs. Faced with the prospect of a 50ms pause, they mostly didn't care. Several of them noted that current context switch and page fault times often exceed this already, though the "all threads" aspect of the pause may have implications they weren't considering.

*It is not obvious that on-the-fly is better for real-time applications, nor is it obvious that on-the-fly is required for real-time applications.*

[*] The exception is applications that bulk allocate and bulk deallocate durable structures, Xalan being a notable example. I believe we all agree that there are cases where the programmer knows what their needs are, and the allocator could cooperate with those needs if a means existed to express them. This won't solve all cases, but it will solve a lot of them.

Almost all of the mutator performance in these discussions hinges on read barriers: do we need them, and if so, do they need to be fine-grained (i.e. inlined with the load instructions) or can they be done with page protections? Here are the answers I have come to (so far):

1. Read barriers of some form are required on any sub-heap where collectors and mutators may operate concurrently and evacuation/relocation may be performed.
This means that it is *extremely* desirable to use thread-local nursery allocation.

2. Fine-grain read barriers are *sometimes* required if the mutator participates cooperatively in the maintenance of the tri-color (or K-color) invariant. It depends on the type of mutator.

3. Fine-grain read barriers are required if heap regions that are subject to recoloring and/or evacuation are "reference hot", because the page fault approach isn't cost effective.

4. Fine-grain read barriers are required in practice if nurseries are not thread-local.

Taken in combination, these conditions mean that read barriers are needed for any on-the-fly collection scheme, and also for any fully concurrent (as opposed to *mostly* concurrent) scheme.

I've noted in discussion that there are ways to emit code in which read barriers can be turned on and off dynamically through low-level dynamic translation. This prospect gives us a late-binding option on the choice of collector integration strategy, which strikes me as elegant. It also means that we can separate many of the GC integration issues from the compiler entirely. We can instead make our choices late, on the basis of how hot the forwarded objects are, what proportion of mutator execution time is overlapped with collector execution, the dynamic frequency and dynamic locality with which the conditions that *trigger* read barriers arise, and our target objectives for P, V, and T.

I have been focusing a lot on relocating collectors, mainly because evacuation of some form seems to be a necessary activity for heap defragmentation. It isn't clear that this is really required. In the immix design, for example, we're already prepared to do chunked bump allocation. It is conceivable that a block could be considered "free" if it is "free enough" without being totally cleared. If that is possible, then the need for fine-grain read barriers goes away.
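For readers who haven't stared at one recently, the fine-grain read barrier under discussion is essentially a Brooks-style indirection: every object carries a forwarding pointer that initially points to itself, and every reference load goes through it, so a concurrent evacuator can redirect mutators to the to-space copy. A minimal sketch (types and field names are mine, not from any particular collector):

```c
/* Brooks-style fine-grain read barrier sketch. Each object carries a
 * forwarding pointer that is self-referential until the object is
 * evacuated; mutator loads always indirect through it. This is the
 * per-load cost ("inlined with the load instructions") that the page
 * protection approach tries to avoid. */

typedef struct object {
    struct object *forward;   /* self, or the to-space copy */
    int payload;
} object;

/* The barrier the compiler would inline before each field access. */
static inline object *read_barrier(object *ref)
{
    return ref->forward;
}

/* What the collector does when evacuating obj into to_copy. A real
 * concurrent collector would need an atomic installation of the
 * forwarding pointer; that is elided here. */
static void evacuate(object *obj, object *to_copy)
{
    to_copy->forward = to_copy;
    to_copy->payload = obj->payload;
    obj->forward = to_copy;   /* mutators now reach the new copy */
}
```

The "turned on and off dynamically" idea amounts to patching or translating away the indirection when no evacuation is in progress; the code shape above is what would be toggled.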
It is also possible that we could size-segregate the blocks in the tenured heap to control fragmentation. It is also possible that inter-operation with C creates enough dynamic restrictions on relocation (through pinning) that relocation isn't that helpful.

The pressure to relocate is a function of total heap size pressure, which is inversely related to available system DRAM. An unfortunate implication of this is that small platforms are more likely to use read barriers (with their associated code bloat) than large platforms.

The practical benefit of on-the-fly collection (as opposed to concurrent collection) seems uncertain. When all is done, there are only so many hardware threads of execution available, and the activity of these hardware threads is going to go back and forth between mutating and collecting. The "flip" can be implemented voluntarily in the mutator or implicitly in the operating system. The cost of doing it implicitly is (a) the need for a fine-grain read barrier, (b) loss of mutator control over when collection happens, and (c) lost ability to drive collection in finer-grain increments. The primary *advantage* of on-the-fly is that it allows collection to proceed during long I/O waits.

Once a fine-grain read barrier is required, it is worth considering how many of the checks previously performed by write barriers can instead be moved to read barriers. It is also worth considering the notion of a "read-write barrier": combining the read and write barriers when we are reading a reference to an object that we know we intend to write. I haven't spent much time on either question; the high-order concern was to determine whether a fine-grain read barrier was required at all. Once a fine-grain read barrier is required, you have in some respects "paid the piper", and you might as well get all of the advantage out of it that you can.
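As a rough illustration of the "read-write barrier" idea above: when the compiler can see that a loaded reference will be stored through, the forwarding check (read side) and the dirty-logging (write side) can be fused into one inlined operation instead of two. Everything here is a sketch of my own devising; the log and flag stand in for whatever card table or sequential-store buffer a real collector would use:

```c
/* Sketch of a fused "read-write barrier": one operation that performs
 * the Brooks-style forwarding check AND logs the object as mutated,
 * for loads the compiler knows are followed by a store. */
#include <stddef.h>

typedef struct object {
    struct object *forward;   /* Brooks-style forwarding pointer */
    int dirty;                /* stands in for a card/log mark */
    int field;
} object;

#define LOG_CAP 64
static object *mutation_log[LOG_CAP];   /* stand-in for an SSB */
static size_t log_len = 0;

/* Fused barrier: follow the forwarding pointer (read half) and record
 * the object for the collector (write half) in a single step. */
static inline object *load_for_store(object *ref)
{
    object *fwd = ref->forward;              /* read-barrier half */
    if (!fwd->dirty && log_len < LOG_CAP) {  /* write-barrier half */
        fwd->dirty = 1;
        mutation_log[log_len++] = fwd;
    }
    return fwd;
}
```

The payoff is that the second barrier's memory traffic largely disappears: the object header is already hot in L1 from the forwarding check when the dirty flag is tested.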
Last, but not least, I note that one of the challenges in deferred reference counting is that it makes on-the-fly collection a bit tricky. The essence of it is that something analogous to a tri-color invariant needs to be maintained on the deferred decrement log, so that mutators can go through a "phase agreement" with the collector. It could be that this is done using the tri-color invariant itself.

shap
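P.S. To pin down what the deferred decrement log in the last paragraph looks like: the mutator appends decrements rather than applying them, and the collector drains the log at an agreed point. The epoch counter below is my stand-in for the "phase agreement" (which entries belong to which collection phase); all names are illustrative and the sketch is single-threaded:

```c
/* Sketch of a deferred-decrement log for deferred reference counting.
 * Mutator side appends; collector side drains at a safepoint. The
 * epoch tag stands in for the phase agreement between mutators and
 * the collector (a real on-the-fly design needs per-thread logs and
 * a handshake, both elided here). */
#include <stddef.h>

typedef struct object {
    int rc;
} object;

#define DEC_CAP 64
typedef struct {
    object  *obj[DEC_CAP];
    size_t   len;
    unsigned epoch;   /* which collection phase the entries belong to */
} dec_log;

/* Mutator side: defer the decrement instead of applying it. */
static int defer_decrement(dec_log *log, object *o)
{
    if (log->len == DEC_CAP)
        return 0;                 /* full: caller must hand off first */
    log->obj[log->len++] = o;
    return 1;
}

/* Collector side: apply the deferred decrements for one epoch. */
static void drain_decrements(dec_log *log)
{
    for (size_t i = 0; i < log->len; i++)
        log->obj[i]->rc--;
    log->len = 0;
    log->epoch++;     /* mutators can observe the phase change */
}
```

The tricky part the text alludes to is exactly the epoch boundary: a decrement logged "black" in one phase must not be drained as if it were logged in the next, which is why something tri-color-like has to govern the log itself.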
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
