OK. This is another design write-through, this time concerning options for object forwarding in a concurrent collector.
I've been thinking a lot about forwarding, and it's really bugging me. Object relocation (for whatever reason) is really the only thing making it hard to run mutators and collectors concurrently. If we can work out how to do that with tolerable overhead, a lot of things become easier. I keep thinking that there must be a trick that I'm not seeing, but I don't think there is. Benjamin Zorn's dissertation is pretty comprehensive, though the lazy read barrier idea has some appeal. Heh. I'd forgotten about his work (approaching 50 - the brain rots), and I should probably just give Benjamin a call.

The options are pretty clear. We can add a read barrier to most object reference loads, or we can use a page fault and require that every object reference load instruction be a safe point, or we can use dynamic translation. Or we can, of course, use some combination. But there is some flexibility about *when* to add the read barriers, and I haven't seen that part discussed in any of the literature. Let me run through this.

I've been considering the problem of background evacuation in old space. The techniques I'm about to describe will certainly *work* for the nursery, but their overhead seems pretty high.

*Issues*

The issues here are:

1. How frequently are we relocating? That is: what percentage of the time are mutators *potentially* hazarded by object relocation "faults"? It would be a shame to carry around the code size and cost of a barrier mechanism if we are only using it 0.1% of the time. (Term: *barriered execution ratio*.) Is relocation a separate collector phase, or does it happen continuously?

2. What is the dynamic *likelihood* of actually referencing an object that is being relocated? Assuming we can come up with a workable mechanism to allow the mutator to proceed after a read barrier "fault", we become progressively more willing to shift the cost of the read barrier to a higher-cost mechanism (e.g. a page fault) that is dynamically rare. If we can manage it, we'd like to avoid evacuating hot objects. (Term: *hazarded reference ratio*.)

3. I suspect that we *really* want to be clearing a known address range whose size is some power of two, where *all* objects in that region are being evacuated. It gives a broader range of read barrier encodings. Also, if we have to check the target of the reference for a "forwarded" bit, we end up polluting the cache with a line from an object that isn't actually there. The power-of-two requirement comes from a desire to reduce the number of instructions in the read barrier, and also from the ability to use the paging mechanism to assist us (at least on some machines).

Have I missed any big issues?

*Read Barriers*

If the read barrier is implemented using code that is permanently emitted in the mutator, the simplest read barrier implementations turn object reference loads into three instructions:

    load            (object reference value from memory)
    test            (check for forwarding)
    jump-not-taken  (recovery path)

The barrier fence value almost certainly ends up living in a register, because object reference loads are frequent. This means that there are two costs to the mechanism:

- A code bloat cost of about 16%. I'm assuming that 8% of instructions are object reference loads. Could be a little more or a little less.

- Effective loss of a hardware register. The impact of this on modern machines isn't that big. On older Pentiums it would suck.

The register loss can be eliminated if we are prepared to rewrite the code on the fly (e.g. by patching a barrier value), but if barrier hazards are frequently in effect we'll end up doing an awful lot of rewriting. This would be an example of a problem for which dynamic binary translation is well suited.
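To make that concrete, here is a minimal C sketch of the kind of test I mean, assuming the power-of-two region layout from issue (3). The names hazard_base, hazard_mask, and resolve_forward are placeholders; emitted code would keep the fence value in a reserved register rather than in memory:

    #include <stdint.h>

    /* Placeholder globals: base and mask of the power-of-two region
       currently being evacuated. In emitted code the fence value would
       live in a dedicated register. */
    extern uintptr_t hazard_base;
    extern uintptr_t hazard_mask;

    /* Placeholder recovery path: chase the forwarding pointer and
       repair the slot so we do not trip on it again. */
    extern void *resolve_forward(void **slot, void *ref);

    static inline void *load_ref(void **slot)
    {
        void *ref = *slot;                                  /* load */
        if (((uintptr_t)ref & hazard_mask) == hazard_base)  /* test */
            ref = resolve_forward(slot, ref);               /* rare recovery path */
        return ref;
    }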
Note that on out-of-order machines the cost of this is *just* code bloat. The run-time cost of the extra instructions will happily disappear into the out-of-order retirement queues. On single-issue machines (notably older ARM machines and other current embedded cores), the cost of this sequence is high.

Note also that read barriers do not need to occur on every object reference load. If we know (statically) that we are going to read a reference from the next object down, we can (at static compile time) direct the mutator to a recovery path that does a forwarding sweep on the target object. This requires some rendezvous assumptions between mutator and collector, but nothing dramatic and nothing that requires a global pause for synchronization. The point is that some read barriers serve as guards that make other read barriers unnecessary, which reduces the total code and performance impact. This optimization can be combined with the dynamic translation optimization below.

*Page Faults*

The attractions of the page fault approach are:

- Runtime cost is incurred only by hazarded references that are dynamically visited.
- Multiple blocks can be evacuated simultaneously.
- No code bloat.

The disadvantage is that on most operating systems the cost of the page fault transition is *so* high as to overwhelm these attractions. Still, if the hazarded reference ratio is low enough, a barrier using page faults is attractive.

Note that I'm actually *not* concerned about the cost of mapping operations in the OS here, but this is because I am thinking in terms of low-and-slow background evacuation. I'm also not concerned about mapping costs in the nursery, because the nursery is thread-local, and I tend to assume that we will make the mutator scan its own nursery. In that case we end up with a per-thread pause time, but the use of a cooperative scanning method means that we don't need to page fault.

*Naive Strategy*

The page fault is incurred on the first reference into the object being relocated. The object reference must therefore be in a register. Update the register to point to the new object location and let the mutator continue. As a special case, we *require* that at least one read occur between the load of a reference field and a copy of that reference back to the heap. This ensures that forwarded pointers only get copied in translated form. The performance of this strategy, to put matters delicately, *sucks*.

*Better Strategy*

Use the page fault as a detection mechanism only. Do absolutely nothing in the page fault handler itself, but cause the mutator to resume into a dynamic binary translator. The translator inserts read barriers on the fly until the mutator reaches a suitable point, after which it resumes normal-mode execution. We can optionally discontinue translation whenever the reference count on a forwarding object goes to zero. It is correct to resume normal execution early; the worst we will do is fault again.

The simplest way to implement this is to keep a side table of marker bits for target instructions. Any hardware load instruction that loads an object reference has a bit set in the type tag table. In effect, we are using the side table to identify object reference loads vs. data loads. This lets the dynamic translator know which loads require read barriers and which do not.
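A minimal sketch of that side table, assuming one marker bit per code byte (the granularity is an assumption, and code_base and ref_load_bits are placeholder names). The translator consults this when it encounters a load:

    #include <stdint.h>

    /* Placeholder side table: one bit per code byte, set when the
       instruction starting at that byte is an object reference load. */
    extern const uint8_t ref_load_bits[];
    extern const uintptr_t code_base;

    /* Consulted by the translator: does the load at 'pc' need a read
       barrier inserted, or is it a plain data load to pass through? */
    static inline int is_ref_load(uintptr_t pc)
    {
        uintptr_t off = pc - code_base;
        return (ref_load_bits[off >> 3] >> (off & 7)) & 1;
    }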
Note that if the read barrier handler does "tier one forwarding", we can significantly reduce the number of barriers inserted into the mutator, similar to the proposal above. That is: whenever we update a forwarded pointer, we scan the reference fields of the *target* object and update those proactively and cooperatively. As I have noted elsewhere, this allows us to exploit the "known live" relationships to reduce the number of read barriers that must be emitted in the mutator.

*Dynamic Binary Translation*

This is perhaps the simplest option to explain, and in light of the page fault latency it may well be the best approach: just do dynamic binary translation all the time. Please note that I do *not* mean JIT here, so let me explain.

A JIT system generally translates from a byte code form to native instructions. That's a fine solution, and if you can do it fast enough then you don't need dynamic binary translation at all. The problem with a JIT system, from our perspective, is that it is relatively expensive. In particular, the stack-oriented nature of the byte code instruction sets means that the JIT system is obliged to do register allocation, and *that* is expensive.

When I refer here to dynamic binary translation, I'm referring to something more along the lines of the HDTrans DBT, which is a very low-level scheme. In HDTrans, most input instructions are "translated" by transparent pass-through, and the main problem is avoiding anything that might damage the condition codes. In this case, we can do an exceptionally efficient job, because only the object reference load instructions need to be translated. Our experience with HDTrans suggests that the trace linearization side effect of translation generally offsets the cost of performing the translation. This doesn't remove the overhead of the read barrier, but it does mean that using dynamic translation in order to *insert* the read barrier doesn't add a net runtime cost in practice.

So the net of this option is that the collector says "I'm planning to start forwarding objects". As the mutators arrive at safe points, they flush their DBT cache and start doing DBT with read barriers. Or alternatively they maintain two DBT caches. Each mutator in turn signals back that it has switched modes. When all mutators have switched modes the collectors can start forwarding objects on the fly.
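Roughly, that handshake might look like the following in C11 atomics. This is a sketch only, assuming a simple counter-based rendezvous; flush_dbt_cache and num_mutators are placeholders:

    #include <stdatomic.h>
    #include <stdbool.h>

    extern int num_mutators;
    extern void flush_dbt_cache(void);       /* placeholder */

    static atomic_bool barriers_requested;   /* set by the collector */
    static atomic_int  mutators_switched;    /* counts acknowledgments */
    static _Thread_local bool switched;      /* per-mutator mode flag */

    /* Mutator side: polled at each safe point. */
    void safepoint_poll(void)
    {
        if (atomic_load(&barriers_requested) && !switched) {
            flush_dbt_cache();  /* or: swap to the second, barriered cache */
            switched = true;
            atomic_fetch_add(&mutators_switched, 1);
        }
    }

    /* Collector side: announce, then wait for every mutator to switch. */
    void begin_forwarding(void)
    {
        atomic_store(&barriers_requested, true);
        while (atomic_load(&mutators_switched) < num_mutators)
            ;  /* spin; a real implementation would block */
        /* now safe to start forwarding objects on the fly */
    }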
*Optimizations and Choices*

All of the fast strategies here rely on low-level dynamic translation. The main purpose of the page fault mechanism, if you *have* a binary translator, is to tell you when the translator needs to be triggered. Whether that's a good tool from a performance perspective is something that needs measurement, but my expectation is that use of page faults *is* a good plan. Here is why:

Old-space objects basically don't die. When they do, they tend to die in well-localized clusters (i.e. a bunch of objects in a single block die all at once). In the typical case, the resulting recycled block remains 90% utilized, and the resulting free space is likely to consist of a small number of contiguous blocks. Unless we are under extreme pressure for free blocks, such a block is more useful as a *target* of coalescing than as a block to be coalesced. Blocks that are good candidates for evacuation have a smaller number of live lines in them and/or a higher degree of fragmentation.

It isn't just liveness patterns that follow allocation order. Reference patterns also tend to follow allocation order. Basically, if a block is old enough to have a lot of free lines, it is probably not a "hot" block. If it is not hot, then forcing the mutator to incur a single page fault in order to force a DBT mode switch on first access may not be excessively expensive.

On systems that provide hardware virtualization support, it is often possible to use the two-layer translation scheme to gain direct access to page reference information, and also to the hardware mapping structures. Both of these facilitate a page fault handling and remapping mechanism that can be much faster than (say) Linux or Windows.

There are a number of ways to do the mutator end of the work more incrementally, most notably by patching return addresses and doing the work in the mutator one frame at a time. With DBT I'm not even sure that's necessary, though.

One thing I like about all of this is that there is a range of viable options here that do not rely on particular hardware features.

OK. Done with my dump on forwarding. Need to review Zorn's dissertation to see what I missed on this.

shap
