On Wed, Nov 20, 2013 at 1:27 AM, Jonathan S. Shapiro <[email protected]>wrote:
> On Mon, Nov 18, 2013 at 7:32 PM, Ben Kloosterman <[email protected]>wrote: > >> On Mon, Nov 18, 2013 at 11:41 PM, Jonathan S. Shapiro >> <[email protected]>wrote: >> >>> >>> Right. But if you know you are going to allocate an 8 word object >>> followed by a 12 word object, you can just do a single 24 word (allowing >>> for headers) allocation. Given the inline sequence you sent out, a >>> conventional compiler would do this optimization more or less >>> automatically, except that the procedure calls to the long path are an >>> impediment. >>> >> >> In Jikes this is external C code not likeley to be inlined .. but that >> shouldnt stop another runtime. >> > > Wait. Are you saying that the *fast path* isn't inlined? > The new obj code is static C which calls the java allocator , the java is Jit-ted .. So im not sure but i doubt it ( since the static C code would need to inline the JIT code) .. but note Jikes goal is not great performance it uses the latest algorithms but is a almost pure java platform so lags the Oracle JRE. For GC research showing relative performance on the same platform is sufficient. I think it goes something like this C Code if size > Large_SIze allocator ==1 else allocator ==2 Then call the code below which then calls the alloc method. > > Combining allocations requires that the front end knows something about > the allocator code, so that it knows the call to the slow path (which is an > intrinsic call) can be combined. The ability to do this combination is one > of the big advantages of a copying nursery. > > Hmm. Come to that, I'm not thinking about this right, and the sequence you > sent is wrong. That is: I'm confident they are using the sequence as you > describe it, but that's not the one we want in mainline code. > > The sequence you sent is the one that would be used for the object > relocator. That's the one that gets used to allocate storage when > evacuating a block or copying things out of the nursery. The reason I'm > sure of this is the test > > if (bytes > BYTES_IN_LINE) { > > which we wouldn't be doing in a nursery allocation (because the size there > is statically known). > Pretty sure this is incorrect, the copy use the same alloc code , Immix has no Nursery at all , RC-Immix has a bunch of empty blocks called the reserved copy space , when it copies it reallocates a new object . ( eg young allocator is of type RCImmixAllocator , copy alocator is of type RCImmixAllocator) so its just a form of compaction i would not call it a Nursery either. These filled blocks then are no longer part of the copy space ( which is just empty blocks to make sure there is room for a copy). The line above is already in the slow path so it doesnt matter much ( eg limit exceed means you need to find a new block or hole) , the bytes is passed in by the C run time this has to be a variable . > > But if we use a copying nursery (i.e. a hybrid approach), then the nursery > itself is *always* a free block (perhaps bigger than 32K, but still a > free block). > Correct and note with such a copying Nursery i see no reason to move objects or defragment ..the Nursery will remove most of the fragmentation and the benefit is not worth the cost since the heap fills most holes. This means for a highly concurrent GC you only need to be concerned with moves from the copying Nursery. There is a complication however right now there is no ref counting at all in RC-Immix if there is sufficient space . Ref counting only occurs durring collections , this sync is important. eg Allocate from free Collect ( ref count and create holes) then repeat Allocate from free , Allocate to holes , Collect.. Im not sure if there are any holes created by processing a modbuff/newbuff and ref counts going to 0 outside of the collect phase. The complication is freeing these objects cant use ref counts else it would need to be synced with the ref count hence it will just do a GC Mark and you probably dont want these sync'd with the ref count as Nursery sweeps are too frequent. Note if these counts are only done durring a GC then IMHO its not really a ref counting GC but merely a standard GC where the ref counting replaces the mark and card table. > When allocating from a free block, there is no internal fragmentation, and > if the allocation fails you're going to have to do a full nursery > collection (and coalesce or copy) in any case. Under those conditions, the > bump allocation reduces to (borrowing from your code, possibly incorrectly): > > do { > > Address start = alignAllocationNoFill(cursor, align, offset); > Address end = start.plus(bytes); > if (end.LE(limit)) break; > > GC_nursery(); > } > cursor = end; > > Correct .. you changed the sign on the LE which got me ( a horrible construct but they can compare address like that and not make it a numeric type) > We can very easily impose a rule in the nursery that maintains /start/ at an > two word boundary. In that case, for typically aligned allocations, it > reduces to: > > do { > Address start = cursor + 2 * sizeof(uintptr_t); /* GC hdr, vtable */ > Address end = start.plus(CompilerRoundsUp(bytes, 8)); > > if (end.GT(limit) GC_nursery(); > } > cursor = end; > > Agree. Was thinking the same thing , there is quite a lot of allignment code which i wandered about but its because Jikes code is uniform ( there is no platform specific conditionals) so when running on a 386 it would use allign 4. > And if we institute that alignment requirement on the cursor, then the > probe hoists to the top of the procedure, and we get: > > /* Pre-probe the available nursery space */ > end = cursor + SpaceWeNeed; if (end.GT(limit)) GC_nursery(); > > /* At this point, we have permission to allocate up to SpaceWeNeed > without checking. */ > Address start = cursor + 2 * sizeof(uintptr_t); > cursor += StaticObjectSizeAfterRoundup; > > Whether you can do static object sizes depend on the integration between the runtime object creation, allocator , compiler/jit and app. If it was all a same language blob then sure. > And, of course, a lot of this will peephole out if the compiler is aggressive > about it. This approach won't work for loops, but it still helps. > > aggressive loop unrolling will help though. > Note here that I'm *only* talking about nursery allocations! This would > hypothetically work in a recycled block, but it would leave big holes in > practice. > > A Hybrid like Gen-Immix Nursery would not allocate to recycled blocks but the sweep would. allocating to recycled blocks would complicate the mark/sweep and references from old to new space. No point going there... > Regarding the stack, there is another issue that I hadn't considered: if >>>>> we use an immix block of some form for the stack, we can't rely on the >>>>> collector to zero it in the background. We would have needed to zero >>>>> reference slots anyway, so it's not like this is a new cost, but the >>>>> stack-of-objects approach may change how eagerly we need to do that. >>>>> Actually, I don't think it does, but it needs looking at. >>>>> >>>> >>>> I take it we cant rely on the collector to do this because collector >>>> threads have immix block for a stack . >>>> >>> >>> No no. We'll get an initially zero block just fine. The problem is that >>> the stack pointer is going up and down within the block. As procedures >>> return, they leave dirty lines "below" the stack. This happens *way* too >>> rapidly for background zeroing to work. >>> >> >> I was thinking just the intial clear when the block is received as it is >> received at the moment dirty.. >> > > Zeroing it at allocation rather than release simply shifts the work from > the GC to the mutator. My guess is it doesn't matter from a performance > perspective. > Agree at the moment it doesnt matter but if the Collector was concurrent or free objects are discovered through counts between collects then it would. Ben
_______________________________________________ bitc-dev mailing list [email protected] http://www.coyotos.org/mailman/listinfo/bitc-dev
