Re: [bitc-dev] Non copying Rc-Immix was Putting the stack on an RC-Immix heap

Ben Kloosterman Tue, 19 Nov 2013 20:56:57 -0800

On Wed, Nov 20, 2013 at 1:27 AM, Jonathan S. Shapiro <[email protected]>wrote:

> On Mon, Nov 18, 2013 at 7:32 PM, Ben Kloosterman <[email protected]>wrote:
>
>> On Mon, Nov 18, 2013 at 11:41 PM, Jonathan S. Shapiro 
>> <[email protected]>wrote:
>>
>>>
>>> Right. But if you know you are going to allocate an 8 word object
>>> followed by a 12 word object, you can just do a single 24 word (allowing
>>> for headers) allocation. Given the inline sequence you sent out, a
>>> conventional compiler would do this optimization more or less
>>> automatically, except that the procedure calls to the long path are an
>>> impediment.
>>>
>>
>> In Jikes this is external C code not likeley to be inlined  .. but that
>> shouldnt stop another runtime.
>>
>
> Wait. Are you saying that the *fast path* isn't inlined?
>

The new obj code is static C which calls the java allocator , the java is
Jit-ted .. So im not sure but i doubt it  ( since the static C code would
need to inline the JIT code)  .. but note Jikes goal is not great
performance it uses the latest algorithms but is a almost pure java
platform so lags the Oracle JRE. For GC research showing relative
performance on the same platform is sufficient.

I think it goes something like this
C Code
if size > Large_SIze
allocator ==1
else
allocator ==2

Then call the code below which then calls the alloc method.

>
> Combining allocations requires that the front end knows something about
> the allocator code, so that it knows the call to the slow path (which is an
> intrinsic call) can be combined. The ability to do this combination is one
> of the big advantages of a copying nursery.
>
> Hmm. Come to that, I'm not thinking about this right, and the sequence you
> sent is wrong. That is: I'm confident they are using the sequence as you
> describe it, but that's not the one we want in mainline code.
>
> The sequence you sent is the one that would be used for the object
> relocator. That's the one that gets used to allocate storage when
> evacuating a block or copying things out of the nursery. The reason I'm
> sure of this is the test
>
>      if (bytes > BYTES_IN_LINE) {
>
> which we wouldn't be doing in a nursery allocation (because the size there
> is statically known).
>

Pretty sure this is incorrect, the copy use the same alloc code , Immix has
no Nursery at all , RC-Immix has a bunch of empty blocks called the
reserved copy space , when it copies it reallocates a new object . ( eg
young allocator is of type RCImmixAllocator  , copy alocator is of type
RCImmixAllocator) so its just a form of compaction i would not call it a
Nursery either. These filled blocks then are no longer part of the copy
space ( which is just empty blocks  to make sure there is room for a copy).

The line above is already in the slow path so it doesnt matter much ( eg
limit exceed means you need to find a new block or hole) , the bytes is
passed in by the C run time this has to be a variable .

>
> But if we use a copying nursery (i.e. a hybrid approach), then the nursery
> itself is *always* a free block (perhaps bigger than 32K, but still a
> free block).
>

Correct and note with such a copying Nursery i see no reason to move
objects or defragment ..the Nursery will remove most of the fragmentation
and the benefit is not worth the cost since the heap fills most holes. This
means for a highly concurrent GC you only need to be concerned with moves
from the copying Nursery.

There is a complication however right now there is no ref counting at all
in RC-Immix if there is sufficient space . Ref counting only occurs durring
collections , this sync is important.  eg Allocate from free  Collect  (
ref count and create holes)   then repeat Allocate from  free , Allocate to
 holes  , Collect..  Im not sure if there are any holes created by
processing a modbuff/newbuff and ref counts going to 0 outside of the
collect phase.  The complication is freeing these objects cant use ref
counts else it would need to be synced with the ref count hence it will
just do a GC Mark  and you probably dont want these sync'd with the ref
count as Nursery sweeps are too frequent.

Note if these counts are only done durring a GC then IMHO its not really a
ref counting GC  but merely a standard GC where the ref counting replaces
the mark and card table.

> When allocating from a free block, there is no internal fragmentation, and
> if the allocation fails you're going to have to do a full nursery
> collection (and coalesce or copy) in any case. Under those conditions, the
> bump allocation reduces to (borrowing from your code, possibly incorrectly):
>
> do {
>
>         Address start = alignAllocationNoFill(cursor, align, offset);
>         Address end = start.plus(bytes);
>         if (end.LE(limit)) break;
>
>         GC_nursery();
>     }
>     cursor = end;
>
> Correct ..  you changed the sign on the LE which got me ( a horrible
construct but  they can compare address like that and not make it a numeric
type)

> We can very easily impose a rule in the nursery that maintains /start/ at an 
> two word boundary. In that case, for typically aligned allocations, it 
> reduces to:
>
>     do {
>         Address start = cursor + 2 * sizeof(uintptr_t); /* GC hdr, vtable */
>         Address end = start.plus(CompilerRoundsUp(bytes, 8));
>
>         if (end.GT(limit) GC_nursery();
>     }
>     cursor = end;
>
> Agree. Was thinking the same thing , there is quite a lot of allignment
code which i wandered about but its because Jikes code is uniform ( there
is no platform specific conditionals) so when running on a 386 it would use
allign 4.

> And if we institute that alignment requirement on the cursor, then the
> probe hoists to the top of the procedure, and we get:
>
>     /* Pre-probe the available nursery space */
>     end = cursor + SpaceWeNeed;    if (end.GT(limit)) GC_nursery();
>
>     /* At this point, we have permission to allocate up to SpaceWeNeed 
> without checking. */
>     Address start = cursor + 2 * sizeof(uintptr_t);
>     cursor += StaticObjectSizeAfterRoundup;
>
> Whether you can do static object sizes depend on the integration between
the runtime object creation, allocator , compiler/jit and app.  If it was
all a same language blob then sure.

> And, of course, a lot of this will peephole out if the compiler is aggressive 
> about it. This approach won't work for loops, but it still helps.
>
> aggressive loop unrolling will help though.

> Note here that I'm *only* talking about nursery allocations! This would 
> hypothetically work in a recycled block, but it would leave big holes in 
> practice.
>
> A Hybrid like Gen-Immix Nursery would not allocate to recycled blocks but
the sweep would.  allocating to recycled blocks would complicate the
mark/sweep and references from old to new space. No point going there...

>  Regarding the stack, there is another issue that I hadn't considered: if
>>>>> we use an immix block of some form for the stack, we can't rely on the
>>>>> collector to zero it in the background. We would have needed to zero
>>>>> reference slots anyway, so it's not like this is a new cost, but the
>>>>> stack-of-objects approach may change how eagerly we need to do that.
>>>>> Actually, I don't think it does, but it needs looking at.
>>>>>
>>>>
>>>> I take it we cant rely on the collector to do this because collector
>>>> threads  have immix block for a stack .
>>>>
>>>
>>> No no. We'll get an initially zero block just fine. The problem is that
>>> the stack pointer is going up and down within the block. As procedures
>>> return, they leave dirty lines "below" the stack. This happens *way* too
>>> rapidly for background zeroing to work.
>>>
>>
>> I was thinking just the intial clear when the block is received as it is
>> received at the moment dirty..
>>
>
> Zeroing it at allocation rather than release simply shifts the work from
> the GC to the mutator. My guess is it doesn't matter from a performance
> perspective.
>

Agree at the moment it doesnt matter but if the Collector was concurrent or
free objects are discovered through counts between collects then it would.

Ben

_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev

Re: [bitc-dev] Non copying Rc-Immix was Putting the stack on an RC-Immix heap

Reply via email to