On Sun, Oct 20, 2013 at 4:53 AM, Bennie Kloosteman <[email protected]> wrote:

> One thing worth noting is that a lot of papers use per-thread allocators,
> which give better performance in situations with many concurrently
> allocating threads. Hence they use small nurseries that fit in the core's
> L2 cache. This we know.
>

Yup. We've been talking a lot about thread-local allocators. The tricky bit
that I haven't seen discussed in papers is making sure that objects in one
thread-local nursery don't end up with pointers into another. That
invariant is easy to maintain if you have a read barrier. Not so much
without one.


> However, with a hybrid model the cost of using the main heap is much
> greater than with a generational GC, due to the introduction of ref
> counting (especially on short-lived objects, many of which might never
> have needed counting) and of cycles that would not have escaped a larger
> nursery.
>

You may be right, but this is one of those tempting statements that need to
be confirmed by measurement. Intuition on this sort of thing can be
deceptive. Adolescent objects (objects newly promoted to the main heap) are
recently written, and therefore have their M *and* N bits set (RC-immix
didn't address this optimization). As such, writes to these objects are
unimpeded, and the "integrate" operation on the object is deferred. This
mitigates both the ref counting cost and the likelihood of cycle creation.
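
To make that concrete, here is a minimal sketch of what such a deferred
write barrier might look like. The header layout, bit names, and the
rc_log_write helper are all illustrative assumptions on my part, not
RC-immix's actual encoding: the point is only that when both bits are set,
the write takes the unimpeded fast path and no RC work is logged.

```c
#include <stdint.h>

enum { MARK_BIT = 1u << 0, NEW_BIT = 1u << 1 };

/* Hypothetical object layout: one header word holding the M and N
 * bits, plus a single pointer field for illustration. */
typedef struct object {
    uint32_t header;
    struct object *field;
} object_t;

/* Hypothetical slow path: record the overwritten and new referents
 * so a later pass can apply the RC increments/decrements. */
void rc_log_write(object_t *src, object_t *old_ref, object_t *new_ref);

static void write_barrier(object_t *src, object_t **slot, object_t *ref)
{
    /* Writes to objects with both M and N set are unimpeded;
     * their integration into ref counting is deferred until the
     * next collection cycle. */
    if ((src->header & (MARK_BIT | NEW_BIT)) != (MARK_BIT | NEW_BIT))
        rc_log_write(src, *slot, ref);
    *slot = ref;
}
```

An adolescent object (both bits set) pays nothing on the write; only
mature, integrated objects hit the logging slow path.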

Not saying you're wrong. What I'm mainly saying is that the RC-immix
encodings can be extended to buy you a kind of "shared nursery" in the
general heap that survives until the next major cycle.

> The URC paper found the best performance at very large nurseries (4 MB was
> pretty big for a nursery 10-12 years ago), but such big per-thread
> nurseries, scaled to today, use a lot of memory. Hence an interlocked
> single nursery may be more memory efficient and avoid synch issues.
>

Nope. There is no way that an interlocked nursery is more efficient, though
I agree that large nursery sizes are a problem. But as a middle position,
you might do TLA over 32KB nursery chunks, and then use interlocked
operations to get new chunks. There are lots of options here for amortizing
costs.
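
A rough sketch of that middle position (the names, the 32KB chunk size, and
the 8-byte alignment are my illustrative choices, not a real collector's):
each thread bump-allocates within its private chunk with no synchronization,
and pays for one atomic fetch-add only when the chunk is exhausted.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define CHUNK_SIZE (32 * 1024)  /* per-thread nursery chunk */

/* Shared bump pointer over the nursery region; touched only when a
 * thread needs a fresh chunk. */
static _Atomic(uintptr_t) nursery_cursor;
static uintptr_t nursery_end;

static void nursery_init(void *base, size_t size)
{
    atomic_store(&nursery_cursor, (uintptr_t)base);
    nursery_end = (uintptr_t)base + size;
}

/* Per-thread allocation window: no atomics on this path. */
static _Thread_local struct {
    uintptr_t cursor;
    uintptr_t limit;
} tla;

/* Slow path: claim a fresh chunk with a single interlocked op. */
static int tla_refill(void)
{
    uintptr_t base = atomic_fetch_add(&nursery_cursor, CHUNK_SIZE);
    if (base + CHUNK_SIZE > nursery_end)
        return 0;               /* nursery exhausted: time for a minor GC */
    tla.cursor = base;
    tla.limit = base + CHUNK_SIZE;
    return 1;
}

/* Fast path: plain bump allocation (assumes bytes <= CHUNK_SIZE). */
static void *tla_alloc(size_t bytes)
{
    bytes = (bytes + 7) & ~(size_t)7;   /* round up to 8-byte multiple */
    if (tla.cursor + bytes > tla.limit && !tla_refill())
        return NULL;
    void *p = (void *)tla.cursor;
    tla.cursor += bytes;
    return p;
}
```

The interlocked cost is amortized over every allocation in the chunk, which
is exactly the knob you would tune against memory footprint.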

I should perhaps explain that I'm concerned to find an approach that will
work on weakly ordered memory systems. That makes the cost of an interlocked
operation even higher than it is on Intel processors, where it is already
astonishingly high.


shap
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
