On Sun, Oct 20, 2013 at 4:53 AM, Bennie Kloosteman <[email protected]> wrote:
> One thing worth noting is that a lot of papers use per-thread allocators,
> which give better performance in situations with lots of concurrent
> allocating threads. Hence they use small nurseries that fit in the core's
> L2. This we know.

Yup. We've been talking a lot about thread-local allocators. The tricky bit
that I haven't seen discussed in papers is making sure that pointers from one
thread-local nursery don't end up pointing into another. It's easy to
maintain that invariant if you have a read barrier. Not so much without one.

> However, with a hybrid model the cost of using the main heap is much
> greater than with a generational GC, due to introducing ref counting
> (especially ref counting on short-lived objects, many of which may not have
> been counted) and allowing cycles which might not have escaped a larger
> nursery.

You may be right, but this is one of those tempting statements that needs to
be confirmed by measurement. Intuitions on this sort of thing can be
deceptive.

Adolescent objects (objects newly promoted to the main heap) are recently
written, and therefore have their M *and* N bits set (the RC-immix paper
didn't address this optimization). As such, writes to these objects are
unimpeded, and the "integrate" operation on the object is deferred. This
mitigates both the ref-counting cost and the likelihood of cycle creation.

Not saying you're wrong. What I'm mainly saying is that the RC-immix
encodings can be extended to buy you a kind of "shared nursery" in the
general heap that survives until the next major cycle.

> The URC paper found best performance at very large nurseries (4 MB was
> pretty big for a nursery 10-12 years ago), but such big nurseries per
> thread, scaled to today, use a lot of memory. Hence an interlocked single
> nursery may be more memory efficient and avoid synch issues.

Nope. There is no way that an interlocked nursery is more efficient, though
I agree that large nursery sizes are a problem.
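To make the deferral concrete, here is a minimal sketch (in C, with made-up names: obj_t, HDR_NEW, rc_write, integrate are illustrative, not anything from RC-immix or BitC) of how a "new" header bit lets writes to adolescent objects skip the RC barrier until a deferred integrate pass:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative object with one pointer field.  While HDR_NEW is set,
 * the object's outgoing references are not yet reflected in any
 * reference counts; integrate() accounts for them exactly once. */
typedef struct obj {
    uint32_t    rc;     /* reference count, valid once integrated */
    uint32_t    flags;  /* header bits */
    struct obj *field;  /* single pointer field, for illustration */
} obj_t;

#define HDR_NEW 0x1u    /* set on allocation/promotion */

/* Write barrier: RC bookkeeping is skipped while the holder is "new",
 * so writes to adolescent objects are unimpeded. */
static void rc_write(obj_t *holder, obj_t *newref)
{
    if (holder->flags & HDR_NEW) {
        holder->field = newref;               /* plain store, no RC work */
        return;
    }
    if (holder->field) holder->field->rc--;   /* drop old referent */
    if (newref) newref->rc++;                 /* count new referent */
    holder->field = newref;
}

/* Deferred integrate, run at the next RC cycle: count the object's
 * current outgoing references and clear the "new" bit. */
static void integrate(obj_t *o)
{
    if (!(o->flags & HDR_NEW)) return;
    o->flags &= ~HDR_NEW;
    if (o->field) o->field->rc++;
}
```

Short-lived adolescent objects that die before the next cycle thus never pay any counting cost at all, which is the mitigation described above.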
But as a middle position, you might do TLA over 32 KB nursery chunks, and
then use interlocked operations to obtain new chunks. There are lots of
options here for amortizing costs.

I should perhaps explain that I'm concerned with having an approach that
will work on weakly ordered memories. That makes the cost of an interlock
even higher than it is on Intel processors, where it is already
astonishingly high.

shap
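The middle position above might look roughly like this (a C11 sketch; the arena layout, names, and 16-byte alignment are assumptions, not a proposal for BitC's runtime). The point is that the fast path is a private bump, and the interlocked operation is paid once per 32 KB chunk rather than once per allocation:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define CHUNK_SIZE (32 * 1024)

/* Shared nursery arena; chunks are handed out by an atomic bump. */
static char nursery[8 * CHUNK_SIZE];
static _Atomic size_t next_off;

typedef struct {
    char *bump;   /* next free byte in this thread's chunk */
    char *limit;  /* end of this thread's chunk */
} tla_t;

/* Slow path: one interlocked op grabs a fresh 32 KB chunk. */
static int tla_refill(tla_t *t)
{
    size_t off = atomic_fetch_add(&next_off, (size_t)CHUNK_SIZE);
    if (off + CHUNK_SIZE > sizeof nursery)
        return 0;                     /* nursery exhausted */
    t->bump  = nursery + off;
    t->limit = t->bump + CHUNK_SIZE;
    return 1;
}

/* Fast path: plain bump allocation, no synchronization at all. */
static void *tla_alloc(tla_t *t, size_t n)
{
    n = (n + 15) & ~(size_t)15;       /* 16-byte alignment */
    if (t->bump == NULL || (size_t)(t->limit - t->bump) < n) {
        if (!tla_refill(t))
            return NULL;
    }
    void *p = t->bump;
    t->bump += n;
    return p;
}
```

On a weakly ordered machine the fast path stays fence-free; only tla_refill pays for the atomic, so the interlock cost is amortized over every allocation in the chunk.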
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
