On Tue, Jul 16, 2013 at 12:07 AM, Bennie Kloosteman <[email protected]> wrote:

> I don't think the fact it's not in the kernel is a big reason either ...
> worst case you could write a kernel module or driver and pass it the data
> to do the same thing ... less elegant but would work.
>

Not so. As was explained elsewhere, this is about doing very high speed TLB
and page table entry invalidation. The problem that the Azul kernel changes
were intended to solve was the need for a kernel fast path to do this. By
fast path, think "O(100) asm instructions". Calling a kernel module is
several orders of magnitude too slow.
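
To give a feel for the gap: even the cheapest possible user/kernel round
trip costs far more than O(100) instructions once the mode switch is paid
for, and an ioctl into a custom module only adds dispatch and copying on
top of that. A back-of-envelope microbenchmark sketch (Linux assumed; this
is not anything Azul shipped):

#define _GNU_SOURCE
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(void)
{
    const long iters = 1000000;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++)
        syscall(SYS_getpid);        /* minimal user/kernel round trip */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9
              + (t1.tv_nsec - t0.tv_nsec);
    printf("%.1f ns per kernel entry\n", ns / iters);
    return 0;
}

On typical hardware that lands somewhere between tens and a few hundred
nanoseconds per call before the module does any useful work, which is the
kind of overhead a kernel fast path is meant to avoid.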


> Jonathan, Singularity had like 8 different types of GCs; did they learn
> anything about switching collectors?
>

So first, I had no association with Singularity other than reading what has
been published. And of course I couldn't respond to your question directly
if I had. But I can say what I do know from general, public sources and
non-proprietary conversations. Note that all of what I'm about to say
predates the invention of continuous concurrent collectors.

Copying collection is best viewed as an optimization on mark-sweep. It is
an optimization for the case where allocation volume greatly exceeds the
amount of live data, with the side benefit of induced cache locality in
some cases.

Two-space collectors and generational collectors are best viewed as
optimizations on copying collectors that take advantage of general
properties of object lifespans.
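
To make the "optimization" framing concrete, here is a minimal Cheney-style
semispace sketch in C. It is a toy (the object model, names, and space setup
are all hypothetical, and from-space management is omitted), but it shows why
copying wins when allocation far exceeds liveness: the collector's work is
proportional to the live data it evacuates, and dead objects are simply never
visited.

/* Toy object: a header plus a fixed array of pointer fields.             */
#include <stddef.h>
#include <string.h>

typedef struct Obj {
    struct Obj *forward;    /* set once the object has been evacuated     */
    size_t      nfields;    /* number of pointer fields that follow       */
    struct Obj *field[];    /* outgoing references                        */
} Obj;

static char *to_space;      /* assumed to be a large, pre-mapped region   */
static char *alloc_ptr;     /* bump pointer into to_space                 */

static size_t obj_size(Obj *o)
{
    return sizeof(Obj) + o->nfields * sizeof(Obj *);
}

/* Evacuate one object (if not already evacuated) and return its new
 * address. A forwarding pointer is left behind so later references to
 * the old copy resolve to the new one.                                   */
static Obj *evacuate(Obj *o)
{
    if (o == NULL)   return NULL;
    if (o->forward)  return o->forward;
    Obj *new = (Obj *)alloc_ptr;
    alloc_ptr += obj_size(o);
    memcpy(new, o, obj_size(o));
    o->forward = new;
    return new;
}

/* Cheney scan: copy the roots, then treat to-space itself as the work
 * queue until every reachable field has been forwarded.                  */
void collect(Obj **roots, size_t nroots)
{
    char *scan = alloc_ptr = to_space;
    for (size_t i = 0; i < nroots; i++)
        roots[i] = evacuate(roots[i]);
    while (scan < alloc_ptr) {
        Obj *o = (Obj *)scan;
        for (size_t i = 0; i < o->nfields; i++)
            o->field[i] = evacuate(o->field[i]);
        scan += obj_size(o);
    }
    /* Everything left in from-space is garbage; it is reclaimed wholesale
     * by flipping the two spaces, with no per-object free() cost.        */
}

A generational collector applies essentially the same evacuation loop to a
small nursery, betting that most of the nursery is already dead when the
loop runs.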

But the key words here are "optimization" and "general". As the saying
goes: "The difference between theory and practice is that in theory there
is no difference between theory and practice, but in practice there is." So
*no* real application exactly matches any particular set of optimization
assumptions. Worse, applications go through "regimes" in which their
behavior changes modally. For example, a lot of stuff is created during app
initialization that is retained for a long time; the generational intuition
doesn't tend to kick in until the application reaches steady state.

So of course there are pathological cases. The pathological case for
malloc/free, by the way, is an application that throws heap data away
nearly as fast as it allocates it.
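
A contrived sketch of that shape, purely for illustration: every object
dies almost immediately, so a malloc/free allocator pays its full per-object
bookkeeping cost on every iteration, while a copying collector would never
touch the dead objects at all.

#include <stdlib.h>

struct event { int kind; double payload; };

double process_stream(long n)
{
    double sum = 0.0;
    for (long i = 0; i < n; i++) {
        struct event *e = malloc(sizeof *e);   /* allocate ...             */
        if (!e)
            break;
        e->kind    = (int)(i & 7);
        e->payload = i * 0.5;
        sum += e->payload;
        free(e);                               /* ... and discard at once  */
    }
    return sum;
}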

But the other thing to say is that, barring a new result from continuous
concurrent collection, there exists no single collector design today that
fits all application scenarios. This, rather than performance, is the soft
underbelly of the GC argument, and David Jeske's objections are obliquely
trying to point this out. There are also issues in trading *physical* RAM
for performance, which David has correctly identified. His numbers are
stale. The currently relevant multiplier is 3x RAM rather than 10x RAM, but
his fundamental point remains valid for pre-C4 collectors.

In any case, the two major collectors I know about for Singularity and
Midori are the generic CLR collector and the STOPLESS work (and successors)
that Bjarne Steensgaard did. The CLR collector is a traditional generational
collector; I don't know what, if any, specialized support was added for
Singularity or Midori. Bjarne's work has been reasonably well described in
publications.

I can't comment on what Midori or Singularity did with their collectors,
but I can point out a problem that arises in the presence of shared memory
when different processes use different collectors:

There tends to be an intimate relationship between the choice of collector
and the design of the in-heap object header. Different collectors require
different kinds of markers or interlocks on the objects, and the object
header layout changes accordingly.

When two different processes share memory, they have to agree on the
semantics of the object headers well enough to cooperate. With distinct
collectors running in the two processes, this is *very* hard to achieve.
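
To make that concrete, compare two entirely hypothetical header layouts,
one shaped for a mark-sweep collector and one for a copying/concurrent
collector. Neither is the layout of any real runtime; the point is only
that the bits each collector needs are different and occupy the same words.

#include <stdint.h>

/* Mark-sweep style: a mark bit plus sweep bookkeeping; objects never move. */
struct ms_header {
    uint32_t type_id;        /* index into a type/vtable table             */
    uint32_t mark   : 1;     /* set during trace, cleared during sweep     */
    uint32_t pinned : 1;
    uint32_t size   : 30;    /* payload size in words, for the sweeper     */
};

/* Copying/concurrent style: the header doubles as a forwarding pointer
 * and carries state consulted by read/write barriers.                     */
struct copy_header {
    uintptr_t forward;       /* new address once evacuated; low bits       */
                             /* reused as barrier "color" state            */
    uint32_t  type_id;
    uint32_t  age;           /* drives promotion between generations       */
};

/* A process interpreting shared objects through ms_header and one
 * interpreting them through copy_header cannot safely trace, move, or
 * barrier the same object: the same header words mean different things.   */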

One of the reasons that the Singularity "shared heap" had to be referenced
by linear-typed references is that doing so eliminates the need for the GC
to visit the shared heap. Rust's counted pointers could be used instead, and in
some respects would be better. Rust's "owned" pointer is *not* a
substitute, because ownership can't be transferred. That's the main
difference between an owned pointer and a linear pointer.

Dealing with two heaps in a single process is tricky. Dealing with a heap
that is shared across process boundaries is even trickier.

> That said, who is to say that the Azul collector is not 50% slower ...
> there is very little information about this.
>

Actually, there is quite a lot of information about this. The key statement is that they
have measurements for heaps as big as 300GB indicating that *in principle* the
mutator cannot outpace the collector. What I do not remember is what
percentage of total multicore CPU capacity and what percentage of total
memory bandwidth is required to achieve that. I do remember that it was
surprisingly modest.

> I'm sure you can build some synchronised GC with safe points and minimal
> pauses, but can you do it without trashing cache hits and generating lots
> of context switches?
>

No. But that's not how these newer collectors work. The newer collectors
*rely* on multicore for their success. Fortunately, multicore seems to be
the way of the future whether we want it or not.

So we now seem to be in a design space where two types of collectors remain
relevant:

1. Ones where total RAM is small enough for simple, conventional collection
to make sense.
2. Ones where multicore collectors make sense.


> Re concurrent GCs, yes, I prefer to use the term pauseless.
>

Pauseless really isn't the objective in concurrent collectors. Generational
collectors are effectively pauseless for the majority of applications. I'm
aware that there are exceptions, but the term "pauseless" has been
effectively co-opted in the literature to exclude those cases. Better to
use a new term.

The goal for the concurrent collectors is to be "stopless". That is: the
mutator is *never* halted, or at least, the worst case halt is measured in
microseconds and applies only to a single mutating thread. One of the big
problems in single-core concurrent collectors is the need to synchronize
all of the mutators on the collector. *That* turns out to be the big source
of delay in many designs. That's why the STOPLESS type of design matters.
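
For contrast, here is a sketch of the global safepoint rendezvous that
stopless designs are trying to get away from (hypothetical code, single
round only; a real runtime versions these counters per collection epoch).
The pause the application sees is the time it takes the slowest mutator to
reach its next poll, which is exactly the synchronization cost described
above.

#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool safepoint_requested = false;
static atomic_int  threads_parked      = 0;

/* Mutators call this at allocation sites, loop back-edges, call returns.  */
static inline void safepoint_poll(void)
{
    if (atomic_load_explicit(&safepoint_requested, memory_order_acquire)) {
        atomic_fetch_add(&threads_parked, 1);
        while (atomic_load_explicit(&safepoint_requested,
                                    memory_order_acquire))
            ;                        /* parked until the collector is done */
        atomic_fetch_sub(&threads_parked, 1);
    }
}

/* Collector side: request the stop, then wait for every mutator.          */
static void stop_the_world(int nmutators)
{
    atomic_store_explicit(&safepoint_requested, true, memory_order_release);
    while (atomic_load(&threads_parked) < nmutators)
        ;                            /* pause length = slowest arrival     */
    /* ... do whatever phase required exclusion ...                        */
    atomic_store_explicit(&safepoint_requested, false, memory_order_release);
}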


> I also think the art of memory management is still useful with GC .. but
> little used, where in C it is often used .. if you test it and create too
> many objects, reduce it .. e.g. if you have a non-array linked list, reuse
> your nodes; if you have vertex buffers or buffer[], reuse them.
>

I agree completely that the art of memory management remains important in a
world of GC. But "reuse" is not what people generally mean when they talk
about memory management. Reuse is an idiom; memory management is a
mechanism.
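
A trivial example of the idiom, for concreteness (hypothetical code): a
recycle list for nodes layered on top of whatever sits underneath. The
idiom is the programmer's decision to retire and reuse nodes; the mechanism
is whatever malloc/free or the collector does below it.

#include <stdlib.h>

struct node {
    struct node *next;
    int          value;
};

static struct node *recycle_list = NULL;   /* retired nodes awaiting reuse */

struct node *node_get(int value)
{
    struct node *n = recycle_list;
    if (n)
        recycle_list = n->next;            /* reuse a retired node         */
    else if (!(n = malloc(sizeof *n)))     /* fall back to the allocator   */
        return NULL;
    n->next  = NULL;
    n->value = value;
    return n;
}

void node_put(struct node *n)              /* retire instead of free()     */
{
    n->next = recycle_list;
    recycle_list = n;
}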



Jonathan
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
