[EMAIL PROTECTED] wrote:

a) accessing a new register frame and context
b) during DOD/GC

We have to address both areas to get rid of the majority of cache
misses.

ad a)

For each level of recursion, we allocate a new context structure
and a new register frame. Half of these come from the recently
implemented return continuation and register frame caches. The other
half has to be freshly allocated. So on exactly every second
function call we get L2 cache misses for both the context and the
register structure.
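Below is a minimal C sketch of the free-list caching described above. It is an illustration, not Parrot's code: the type and function names (`reg_frame_t`, `frame_alloc`, `frame_free`) and the 4 x 32 register layout are hypothetical. Freed frames go onto a LIFO list, so the next call reuses the most recently released memory, which is the memory most likely to still be in cache. Only when the free list is empty does allocation fall back to `malloc`, and those fresh allocations are the cache-cold ones. The context structure could be cached the same way.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical layout: 4 register sets x 32 slots (not Parrot's real struct). */
#define NUM_REGS 32

typedef struct reg_frame {
    long   int_regs[NUM_REGS];
    double num_regs[NUM_REGS];
    void  *str_regs[NUM_REGS];
    void  *pmc_regs[NUM_REGS];
    struct reg_frame *next_free;   /* link used only while on the free list */
} reg_frame_t;

typedef struct frame_cache {
    reg_frame_t *free_list;  /* LIFO: most recently freed (likely still cached) first */
    size_t fresh_allocs;     /* frames obtained from malloc (cache-cold) */
    size_t cache_hits;       /* frames recycled from the free list */
} frame_cache_t;

/* Pop a recycled frame if one is available, otherwise fall back to malloc. */
static reg_frame_t *frame_alloc(frame_cache_t *c) {
    reg_frame_t *f = c->free_list;
    if (f) {
        c->free_list = f->next_free;
        c->cache_hits++;
        return f;
    }
    f = malloc(sizeof *f);
    assert(f != NULL);
    c->fresh_allocs++;
    return f;
}

/* Return a frame to the cache instead of releasing it. */
static void frame_free(frame_cache_t *c, reg_frame_t *f) {
    f->next_free = c->free_list;
    c->free_list = f;
}

/* Release every cached frame back to the system allocator. */
static void frame_cache_destroy(frame_cache_t *c) {
    while (c->free_list) {
        reg_frame_t *f = c->free_list;
        c->free_list = f->next_free;
        free(f);
    }
}
```

Usage: recursing 8 levels deep, returning, and recursing 8 levels again costs 8 fresh allocations the first time and 8 cache hits the second time.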



Or it would make sense to use multi-frame register chunks. I had locality of access in mind but somehow never spelled it out. I *think* I suggested 64 KB as a good chunk size precisely because it fits well into the CPU cache, without ever stating that as the reason.

Anyway, if you can pop both register frames -and- context structures, you won't need to run GC very often, and everything will fit nicely into the cache. Is the context structure a PMC now? And does it have to be one, if the code doesn't specifically request access to it?
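Here is one possible reading of the multi-frame chunk idea, sketched in C under stated assumptions. Each call pushes a fixed-size record into a 64 KB chunk, and the record holds both a hypothetical context and its register frame. A return simply pops that record, so calls create no garbage and consecutive frames sit next to each other in memory. `activation_t`, `push_activation` and `pop_activation` are illustrative names only. When a chunk empties, it is kept as a spare, so recursion that oscillates around a chunk boundary does not call `malloc`/`free` on every call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define CHUNK_SIZE (64 * 1024)   /* the 64 KB chunk size discussed above */

typedef struct context {         /* hypothetical stand-in for the context structure */
    void *return_pc;
    void *caller;
} context_t;

typedef struct activation {      /* context and register frame stored side by side */
    context_t ctx;
    long regs[4 * 32];           /* hypothetical: 4 register sets x 32 slots */
} activation_t;

#define PER_CHUNK (CHUNK_SIZE / sizeof(activation_t))

typedef struct chunk {
    struct chunk *prev;          /* older chunk, resumed when this one empties */
    size_t count;                /* activations currently pushed in this chunk */
    activation_t slots[];        /* C99 flexible array member */
} chunk_t;

typedef struct frame_stack {
    chunk_t *top;
    chunk_t *spare;              /* at most one emptied chunk kept for reuse */
} frame_stack_t;

/* Push a new activation record, moving to a new (or spare) chunk when full. */
static activation_t *push_activation(frame_stack_t *s) {
    if (!s->top || s->top->count == PER_CHUNK) {
        chunk_t *c = s->spare;
        if (c) {
            s->spare = NULL;
        } else {
            c = malloc(sizeof(chunk_t) + PER_CHUNK * sizeof(activation_t));
            assert(c != NULL);
        }
        c->prev = s->top;
        c->count = 0;
        s->top = c;
    }
    return &s->top->slots[s->top->count++];
}

/* Pop the most recent activation; an emptied chunk becomes the spare. */
static void pop_activation(frame_stack_t *s) {
    chunk_t *c = s->top;
    assert(c != NULL && c->count > 0);
    if (--c->count == 0) {
        s->top = c->prev;
        free(s->spare);          /* keep at most one spare chunk */
        s->spare = c;
    }
}

/* Free all chunks, including the spare. */
static void frame_stack_destroy(frame_stack_t *s) {
    while (s->top) {
        chunk_t *c = s->top;
        s->top = c->prev;
        free(c);
    }
    free(s->spare);
    s->spare = NULL;
}
```

With 8-byte `long`s, the hypothetical record is roughly 1 KB, so a 64 KB chunk holds a few dozen frames. The exact count depends on the platform's type sizes.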

ad b)

The (currently broken) Parrot setting ARENA_DOD_FLAGS shows one
way to reduce cache misses in DOD. During a sweep (which runs
through all allocated object memory) the object memory itself isn't
touched; only a nibble per object is used, which holds the relevant
information like "is_live".
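The following C sketch shows the side-table idea behind ARENA_DOD_FLAGS, assuming a simple arena of fixed-size objects. The layout, `FLAG_LIVE`/`FLAG_FREE` and the function names are hypothetical, not Parrot's real code. Each object gets four flag bits, packed two per byte in a separate array. `sweep` reads and writes only that array and never touches the object bodies.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical flag bits; each object owns one 4-bit nibble. */
enum { FLAG_LIVE = 0x1, FLAG_FREE = 0x2 };

#define ARENA_OBJECTS 1024

typedef struct arena {
    unsigned char objects[ARENA_OBJECTS][32];   /* object bodies; sweep never reads these */
    uint8_t flags[(ARENA_OBJECTS + 1) / 2];     /* two nibbles per byte */
} arena_t;

static void arena_init(arena_t *a) {
    memset(a->flags, 0, sizeof a->flags);
}

static unsigned get_flags(const arena_t *a, size_t i) {
    uint8_t b = a->flags[i / 2];
    return (i & 1) ? (unsigned)(b >> 4) : (unsigned)(b & 0x0F);
}

static void set_flags(arena_t *a, size_t i, unsigned f) {
    uint8_t *b = &a->flags[i / 2];
    if (i & 1) *b = (uint8_t)((*b & 0x0F) | ((f & 0x0F) << 4));
    else       *b = (uint8_t)((*b & 0xF0) | (f & 0x0F));
}

/* Mark phase: set the live bit for a reachable object. */
static void mark_live(arena_t *a, size_t i) {
    set_flags(a, i, get_flags(a, i) | FLAG_LIVE);
}

/* Sweep phase: walks only the flag table. Unmarked, in-use objects are
   flagged free; live bits are cleared for the next cycle. Returns the
   number of objects freed. */
static size_t sweep(arena_t *a) {
    size_t freed = 0;
    for (size_t i = 0; i < ARENA_OBJECTS; i++) {
        unsigned f = get_flags(a, i);
        if (f & FLAG_FREE) continue;
        if (f & FLAG_LIVE) {
            set_flags(a, i, f & ~(unsigned)FLAG_LIVE);
        } else {
            set_flags(a, i, f | FLAG_FREE);
            freed++;
        }
    }
    return freed;
}
```

With the 32-byte objects assumed here, the flag table is 1/64 the size of the object memory. A sweep over 1024 objects therefore streams through 512 bytes instead of 32 KB.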


Is there a way to find out how many misses come from DOD, compared to register frame allocation?

I believe that you shouldn't litter (i.e. create an immediately GCable object) on each function call, at least not without a generational collector specifically optimised for this. That would mean a first generation that fits into the CPU cache, with live objects copied out of it. And this means a copying GC for Parrot, something that (IMHO) would be highly nontrivial to retrofit.
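For comparison, here is a minimal sketch of the kind of generational collector described above. It has a small bump-allocated nursery that could stay cache-resident, and a Cheney-style minor collection that copies objects reachable from the roots into old space and then resets the nursery. This is a generic textbook technique, not anything that exists in Parrot. The object layout and names are hypothetical. The sketch also assumes that no old-to-young pointers exist outside the root set; a real collector would need a write barrier or remembered set to handle those.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical object: one outgoing reference plus a payload. */
typedef struct obj {
    struct obj *ref;       /* single outgoing reference (may be NULL) */
    struct obj *forward;   /* set once a nursery object has been copied out */
    long value;
    int old;               /* 1 if the object lives in old space */
} obj_t;

#define NURSERY_OBJS 256   /* about 8 KB with 64-bit pointers: small enough to stay cached */
#define OLD_OBJS     4096

typedef struct heap {
    obj_t nursery[NURSERY_OBJS];
    size_t nursery_top;    /* bump-pointer index into the nursery */
    obj_t old[OLD_OBJS];
    size_t old_top;
} heap_t;

static void heap_init(heap_t *h) {
    memset(h, 0, sizeof *h);
}

/* Bump allocation; returns NULL when the nursery is full (caller collects). */
static obj_t *nursery_alloc(heap_t *h, long value, obj_t *ref) {
    if (h->nursery_top == NURSERY_OBJS) return NULL;
    obj_t *o = &h->nursery[h->nursery_top++];
    o->ref = ref;
    o->forward = NULL;
    o->value = value;
    o->old = 0;
    return o;
}

/* Copy a nursery object to old space once, leaving a forwarding pointer. */
static obj_t *copy(heap_t *h, obj_t *o) {
    if (o == NULL || o->old) return o;      /* NULL or already tenured */
    if (o->forward) return o->forward;      /* already copied */
    assert(h->old_top < OLD_OBJS);
    obj_t *n = &h->old[h->old_top++];
    *n = *o;
    n->old = 1;
    n->forward = NULL;
    o->forward = n;
    return n;
}

/* Minor collection: evacuate everything reachable from the roots, then
   reset the nursery. Root slots are rewritten to the new addresses. */
static void minor_collect(heap_t *h, obj_t **roots, size_t nroots) {
    size_t scan = h->old_top;               /* Cheney scan index into old space */
    for (size_t i = 0; i < nroots; i++) roots[i] = copy(h, roots[i]);
    while (scan < h->old_top) {             /* copy whatever the copied objects reference */
        obj_t *o = &h->old[scan++];
        o->ref = copy(h, o->ref);
    }
    h->nursery_top = 0;                     /* whatever was not copied is garbage */
}
```

Note that `minor_collect` rewrites the root slots in place. Every pointer into the nursery has to be found and updated, and that requirement is what makes retrofitting a copying GC onto an existing VM hard.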

   Miro


