[EMAIL PROTECTED] wrote:
Or it would make sense to use multi-frame register chunks. I kept locality of access in mind but somehow never spelled it out. But I *think* I mentioned 64 KB as a good chunk size precisely because it fits well into the CPU cache - without ever specifying this as the reason.
a) accessing a new register frame and context
b) during DOD/GC
We have to address both areas to get rid of the majority of cache misses.
ad a)
For each level of recursion, we are allocating a new context structure and a new register frame. Half of these come from the recently implemented return continuation and register frame caches. The other half has to be freshly allocated. For exactly every second function call we get L2 cache misses for both the context and the register structure.
Anyway, if you can pop both register frames -and- context structures, you won't run GC too often, and everything will nicely fit into the cache. Is the context structure a PMC now (and does it have to be, if the code doesn't specifically request access to it?)
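To illustrate what "popping" frames buys in terms of locality, here is a minimal C sketch of a LIFO free list for register frames. It is only an illustration under stated assumptions: `Frame`, `FrameCache`, `frame_acquire` and `frame_release` are hypothetical names, the 32-slot register array is a placeholder, and none of this is Parrot's actual code. The point is that a frame released on return is the first one handed out on the next call, so it is likely still in cache.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical register frame; the payload size is a placeholder. */
typedef struct Frame {
    struct Frame *next;   /* link used only while the frame is cached */
    double regs[32];
} Frame;

/* Hypothetical per-interpreter cache of released frames. */
typedef struct FrameCache {
    Frame *free_list;     /* most recently released frame first */
    size_t fresh_allocs;  /* how often we had to go to malloc */
} FrameCache;

/* On call: reuse the most recently released frame if there is one. */
static Frame *frame_acquire(FrameCache *c)
{
    Frame *f = c->free_list;
    if (f) {
        c->free_list = f->next;
    } else {
        f = malloc(sizeof *f);
        if (!f)
            return NULL;
        c->fresh_allocs++;
    }
    f->next = NULL;
    return f;
}

/* On return: push the frame back instead of leaving it for the GC. */
static void frame_release(FrameCache *c, Frame *f)
{
    assert(f != NULL);
    f->next = c->free_list;
    c->free_list = f;
}
```

With this shape, a call/return loop at constant depth keeps touching the same few frames, and `fresh_allocs` stops growing once the deepest recursion level has been reached. The same scheme would apply to context structures if they do not need to be collectable objects.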
Is there a way to find out how many misses come from DOD, compared to register frame allocation?
ad b)
The (currently broken) Parrot setting ARENA_DOD_FLAGS shows one
possibility to reduce cache misses in DOD. During a sweep (which runs
through all allocated object memory) the memory itself isn't touched,
just a nibble per object is used, which holds the relevant information
like "is_live".
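As an illustration of the nibble-per-object idea, here is a minimal C sketch of a side table of 4-bit flags that a sweep can walk without reading the objects themselves. The layout is an assumption for the example only: `Arena`, `FLAG_LIVE`, `FLAG_ON_FREE_LIST`, `ARENA_OBJECTS` and the helper functions are hypothetical and do not reproduce the real ARENA_DOD_FLAGS implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bits stored in each object's nibble. */
enum { FLAG_LIVE = 0x1, FLAG_ON_FREE_LIST = 0x2 };

#define ARENA_OBJECTS 1024

/* Hypothetical arena: flags are packed two objects per byte, so the
 * flags for 1024 objects occupy only 512 bytes. Object storage would
 * live elsewhere in the arena and is never read by the sweep. */
typedef struct Arena {
    uint8_t flags[ARENA_OBJECTS / 2];
} Arena;

static unsigned get_flags(const Arena *a, size_t i)
{
    assert(i < ARENA_OBJECTS);
    return (a->flags[i / 2] >> ((i & 1) * 4)) & 0xFu;
}

static void set_flags(Arena *a, size_t i, unsigned nib)
{
    unsigned shift = (unsigned)(i & 1) * 4;
    assert(i < ARENA_OBJECTS && nib <= 0xFu);
    a->flags[i / 2] =
        (uint8_t)((a->flags[i / 2] & ~(0xFu << shift)) | (nib << shift));
}

/* Sweep: clear the live bit on survivors and mark dead objects as
 * free. Returns how many objects were newly freed. */
static size_t sweep(Arena *a)
{
    size_t freed = 0;
    for (size_t i = 0; i < ARENA_OBJECTS; i++) {
        unsigned f = get_flags(a, i);
        if (f & FLAG_LIVE) {
            set_flags(a, i, f & ~(unsigned)FLAG_LIVE);
        } else if (!(f & FLAG_ON_FREE_LIST)) {
            set_flags(a, i, f | FLAG_ON_FREE_LIST);
            freed++;
        }
    }
    return freed;
}
```

The sweep loop here reads and writes only the packed flag bytes, which is why such a scheme should cause far fewer cache misses than a sweep that inspects a flags field inside every object header.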
I believe that you shouldn't litter (i.e. create an immediately GCable object) on each function call - at least not without a generational collector specifically optimised to work with this. That would entail a first generation that fits into the CPU cache, with live objects copied out of it. And this means a copying GC for Parrot, something that (IMHO) would be highly nontrivial to retrofit.
Miro