Did you forget to attach the picture?
I have known for some time that a major fraction of the time is spent in
RefCountingPtr::del(). It is called a great many times and each call can
end up invoking delete. I was thinking of moving to a system in which
memory for instructions is allocated statically when the simulator starts
and is reused, instead of calling new and delete all the time. Is this
what FastAlloc does?
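The allocate-once-and-reuse scheme could look something like the free-list pool sketched below. This is only a minimal illustration of the idea, not gem5's FastAlloc implementation; the names (Pool, Slot) are made up for the example.

```cpp
#include <cstddef>
#include <new>
#include <vector>

// Minimal sketch of a fixed-size object pool: all storage is allocated
// up front and recycled through a free list, so steady-state alloc/free
// never touches the heap. Hypothetical names, not gem5's FastAlloc API.
template <typename T>
class Pool {
    union Slot {
        Slot *next;                              // link when the slot is free
        alignas(T) unsigned char storage[sizeof(T)]; // payload when in use
    };
    std::vector<Slot> slots;   // backing storage, allocated once at start
    Slot *freeList = nullptr;  // singly linked list of free slots
  public:
    explicit Pool(std::size_t n) : slots(n) {
        // Thread every slot onto the free list.
        for (std::size_t i = 0; i < n; ++i) {
            slots[i].next = freeList;
            freeList = &slots[i];
        }
    }
    template <typename... Args>
    T *alloc(Args&&... args) {
        // Pop a slot and construct the object in place; no call to new.
        if (!freeList) return nullptr;
        Slot *s = freeList;
        freeList = s->next;
        return new (s->storage) T(static_cast<Args&&>(args)...);
    }
    void free(T *p) {
        // Destroy the object and push its slot back; no call to delete.
        p->~T();
        Slot *s = reinterpret_cast<Slot *>(p);
        s->next = freeList;
        freeList = s;
    }
};
```

Freeing and re-allocating returns the same slot, which is exactly the reuse behavior you describe.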
Here is a profile result that I obtained from gprof some time last
year --
  %     cumulative    self                     self    total
 time     seconds   seconds         calls    s/call   s/call  name
 4.30     845.85     845.85  176602458326      0.00     0.00  RefCountingPtr<BaseO3DynInst<O3CPUImpl> >::del()
 4.26    1685.00     839.15    2820342034      0.00     0.00  DefaultFetch<O3CPUImpl>::fetch(bool&)
 3.35    2343.36     658.36    2820342034      0.00     0.00  FullO3CPU<O3CPUImpl>::tick()
 3.05    2943.80     600.44    2426497872      0.00     0.00  DefaultRename<O3CPUImpl>::renameInsts(short)
 2.51    3437.20     493.40    2820342034      0.00     0.00  InstructionQueue<O3CPUImpl>::scheduleReadyInsts()
--
Nilay
On Tue, 29 May 2012, Ali Saidi wrote:
We recently took a look at the callgraph from gem5 with an O3 cpu
and it's pretty startling (see attached picture). The majority of time
is spent in memory management. The biggest chunk of this is in fetch
when instructions are built, however I assumed that FastAlloc would be
used. Nominally it would, except that with both ARM and x86 the size
of a DynInst is > 512 bytes, which is the maximum size FastAlloc handles.
Alpha seems to sneak under the limit, but either way it is astounding to
me that a single instruction requires over .5kB of storage. Doing some
quick math, if more than 64 dyninsts exist in the system they don't fit
in the L1 cache anymore. One thing we can do is increase the max size of
FastAlloc to 1kB, but it seems like we need to think about how to
slim down a DynInst. I've looked over it and it seems like we lose
around 48 bytes to alignment issues, as members are scattered throughout:
they are Addrs, then bools, and then more Addrs. It seems like changing
some of the bools we currently have to setters/getters with an
underlying bitvector might help, and we might want to think about
packing the most-used members together as opposed to the somewhat random
approach we have right now. You could nearly halve it if the processor
of interest doesn't have > 256 physical registers.
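[The padding effect described above can be demonstrated with a toy layout. The field names here are invented for illustration; they are not the actual DynInst members.]

```cpp
#include <cstdint>

// Scattered layout: on a typical 64-bit ABI each bool forces 7 bytes of
// padding so the following 8-byte member stays aligned.
struct Scattered {
    uint64_t pc;        // 8 bytes
    bool squashed;      // 1 byte + 7 bytes padding
    uint64_t predPC;    // 8 bytes
    bool completed;     // 1 byte + 7 bytes tail padding
};                      // typically 32 bytes

// Packed layout: group the 8-byte members, and fold the bools into one
// flags word accessed through setters/getters, as suggested above.
struct Packed {
    uint64_t pc;
    uint64_t predPC;
    uint8_t flags;      // bit 0 = squashed, bit 1 = completed

    bool squashed() const { return flags & 0x1; }
    void setSquashed(bool b) { flags = (flags & ~0x1u) | (b ? 0x1 : 0); }
    bool completed() const { return flags & 0x2; }
    void setCompleted(bool b) { flags = (flags & ~0x2u) | (b ? 0x2 : 0); }
};                      // typically 24 bytes
```

On common 64-bit ABIs the packed version saves 8 bytes here; scaled up to a 500+ byte DynInst with many scattered bools, that is where the ~48 bytes go.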
Furthermore,
looking at the picture there seem to be plenty of other places where
there are a lot of calls to new (teal-ish) / free (orange). It seems like
we could certainly make more use of FastAlloc, assuming it's actually
helping.
Thanks,
Ali
_______________________________________________
gem5-dev mailing list
[email protected]
http://m5sim.org/mailman/listinfo/gem5-dev