On 29.05.2012 10:58, Nilay Vaish wrote:
> Did you forget to attach the picture?
>
> I have known for some time that a major share of the time is spent in
> RefCountingPtr::del(). It is called a great many times, and each call can
> end up resulting in a call to delete(). I was thinking of moving to a
> system in which memory for instructions is allocated statically when the
> simulator starts and is reused, instead of making calls to new() and
> delete() all the time. Is this what FastAlloc does?
>
> Here is a profile result that I obtained from gprof some time last year:
>
>   %    cumulative    self                   self    total
>  time    seconds    seconds       calls    ms/call  ms/call  name
>  4.30     845.85     845.85  176602458326    0.00     0.00   RefCountingPtr<...>::del()
>  4.26    1685.00     839.15    2820342034    0.00     0.00   DefaultFetch::fetch(bool&)
>  3.35    2343.36     658.36    2820342034    0.00     0.00   FullO3CPU::tick()
>  3.05    2943.80     600.44    2426497872    0.00     0.00   DefaultRename::renameInsts(short)
>  2.51    3437.20     493.40    2820342034    0.00     0.00   InstructionQueue::scheduleReadyInsts()
>
> --
> Nilay
>
> On Tue, 29 May 2012, Ali Saidi wrote:
>
>> We recently took a look at the call graph from gem5 with an O3 CPU, and
>> it's pretty startling (see attached picture). The majority of the time is
>> spent in memory management. The biggest chunk of this is in fetch, when
>> instructions are built; I had assumed that FastAlloc would be used.
>> Nominally it would be, except that with both ARM and x86 the size of a
>> DynInst is > 512 bytes, which is the maximum size FastAlloc handles.
>> Alpha seems to sneak under the limit, but either way it is astounding to
>> me that a single instruction requires over 0.5 kB of storage. Doing some
>> quick math, if more than 64 DynInsts exist in the system, they no longer
>> fit in the L1 cache. One thing we can do is increase the maximum size
>> FastAlloc handles to 1 kB, but it seems like we need to think about how
>> to slim down a DynInst. I've looked it over, and it seems like we lose
>> around 48 bytes to alignment issues, as members are scattered throughout:
>> there are Addrs, then bools, and then more Addrs.
>> It seems like changing some of the bools we currently have into
>> setters/getters backed by an underlying bitvector might help, and we
>> might want to think about packing the most-used members together, as
>> opposed to the somewhat random layout we have right now. You could
>> nearly halve it if the processor of interest doesn't have > 256 physical
>> registers. Furthermore, looking at the picture, there seem to be plenty
>> of other places with a lot of calls to new (teal-ish) / free (orange).
>> It seems like we could certainly make more use of FastAlloc, assuming
>> it's actually helping.
>>
>> Thanks,
>> Ali
>>
>> _______________________________________________
>> gem5-dev mailing list
>> [email protected]
>> http://m5sim.org/mailman/listinfo/gem5-dev
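[Editor's note: the reuse scheme Nilay describes above (allocate a pool when the simulator starts, recycle objects instead of calling new()/delete()) can be sketched as a simple freelist pool. This is a hypothetical illustration of the technique, not gem5's actual FastAlloc implementation, and all names are made up:]

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <utility>
#include <vector>

// Fixed-capacity object pool: all storage is reserved up front, and
// alloc()/destroy() just pop/push slots on a freelist, so steady-state
// allocation never touches operator new/delete.
template <typename T, std::size_t N>
class ObjectPool {
  public:
    ObjectPool() {
        // Thread every slot onto the freelist at construction time.
        for (std::size_t i = 0; i < N; ++i)
            freelist_.push_back(storage_ + i * sizeof(T));
    }

    template <typename... Args>
    T *alloc(Args &&...args) {
        if (freelist_.empty())
            return nullptr;  // pool exhausted; caller must cope
        void *slot = freelist_.back();
        freelist_.pop_back();
        return new (slot) T(std::forward<Args>(args)...);  // placement new
    }

    void destroy(T *obj) {
        obj->~T();  // run the destructor but keep the memory
        freelist_.push_back(obj);
    }

  private:
    alignas(T) unsigned char storage_[N * sizeof(T)];  // raw slot storage
    std::vector<void *> freelist_;  // slots currently free for reuse
};
```

[With a scheme like this, a refcounted object's del() would hand the instruction back to the pool instead of calling delete, turning the ~176-billion-call hot path shown in the profile into two pointer operations each.]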
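[Editor's note: Ali's alignment point can be made concrete. On a typical LP64 target, each bool sandwiched between two 8-byte Addrs drags in 7 bytes of padding; grouping the wide members together and folding the flags into one bitvector with getters/setters recovers it. The member names below are invented for illustration and are not the real DynInst fields:]

```cpp
#include <cstdint>

using Addr = std::uint64_t;  // stand-in for gem5's 64-bit Addr type

// Interleaved layout: every bool is followed by 7 bytes of padding so
// that the next Addr stays 8-byte aligned.
struct ScatteredInst {
    Addr pc;
    bool squashed;   // + 7 bytes padding
    Addr predPC;
    bool issued;     // + 7 bytes padding
    Addr effAddr;
    bool completed;  // + 7 bytes padding
};

// Packed layout: wide members first, all flags sharing one byte, with
// getter/setter access as suggested in the thread.
struct PackedInst {
    Addr pc;
    Addr predPC;
    Addr effAddr;
    std::uint8_t flags;  // bit 0 = squashed, bit 1 = issued, bit 2 = completed

    bool squashed() const { return (flags & 0x1) != 0; }
    void setSquashed(bool v) { flags = (flags & ~0x1) | (v ? 0x1 : 0); }
};

// 3 * 8 + 3 * (1 + 7) = 48 bytes vs. 3 * 8 + 8 = 32 bytes (LP64 assumed).
static_assert(sizeof(ScatteredInst) == 48, "bools cost a full word each here");
static_assert(sizeof(PackedInst) == 32, "one flag byte, padded to 8");
```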
