On 29.05.2012 10:58, Nilay Vaish wrote:

> Did you forget to attach the picture?
>
> I have known for some time that a major share of time is spent in
> RefCountingPtr::del(). It is called a great many times, and each call
> can end up resulting in a call to delete(). I was thinking of moving
> to a system in which memory for instructions is allocated statically
> when the simulator starts and is reused, instead of making calls to
> new() and delete() all the time. Is this what FastAlloc does?
>
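The static-allocation scheme Nilay describes can be sketched as a fixed-capacity free list: all slots are reserved once at startup and recycled, so steady-state allocation never touches new()/delete(). This is a minimal illustration under that assumption, not gem5's actual FastAlloc; the `InstPool` name and its interface are hypothetical:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical fixed-size object pool: backing storage for up to N
// objects of type T is reserved once, and freed slots go back on a
// free list instead of being returned to the heap.
template <typename T, std::size_t N>
class InstPool {
  public:
    InstPool() {
        freeList_.reserve(N);
        for (std::size_t i = 0; i < N; ++i)
            freeList_.push_back(&slots_[i].storage);
    }

    // Pop a recycled slot; returns nullptr when the pool is exhausted.
    void *allocate() {
        if (freeList_.empty())
            return nullptr;
        void *p = freeList_.back();
        freeList_.pop_back();
        return p;
    }

    // Return a slot to the free list instead of calling delete().
    void deallocate(void *p) { freeList_.push_back(p); }

  private:
    struct Slot { alignas(alignof(T)) unsigned char storage[sizeof(T)]; };
    Slot slots_[N];                 // backing store, allocated once
    std::vector<void *> freeList_;  // currently unused slots
};
```

An instruction would then be constructed with placement new into `allocate()`'s result and its destructor run by hand before `deallocate()`, trading heap traffic for a fixed memory footprint.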
> Here is a profile result that I had obtained from gprof some time
> last year --
>
>   %    cumulative    self                   self    total
>  time    seconds    seconds        calls  ms/call  ms/call  name
>  4.30     845.85     845.85  176602458326    0.00     0.00  RefCountingPtr<...>::del()
>  4.26    1685.00     839.15    2820342034    0.00     0.00  DefaultFetch<...>::fetch(bool&)
>  3.35    2343.36     658.36    2820342034    0.00     0.00  FullO3CPU<...>::tick()
>  3.05    2943.80     600.44    2426497872    0.00     0.00  DefaultRename<...>::renameInsts(short)
>  2.51    3437.20     493.40    2820342034    0.00     0.00  InstructionQueue<...>::scheduleReadyInsts()
>
> --
> Nilay
>
> On Tue, 29 May 2012, Ali Saidi wrote:
>
>> We recently took a look at the call graph from gem5 with an O3 CPU
>> and it's pretty startling (see attached picture). The majority of
>> time is spent in memory management. The biggest chunk of this is in
>> fetch when instructions are built; I had assumed that FastAlloc
>> would be used. Nominally it would be, except that with both ARM and
>> x86 the size of a DynInst is > 512 bytes, which is the maximum size
>> FastAlloc handles. Alpha seems to sneak under the limit, but either
>> way it is astounding to me that a single instruction requires over
>> 0.5 kB of storage. Doing some quick math, if more than 64 DynInsts
>> exist in the system they no longer fit in the L1 cache. One thing we
>> can do is increase the max size of FastAlloc to 1 kB, but it seems
>> like we need to think about how to slim down a DynInst. I've looked
>> over it and it seems like we lose around 48 bytes to alignment
>> issues, as members are scattered throughout: there are Addrs, bools,
>> and then more Addrs. It seems like changing some of the bools we
>> currently have to setters/getters with an underlying bit vector
>> might help, and we might want to think about packing the most-used
>> members together as opposed to the somewhat random layout we have
>> right now. You could nearly halve it if the processor of interest
>> doesn't have > 256 physical registers. Furthermore, looking at the
>> picture there seem to be plenty of other places where there are a
>> lot of calls to new (teal-ish) / free (orange). It seems like we
>> could certainly make more use of FastAlloc, assuming it's actually
>> helping.
>>
>> Thanks,
>> Ali
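The layout changes Ali suggests can be sketched as follows. The member names (pc, completed, squashed) and counts are illustrative assumptions, not gem5's actual DynInst layout; the point is how interleaving 8-byte Addrs with 1-byte bools wastes space, and how grouping members and folding bools into a bit vector recovers it:

```cpp
#include <cstdint>

using Addr = std::uint64_t;  // an Addr is an 8-byte integer

// Interleaved layout: each bool needs 7 bytes of padding so the
// following Addr can stay 8-byte aligned.
struct Scattered {
    Addr pc;         // 8 bytes
    bool completed;  // 1 byte + 7 bytes padding
    Addr nextPC;     // 8 bytes
    bool squashed;   // 1 byte + 7 bytes tail padding
};

// Grouped layout: Addrs first, former bools packed one-bit-each into
// a flag word behind setters/getters.
struct Packed {
    enum Flag : std::uint8_t { Completed = 1 << 0, Squashed = 1 << 1 };

    Addr pc;
    Addr nextPC;
    std::uint8_t flags;  // former bools, one bit each

    bool isSet(Flag f) const { return flags & f; }
    void set(Flag f) { flags |= f; }
    void clear(Flag f) { flags &= ~f; }
};

static_assert(sizeof(Scattered) == 32, "padding between members");
static_assert(sizeof(Packed) == 24, "same data in 8 fewer bytes");
```

With only two bools the saving is 8 bytes; scaled up to the dozens of flags and scattered members in a real DynInst, this is where the ~48 bytes Ali estimates would come from.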
> 
> _______________________________________________
> gem5-dev mailing list
> [email protected]
> http://m5sim.org/mailman/listinfo/gem5-dev

 