On Wed, Jan 19, 2011 at 3:56 PM, Nilay Vaish <ni...@cs.wisc.edu> wrote:
> Some more data from the same simulation, this time from m5out/stats.txt.
> The number of memory references is slightly more than 30,000,000, whereas
> the number of lookups in the cache is about 256,000,000. So that would be
> a ratio of 1 : 8.5. I suspect that the reason for this might be that every
> cache controller performs a lookup for every memory reference.

Do you mean that you do a lookup in all 8 caches for each reference? Is this
part of an assertion that's checking coherence invariants? It seems like
that would be something we'd want to do only in debug mode (or maybe only
when explicitly enabled).

> It seems to me that only one thread was running all the time, executing
> mostly on processors 0 and 4. Is the rate at which sim ticks are
> incremented different from the processor clock rate? The simulation was
> allowed to run for 200 billion sim ticks while the processor cycle count
> is just 400 million.

Ticks are typically picoseconds, so assuming a 2 GHz CPU clock that looks
right.

> Given the way the processors behaved during this simulation, I think we
> should profile a simulation in which all the processors are active most
> of the time.

I'm not sure that would affect the simulator performance profile that much.

Steve

> --
> Nilay
>
> system.cpu0.not_idle_fraction      0.170783  # Percentage of non-idle cycles
> system.cpu0.numCycles             399995398  # number of cpu cycles simulated
> system.cpu0.num_insts               9504409  # Number of instructions executed
> system.cpu0.num_refs                2258648  # Number of memory references
>
> system.cpu1.not_idle_fraction      0.002503  # Percentage of non-idle cycles
> system.cpu1.numCycles             398441993  # number of cpu cycles simulated
> system.cpu1.num_insts                212019  # Number of instructions executed
> system.cpu1.num_refs                  71465  # Number of memory references
>
> CPUs 2 and 3 behave the same as CPU 1.
> system.cpu4.not_idle_fraction      0.823388  # Percentage of non-idle cycles
> system.cpu4.numCycles             400000000  # number of cpu cycles simulated
> system.cpu4.num_insts              80686910  # Number of instructions executed
> system.cpu4.num_refs               28047398  # Number of memory references
>
> system.cpu5.not_idle_fraction      0.000185  # Percentage of non-idle cycles
> system.cpu5.numCycles             399900914  # number of cpu cycles simulated
> system.cpu5.num_insts                  8413  # Number of instructions executed
> system.cpu5.num_refs                   2862  # Number of memory references
>
> CPUs 6 and 7 also have stats similar to CPU 5.
>
> On Wed, 19 Jan 2011, Steve Reinhardt wrote:
>
>> What's the deal with Histogram::add()? Either it's too slow or it's
>> being called too much, I'd say, unless we're tracking some incredibly
>> vital statistics there. Can you use the call-graph part of the profile
>> to find where most of the calls are coming from?
>>
>> Also, can you look at the stats and verify that the number of calls to
>> CacheMemory::lookup() is not much larger than the total number of CPU
>> memory references?
>>
>> Another thing to note is that even though PerfectSwitch::wakeup() is not
>> a huge amount of time, the time per call is pretty large, so there may
>> be opportunities for a lot of improvement there.
>>
>> Thanks,
>>
>> Steve
>>
>> On Wed, Jan 19, 2011 at 9:20 AM, Nilay Vaish <ni...@cs.wisc.edu> wrote:
>>
>>> I profiled m5 again, using the following command:
>>>
>>> ./build/ALPHA_FS_MOESI_hammer/m5.prof ./configs/example/ruby_fs.py
>>> --maxtick 200000000000 -n 8 --topology Mesh --mesh-rows 2
>>> --num-l2cache 8 --num-dir 8
>>>
>>> Results have been copied below. CacheMemory::lookup() still consumes
>>> some time, but a lot less than before. PerfectSwitch could be a
>>> candidate for redesign, but given that PerfectSwitch is taking only 3%
>>> of the time, it would not yield that much gain. Maybe we need to look
>>> at these different functions in a more holistic fashion.
>>> --
>>> Nilay
>>>
>>>   %   cumulative    self               self    total
>>>  time   seconds   seconds     calls   s/call   s/call  name
>>>  7.29     35.89     35.89  606750284    0.00     0.00  Histogram::add(long long)
>>>  5.84     64.62     28.73  256533483    0.00     0.00  CacheMemory::lookup(Address const&)
>>>  4.56     87.06     22.44  124360139    0.00     0.00  L1Cache_Controller::wakeup()
>>>  3.88    106.18     19.12  121283110    0.00     0.00  RubyPort::M5Port::recvTiming(Packet*)
>>>  3.13    121.60     15.42    6875704    0.00     0.00  PerfectSwitch::wakeup()
>>>  3.00    136.38     14.78   39527382    0.00     0.00  MemoryControl::executeCycle()
>>>  2.95    150.91     14.53   90855686    0.00     0.00  BaseSimpleCPU::preExecute()
>>>  2.68    164.10     13.19  147750111    0.00     0.00  MessageBuffer::enqueue(RefCountingPtr<Message>, long long)
>>>  2.41    175.96     11.86  180281626    0.00     0.00  RubyEventQueueNode::process()
>>>  2.07    186.17     10.21  302054741    0.00     0.00  EventQueue::serviceOne()
>>>  2.03    196.15      9.98  121176116    0.00     0.00  Sequencer::getRequestStatus(RubyRequest const&)
>
> _______________________________________________
> m5-dev mailing list
> m5-dev@m5sim.org
> http://m5sim.org/mailman/listinfo/m5-dev
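To put rough numbers on the per-call point: dividing the gprof self-times above by their call counts (a quick back-of-the-envelope sketch, using only figures from the profile) shows Histogram::add() is cheap per call but extremely hot, while PerfectSwitch::wakeup() is comparatively expensive each time it runs:

```python
# Back-of-the-envelope per-call costs from the gprof self-time and
# call counts in the flat profile above.
def per_call_ns(self_seconds, calls):
    """Average self-time per call, in nanoseconds."""
    return self_seconds / calls * 1e9

histogram_add = per_call_ns(35.89, 606750284)   # ~59 ns/call: cheap, just called a lot
perfect_switch = per_call_ns(15.42, 6875704)    # ~2240 ns/call: heavy per invocation
print(round(histogram_add), round(perfect_switch))
```

So Histogram::add() looks like a call-count problem (call it less, or make sure it's inlined), while PerfectSwitch::wakeup() looks like a per-call-cost problem.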
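On the 8.5-lookups-per-reference point: if the extra probes really are a coherence-invariant assertion, one way to keep them out of the fast path is to gate them behind an explicit check flag. A hypothetical sketch of the shape, with invented names (this is not Ruby's actual API; each cache is just modeled as a set of block addresses):

```python
# Hypothetical sketch of gating an expensive invariant check behind a
# flag (invented names -- not Ruby's actual API). The invariant probes
# every cache; the normal access probes only the one it needs.
CHECK_COHERENCE = False   # would be a build-time or command-line option

def holders(caches, addr):
    """Count how many caches hold the block -- one lookup per cache."""
    return sum(1 for cache in caches if addr in cache)

def lookup(caches, home, addr):
    if CHECK_COHERENCE:
        # invariant: at most one cache holds the block (exclusive state)
        assert holders(caches, addr) <= 1
    # the single lookup the access actually needs
    return addr in caches[home]
```

With the flag off, each reference costs one lookup instead of one per controller; turning it on restores the full check for debugging.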