On Wed, Jan 19, 2011 at 3:56 PM, Nilay Vaish <ni...@cs.wisc.edu> wrote:
> Some more data from the same simulation, this time from m5out/stats.txt.
> The number of memory references is slightly more than 30,000,000, whereas
> the number of lookups in the cache is about 256,000,000. So that would be
> a ratio of 1 : 8.5. I suspect that the reason for this might be that every
> cache controller performs a lookup for every memory reference.

Do you mean that you do a lookup in all 8 caches for each reference? Is this
part of an assertion that's checking coherence invariants? It seems like
that would be something we'd want to do only in debug mode (or maybe only
when explicitly enabled).

> It seems to me that only one thread was running all the time, executing
> mostly on processors 0 and 4. Is the rate at which sim ticks are
> incremented different from the processor clock rate? The simulation was
> allowed to run for 200 billion sim ticks while the processor cycle count
> is just 400 million.

Ticks are typically picoseconds, so assuming a 2 GHz CPU clock that looks
right.

> Given the way the processors behaved during this simulation, I think we
> should profile a simulation in which all the processors are active most
> of the time.

I'm not sure that would affect the simulator performance profile that much.

Steve

> --
> Nilay
>
> system.cpu0.not_idle_fraction      0.170783  # Percentage of non-idle cycles
> system.cpu0.numCycles             399995398  # number of cpu cycles simulated
> system.cpu0.num_insts               9504409  # Number of instructions executed
> system.cpu0.num_refs                2258648  # Number of memory references
>
> system.cpu1.not_idle_fraction      0.002503  # Percentage of non-idle cycles
> system.cpu1.numCycles             398441993  # number of cpu cycles simulated
> system.cpu1.num_insts                212019  # Number of instructions executed
> system.cpu1.num_refs                  71465  # Number of memory references
>
> CPUs 2 and 3 behave the same as CPU 1.
> system.cpu4.not_idle_fraction      0.823388  # Percentage of non-idle cycles
> system.cpu4.numCycles             400000000  # number of cpu cycles simulated
> system.cpu4.num_insts              80686910  # Number of instructions executed
> system.cpu4.num_refs               28047398  # Number of memory references
>
> system.cpu5.not_idle_fraction      0.000185  # Percentage of non-idle cycles
> system.cpu5.numCycles             399900914  # number of cpu cycles simulated
> system.cpu5.num_insts                  8413  # Number of instructions executed
> system.cpu5.num_refs                   2862  # Number of memory references
>
> CPUs 6 and 7 also have stats similar to CPU 5.
>
> On Wed, 19 Jan 2011, Steve Reinhardt wrote:
>
>> What's the deal with Histogram::add()? Either it's too slow or it's
>> being called too much, I'd say, unless we're tracking some incredibly
>> vital statistics there. Can you use the call-graph part of the profile
>> to find where most of the calls are coming from?
>>
>> Also, can you look at the stats and verify that the number of calls to
>> CacheMemory::lookup() is not much larger than the total number of CPU
>> memory references?
>>
>> Another thing to note is that even though PerfectSwitch::wakeup() is not
>> a huge amount of time, the time per call is pretty large, so there may
>> be opportunities for a lot of improvement there.
>>
>> Thanks,
>>
>> Steve
>>
>> On Wed, Jan 19, 2011 at 9:20 AM, Nilay Vaish <ni...@cs.wisc.edu> wrote:
>>
>>> I profiled m5 again, using the following command:
>>>
>>> ./build/ALPHA_FS_MOESI_hammer/m5.prof ./configs/example/ruby_fs.py
>>> --maxtick 200000000000 -n 8 --topology Mesh --mesh-rows 2
>>> --num-l2cache 8 --num-dir 8
>>>
>>> Results have been copied below. CacheMemory::lookup() still consumes
>>> some time, but a lot less than before. PerfectSwitch could be a
>>> candidate for redesign, but given that PerfectSwitch is taking only 3%
>>> of the time, it would not yield that much gain. Maybe we need to look
>>> at these different functions in a more holistic fashion.
>>> --
>>> Nilay
>>>
>>>   %   cumulative    self               self    total
>>>  time   seconds   seconds     calls   s/call   s/call  name
>>>  7.29     35.89     35.89  606750284    0.00     0.00  Histogram::add(long long)
>>>  5.84     64.62     28.73  256533483    0.00     0.00  CacheMemory::lookup(Address const&)
>>>  4.56     87.06     22.44  124360139    0.00     0.00  L1Cache_Controller::wakeup()
>>>  3.88    106.18     19.12  121283110    0.00     0.00  RubyPort::M5Port::recvTiming(Packet*)
>>>  3.13    121.60     15.42    6875704    0.00     0.00  PerfectSwitch::wakeup()
>>>  3.00    136.38     14.78   39527382    0.00     0.00  MemoryControl::executeCycle()
>>>  2.95    150.91     14.53   90855686    0.00     0.00  BaseSimpleCPU::preExecute()
>>>  2.68    164.10     13.19  147750111    0.00     0.00  MessageBuffer::enqueue(RefCountingPtr<Message>, long long)
>>>  2.41    175.96     11.86  180281626    0.00     0.00  RubyEventQueueNode::process()
>>>  2.07    186.17     10.21  302054741    0.00     0.00  EventQueue::serviceOne()
>>>  2.03    196.15      9.98  121176116    0.00     0.00  Sequencer::getRequestStatus(RubyRequest const&)
>
> _______________________________________________
> m5-dev mailing list
> m5-dev@m5sim.org
> http://m5sim.org/mailman/listinfo/m5-dev
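To put rough numbers on the per-call point: dividing the gprof self-times above by their call counts (a quick back-of-the-envelope sketch, using only figures from the profile) shows Histogram::add() is cheap per call but extremely hot, while PerfectSwitch::wakeup() is comparatively expensive each time it runs:

```python
# Back-of-the-envelope per-call costs from the gprof self-time and
# call counts in the flat profile above.
def per_call_ns(self_seconds, calls):
    """Average self-time per call, in nanoseconds."""
    return self_seconds / calls * 1e9

histogram_add = per_call_ns(35.89, 606750284)   # ~59 ns/call: cheap, just called a lot
perfect_switch = per_call_ns(15.42, 6875704)    # ~2240 ns/call: heavy per invocation
print(round(histogram_add), round(perfect_switch))
```

So Histogram::add() looks like a call-count problem (call it less, or make sure it's inlined), while PerfectSwitch::wakeup() looks like a per-call-cost problem.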
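On the 8.5-lookups-per-reference point: if the extra probes really are a coherence-invariant assertion, one way to keep them out of the fast path is to gate them behind an explicit check flag. A hypothetical sketch of the shape, with invented names (this is not Ruby's actual API; each cache is just modeled as a set of block addresses):

```python
# Hypothetical sketch of gating an expensive invariant check behind a
# flag (invented names -- not Ruby's actual API). The invariant probes
# every cache; the normal access probes only the one it needs.
CHECK_COHERENCE = False   # would be a build-time or command-line option

def holders(caches, addr):
    """Count how many caches hold the block -- one lookup per cache."""
    return sum(1 for cache in caches if addr in cache)

def lookup(caches, home, addr):
    if CHECK_COHERENCE:
        # invariant: at most one cache holds the block (exclusive state)
        assert holders(caches, addr) <= 1
    # the single lookup the access actually needs
    return addr in caches[home]
```

With the flag off, each reference costs one lookup instead of one per controller; turning it on restores the full check for debugging.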