On Mon, 24 Jan 2011, Steve Reinhardt wrote:

On Sun, Jan 23, 2011 at 4:08 PM, Nilay Vaish <ni...@cs.wisc.edu> wrote:

On Sun, 23 Jan 2011, Korey Sewell wrote:

In sendFetch(), the CPU calls sendTiming(), which in turn calls
recvTiming() on the cache port, since those two ports should be bound
as peers.

I'm a little unsure of how the RubyPort, Sequencer, CacheMemory, and
CacheController (?) relationship works (right now at least), but the
relationship between sendTiming() and recvTiming() is the key concept
that connects two memory objects, unless things have changed.
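
The peer relationship is roughly this (a minimal sketch; the real m5
Port class carries more machinery, and Packet is left opaque here):

    struct Packet;                       // stand-in for m5's Packet
    typedef Packet* PacketPtr;

    class Port
    {
      public:
        Port() : peer(0) {}
        virtual ~Port() {}

        // Ports are bound pairwise at configuration time.
        void setPeer(Port* p) { peer = p; }

        // Sending a timing request simply invokes recvTiming() on
        // whatever port this one was bound to.
        bool sendTiming(PacketPtr pkt) { return peer->recvTiming(pkt); }

        // Each memory object implements this to accept requests.
        virtual bool recvTiming(PacketPtr pkt) = 0;

      protected:
        Port* peer;                      // the peer port
    };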

On Sun, Jan 23, 2011 at 3:51 PM, Nilay Vaish <ni...@cs.wisc.edu> wrote:

I dug more into the code today. There are three paths along which
calls are made to RubyPort::M5Port::recvTiming(), which eventually
result in calls to CacheMemory::lookup():
1. TimingSimpleCPU::sendFetch() - 140 million
2. TimingSimpleCPU::handleReadPacket() - 30 million
3. TimingSimpleCPU::handleWritePacket() - 18 million

The number of times the last two functions are called is very close to
the total number of memory references (48 million) for all the CPUs
together. The number of lookup() calls is about 392 million. If we
take the calls to sendFetch() into account, the ratio of lookup()
calls to requests pushed into Ruby drops to about 2:1, from the
earlier estimate of 8:1.
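
To spell out the arithmetic: 392M lookups / 48M data references is
roughly 8:1 (the earlier estimate), while 392M / (140M fetches + 48M
data references) = 392M / 188M, or roughly 2:1.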

My question is: why does sendFetch() make calls to recvTiming()?


Some more reading revealed that sendFetch() calls recvTiming() for
instruction cache accesses, whereas the other two calls
(handleReadPacket() and handleWritePacket()) are for data cache
accesses.


Yes, that's right.  So there's probably no big win in trying to further
reduce the number of calls to lookup() in Ruby; the possibilities I see for
improvement are:
1. Adding an instruction buffer to SimpleCPU so we don't do a cache lookup
on *every* instruction fetch
2. Trying again to make the lookup() calls themselves faster (for example, a
lookup that hits the MRU block should really only take a handful of
instructions, while IIRC we were seeing much larger costs for the hash table
lookup)
3. Moving on to some other area (like the Histogram thing)

#1 is not a Ruby issue, and could well be different under x86 since (1) x86
has a byte-stream-oriented predecoder so it doesn't do a fetch per
instruction anyway and (2) you may have to worry about self-modifying code.
Gabe, how many bytes at a time does the x86 predecoder fetch?  If it
doesn't currently grab a cache line at a time, could it be made to do so,
and do you know if that would cause any issues with SMC?

Nilay, I'd appreciate your comments on #2 and whether you think it's
worth pursuing, or whether we should move on to #3.

Steve


I now understand why the ratio is 2:1. Before every instruction fetch,
the data cache is looked up to make sure that it does not contain the
cache block; after that, the instruction cache is looked up. Similarly,
before any data access, the instruction cache is looked up first. This
is probably for correctly handling self-modifying code.
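
In code terms, the pattern is roughly this (illustrative only; these
are not the actual RubyPort functions, and the stubs below stand in
for the real Ruby classes):

    #include <stdint.h>

    // Illustrative stand-ins for the real Ruby classes.
    struct Packet { bool isInstFetch; uint64_t lineAddr; };

    struct CacheMemory {
        bool lookup(uint64_t lineAddr) { return false; }   // stub
    };

    // Every CPU request produces two lookup() calls: one in the
    // "other" L1, to confirm the block does not live in both caches
    // (which keeps self-modifying code coherent), and one in the
    // cache the request actually targets.
    void handleRequest(Packet& pkt, CacheMemory& icache, CacheMemory& dcache)
    {
        CacheMemory& target = pkt.isInstFetch ? icache : dcache;
        CacheMemory& other  = pkt.isInstFetch ? dcache : icache;

        other.lookup(pkt.lineAddr);    // check: block must not be here
        target.lookup(pkt.lineAddr);   // the actual access
    }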

Steve, we can try caching the MRU cache block. We can also try
replacing the hash table with a two-dimensional array indexed by cache
set and cache way.
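
A rough sketch of what that could look like (hypothetical names; the
real CacheMemory stores entry pointers and valid bits, which I am
glossing over here):

    #include <stdint.h>
    #include <vector>

    // Hypothetical sketch: tags in a [set][way] array instead of a
    // hash table, with a one-entry cache of the last hit in front of
    // the search, so repeated lookups of the MRU block cost only a
    // handful of instructions.
    class CacheTags
    {
      public:
        CacheTags(int numSets, int numWays)
            : m_tags(numSets, std::vector<uint64_t>(numWays, 0)),
              m_lastSet(-1), m_lastWay(-1) {}

        // Returns the way holding 'tag' in 'set', or -1 if absent.
        // (A real version would also check valid bits so that an
        // empty slot cannot false-hit on tag 0.)
        int findTagInSet(int set, uint64_t tag)
        {
            // Fast path: same block as the last hit.
            if (set == m_lastSet && m_tags[set][m_lastWay] == tag)
                return m_lastWay;

            // Slow path: linear scan of the ways in this set.
            for (int way = 0; way < (int)m_tags[set].size(); ++way) {
                if (m_tags[set][way] == tag) {
                    m_lastSet = set;
                    m_lastWay = way;
                    return way;
                }
            }
            return -1;
        }

      private:
        std::vector<std::vector<uint64_t> > m_tags;  // [set][way]
        int m_lastSet;                               // last hit: set
        int m_lastWay;                               // last hit: way
    };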

There are calls to CacheMemory::isTagPresent() in Sequencer.cc, made
just before the calls to setMRU(). I am thinking of folding these
isTagPresent() checks into setMRU(), which calls
CacheMemory::findTagInSet() anyway.
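
Something along these lines (a sketch with paraphrased names, assuming
findTagInSet() returns a negative value when the tag is absent):

    #include <stdint.h>

    // Paraphrased sketch of the proposed change; not the actual
    // CacheMemory class, and the private helpers are stubs.
    class CacheMemorySketch
    {
      public:
        // Before: callers do isTagPresent(addr) and then setMRU(addr),
        // paying for the tag search twice.  After: setMRU() absorbs
        // the presence check, since it needs the way index anyway.
        void setMRU(uint64_t addr)
        {
            int set = addressToSet(addr);
            int way = findTagInSet(set, addr);
            if (way < 0)
                return;                 // tag absent: nothing to touch
            touchReplacementState(set, way);
        }

      private:
        int addressToSet(uint64_t addr) { return (int)(addr & 0xFF); }  // stub
        int findTagInSet(int set, uint64_t addr) { return -1; }         // stub
        void touchReplacementState(int set, int way) {}                 // stub
    };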

--
Nilay
_______________________________________________
m5-dev mailing list
m5-dev@m5sim.org
http://m5sim.org/mailman/listinfo/m5-dev
