Hi,

I'm currently implementing S-NUCA, and I've run into an issue with the way M5's MSHR and blocking mechanisms work while attempting to distribute incoming packets to several distinct UCA caches.

I've modelled the S-NUCA as a container of multiple individual regular UCA caches that serve as the banks, each with their own (smaller) hit latency plus interconnect latency depending on which CPU is accessing the bank. Since each port must be connected to one and only one peer port, I've created a bunch of internal SimpleTimingPorts to serve as the peers of the individual banks' cpuSide and memSide ports.
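To make that concrete, here is roughly how the container is laid out (SNUCACache and BankPort are names I'm using for this sketch only; apart from BaseCache and SimpleTimingPort, nothing here is an actual M5 class):

// Illustrative container layout, not actual M5 code.
class SNUCACache : public MemObject
{
  protected:
    // one regular UCA cache per bank
    std::vector<BaseCache *> banks;

    // internal SimpleTimingPort subclasses peering with each bank's
    // cpuSide and memSide ports, respectively
    std::vector<BankPort *> bankCpuPorts;
    std::vector<BankPort *> bankMemPorts;

    // per-bank interconnect latency as seen from the main CPU-side port
    std::vector<Tick> interconnectLatencies;
};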

The idea is that upon receiving a packet on the main CPU-side port, we examine which bank the request should go to (based on the low-order bits of the set index) and schedule it for departure from the associated internal port. Because each bank has its own interconnect latency, the overall access time for nearby banks may be lower than for banks that are farther away.
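Concretely, the bank selection boils down to a couple of bit operations, roughly like this (blkOffsetBits and numBanks are illustrative parameter names, and numBanks is assumed to be a power of two):

// Pick a bank based on the low-order bits of the set index.
unsigned
SNUCACache::selectBank(Addr addr) const
{
    Addr setIndex = addr >> blkOffsetBits;  // strip the block offset
    return setIndex & (numBanks - 1);       // low-order set-index bits
}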

An advantage of S-NUCA is that the entire cache needn't block if a single bank blocks. This is supported by means of the internal ports, as any packet sent to a blocked bank may remain queued in the internal port until it can be serviced by the bank. Meanwhile, the main CPU-side port can continue to accept packets for other banks. To implement this, I have the main CPU-side port distribute the packets to the internal ports and always signify success to its peer (unless of course all banks are blocked).
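In rough code, the distribution looks like this (a simplified sketch; snuca is a back-pointer from the port to the container, and allBanksBlocked() and the scheduling call stand in for my actual bookkeeping):

bool
SNUCACache::CpuSidePort::recvTiming(PacketPtr pkt)
{
    unsigned bank = snuca->selectBank(pkt->getAddr());

    if (snuca->allBanksBlocked()) {
        // nowhere left to queue anything: make the peer retry
        return false;
    }

    // queue the packet on the internal port peering with the chosen bank;
    // it departs after that bank's interconnect latency, or later still
    // if the bank is currently blocked
    snuca->bankCpuPorts[bank]->schedSendTiming(pkt,
            curTick + snuca->interconnectLatencies[bank]);

    return true;  // always signify success to our peer
}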

In the interest of validating my S-NUCA implementation with a single bank against a regular UCA cache with the same parameters, I've temporarily set the interconnect latencies to 0 and modified the internal ports to accept scheduling calls at curTick; normally these only allow scheduled packets to be sent at the next cycle. This works by inserting the packet at the right position in the transmitList exactly as it normally happens, and then immediately calling sendEvent->process() if the packet got inserted at the front of the queue. This works well.
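For reference, the modification amounts to something like this (insertIntoTransmitList() stands in for the existing, unmodified insertion code; transmitList and sendEvent are the SimpleTimingPort members mentioned above):

// Let a packet scheduled for curTick depart immediately if it ends up at
// the front of the transmitList, instead of waiting for the next cycle.
void
BankPort::schedSendTimingNow(PacketPtr pkt)
{
    bool atFront = insertIntoTransmitList(pkt, curTick);
    if (atFront) {
        // normally sendEvent would only fire on the next cycle; process
        // it right away so a 0-latency bank matches a plain UCA cache
        sendEvent->process();
    }
}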

While digging through the codebase to find an explanation for some of the remaining timing differences I encountered, I found that the way a Cache's memside port sends packets to the bus, and how that interacts with the MSHR chain, poses a problem for the way I'd like my S-NUCA to work.

It basically comes down to the fact that a regular Cache's memside port, when it successfully sends an MSHR request downstream, relies on the downstream device to already have processed the request and to have allocated an MSHR in case of a miss. This is supported by the bus, which basically takes the packet sent by the Cache's memside port and directly invokes sendTiming() on the downstream device. If the downstream device is a cache, this causes it to perform its whole timingAccess() call, which checks for a hit or a miss and allocates an MSHR. In other words, when the cache's memside port receives the return value "true" for its sendTiming call, it relies on the fact that at that time an MSHR must have already been allocated downstream if the request missed.

From studying the MSHR code, I understand that this is done in order to maintain the downstreamPending flags across the MSHR chain. A cache has no way of knowing whether its downstream device is going to be another cache or main memory, so it also has no way of knowing whether the MSHR request will receive a response at this level (because it might miss in another downstream cache). I also understand that for this reason, MSHRs are passed down in senderState, and that upon allocation, the downstreamPending flag of the "parent" MSHR is set.
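In code, the propagation is essentially the following (my paraphrase of the allocation path, not a verbatim copy of the M5 source):

// When a downstream cache misses on a request that carries an upstream
// MSHR in senderState, it flags that "parent" MSHR as downstreamPending
// while allocating its own MSHR.
MSHR *parent = dynamic_cast<MSHR *>(pkt->senderState);
if (parent != NULL) {
    // the upstream cache will not get its response until this new,
    // lower-level MSHR has been satisfied
    parent->downstreamPending = true;
}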

In this way, the mere fact of an MSHR getting allocated in a downstream device will cause the downstreamPending flag on the current-level MSHR to be set. A regular Cache relies on this behaviour to determine whether it is going to receive a response to the MSHR request it just sent out at this level; it can simply check whether the MSHR's downstreamPending flag was set, because if the request missed at the downstream device, the downstream device must have been a cache which must have allocated an MSHR, which must in turn have caused the downstreamPending flag in /this/ cache's MSHR to be set:


from Cache::MemSidePort::sendPacket:

            MSHR *mshr = dynamic_cast<MSHR*>(pkt->senderState);

            // this assumes instant request processing by the peer
            bool success = sendTiming(pkt);

            waitingOnRetry = !success;
            if (waitingOnRetry) {
                DPRINTF(CachePort, "now waiting on a retry\n");
                if (!mshr->isForwardNoResponse()) {
                    delete pkt;
                }
            } else {
                // this assumes the mshr->downstreamPending flag to have
                // been correctly set (or correctly remained untouched)
                // by the downstream device at this point
                myCache()->markInService(mshr, pkt);
            }


It's the markInService() call that checks whether the downstreamPending flag is set. If it isn't set, then no MSHR was allocated downstream, signifying that this cache will receive a response.
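Put differently, the assumption amounts to this (my paraphrase of the relevant bit of MSHR::markInService, not the literal source):

if (!downstreamPending) {
    // by the time sendTiming() returned, no downstream cache had
    // allocated an MSHR for this request, so it didn't miss further
    // down: a response is on its way back to this cache
}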

However, this only works if the call is indeed immediately processed by the downstream device to check for a hit or a miss. In order to avoid blocking the entire cache, my S-NUCA implementation might return success while the packet is still queued in an internal port, waiting for departure. The upper-level MSHR then checks its downstreamPending flag and may incorrectly conclude that the request won't miss, even though it still might once the packet is eventually sent from the internal port queue.

So at this point, I'm a bit at a loss as to what my options are. I tried to find out what the downstreamPending flag is used for, to see if there's anything I can do about this problem, and I found the comments below, but I don't understand a word of what is going on here:

from MSHR::handleSnoop:

    if (!inService || (pkt->isExpressSnoop() && downstreamPending)) {
        // Request has not been issued yet, or it's been issued
        // locally but is buffered unissued at some downstream cache
        // which is forwarding us this snoop.  Either way, the packet
        // we're snooping logically precedes this MSHR's request, so
        // the snoop has no impact on the MSHR, but must be processed
        // in the standard way by the cache.  The only exception is
        // that if we're an L2+ cache buffering an UpgradeReq from a
        // higher-level cache, and the snoop is invalidating, then our
        // buffered upgrades must be converted to read exclusives,
        // since the upper-level cache no longer has a valid copy.
        // That is, even though the upper-level cache got out on its
        // local bus first, some other invalidating transaction
        // reached the global bus before the upgrade did.
        if (pkt->needsExclusive()) {
            targets->replaceUpgrades();
            deferredTargets->replaceUpgrades();
        }

        return false;
    }

This is obviously related to cache coherence and handling snoops at the upper-level cache, which I guess may occur at any time while a packet is pending in an internal port downstream in the S-NUCA, so I suspect things may get hairy moving forward.

Another option I've considered is modifying the retry mechanism so that it is no longer an opaque "yes, keep sending me stuff"/"no, I'm blocked, wait for my retry", but instead issues retries for particular address ranges. This would allow my S-NUCA to selectively issue retries for the address ranges whose internal bank is blocked, but then SimpleTimingPort would also have to be modified to not just push the failed packet back to the front of the list and wait for an opaque retry, but to continue searching down the transmitList for any ready packets to address ranges it hasn't been told are blocked. I think it's an interesting idea, but I imagine there's a whole slew of fairness issues with that approach.
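To sketch the SimpleTimingPort half of that idea (entirely hypothetical: isBlockedRange() and the range-based retry protocol don't exist, and I'm assuming transmitList holds (tick, packet) pairs):

// Hypothetical: rather than stalling on the head of the transmitList after
// a failed send, skip packets whose addresses fall in ranges the peer has
// reported as blocked and return the first ready packet to an unblocked
// range.  transmitList is kept sorted by tick, so we can stop at the first
// entry that isn't ready yet.
PacketPtr
BankPort::nextSendablePacket()
{
    std::list<std::pair<Tick, PacketPtr> >::iterator i;
    for (i = transmitList.begin(); i != transmitList.end(); ++i) {
        if (i->first > curTick)
            break;  // nothing further down the list is ready yet
        if (!isBlockedRange(i->second->getAddr()))
            return i->second;
    }
    return NULL;
}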

On a side note, I've extensively documented all of the behaviour discussed above in code, and would be more than willing to contribute this back to the community. These timing issues turned out to be very important for my purposes, but were hidden away behind three levels of calls and an at-first-sight innocuous, entirely uncommented if (!downstreamPending) check buried somewhere in MSHR::markInService, so some comments in there about all these underlying assumptions definitely wouldn't hurt.

Cheers,
-- Jeroen