Hi,

I'm currently implementing S-NUCA, and I've run into an issue with the way M5's MSHR and blocking mechanisms work while attempting to distribute incoming packets to several distinct UCA caches.

I've modelled the S-NUCA as a container of multiple individual regular UCA caches that serve as the banks, each with their own (smaller) hit latency plus interconnect latency depending on which CPU is accessing the bank. Since each port must be connected to one and only one peer port, I've created a bunch of internal SimpleTimingPorts to serve as the peers of the individual banks' cpuSide and memSide ports.
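To make that concrete, here is roughly how the container is laid out (SNUCACache and BankPort are names I'm using for this sketch only; apart from BaseCache and SimpleTimingPort, nothing here is an actual M5 class):

// Illustrative container layout, not actual M5 code.
class SNUCACache : public MemObject
{
  protected:
    // one regular UCA cache per bank
    std::vector<BaseCache *> banks;

    // internal SimpleTimingPort subclasses peering with each bank's
    // cpuSide and memSide ports, respectively
    std::vector<BankPort *> bankCpuPorts;
    std::vector<BankPort *> bankMemPorts;

    // per-bank interconnect latency as seen from the main CPU-side port
    std::vector<Tick> interconnectLatencies;
};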

The idea is that upon receiving a packet on the main CPU-side port, we examine which bank the request should go to (based on the low-order bits of the set index) and schedule it for departure from the associated internal port. Because each bank has its own interconnect latency, the overall access time for nearby banks may be lower than for banks that are farther away.
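Concretely, the bank selection boils down to a couple of bit operations, roughly like this (blkOffsetBits and numBanks are illustrative parameter names, and numBanks is assumed to be a power of two):

// Pick a bank based on the low-order bits of the set index.
unsigned
SNUCACache::selectBank(Addr addr) const
{
    Addr setIndex = addr >> blkOffsetBits;  // strip the block offset
    return setIndex & (numBanks - 1);       // low-order set-index bits
}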

An advantage of S-NUCA is that the entire cache needn't block if a single bank blocks. This is supported by means of the internal ports, as any packet sent to a blocked bank may remain queued in the internal port until it can be serviced by the bank. Meanwhile, the main CPU-side port can continue to accept packets for other banks. To implement this, I have the main CPU-side port distribute the packets to the internal ports and always signify success to its peer (unless of course all banks are blocked).
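In rough code, the distribution looks like this (a simplified sketch; snuca is a back-pointer from the port to the container, and allBanksBlocked() and the scheduling call stand in for my actual bookkeeping):

bool
SNUCACache::CpuSidePort::recvTiming(PacketPtr pkt)
{
    unsigned bank = snuca->selectBank(pkt->getAddr());

    if (snuca->allBanksBlocked()) {
        // nowhere left to queue anything: make the peer retry
        return false;
    }

    // queue the packet on the internal port peering with the chosen bank;
    // it departs after that bank's interconnect latency, or later still
    // if the bank is currently blocked
    snuca->bankCpuPorts[bank]->schedSendTiming(pkt,
            curTick + snuca->interconnectLatencies[bank]);

    return true;  // always signify success to our peer
}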

In the interest of validating my S-NUCA implementation with a single bank against a regular UCA cache with the same parameters, I've temporarily set the interconnect latencies to 0 and modified the internal ports to accept scheduling calls at curTick; normally these only allow scheduled packets to be sent at the next cycle. This works by inserting the packet at the right position in the transmitList exactly as it normally happens, and then immediately calling sendEvent->process() if the packet got inserted at the front of the queue. This works well.
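For reference, the modification amounts to something like this (insertIntoTransmitList() stands in for the existing, unmodified insertion code; transmitList and sendEvent are the SimpleTimingPort members mentioned above):

// Let a packet scheduled for curTick depart immediately if it ends up at
// the front of the transmitList, instead of waiting for the next cycle.
void
BankPort::schedSendTimingNow(PacketPtr pkt)
{
    bool atFront = insertIntoTransmitList(pkt, curTick);
    if (atFront) {
        // normally sendEvent would only fire on the next cycle; process
        // it right away so a 0-latency bank matches a plain UCA cache
        sendEvent->process();
    }
}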

While digging through the codebase to find an explanation for some of the remaining timing differences I encountered, I found that the way a Cache's memside port sends packets to the bus, and how that interacts with the MSHR chain, poses a problem for the way I'd like my S-NUCA to work.

It basically comes down to the fact that a regular Cache's memside port, when it successfully sends an MSHR request downstream, relies on the downstream device to already have processed the request and to have allocated an MSHR in case of a miss. This is supported by the bus, which basically takes the packet sent by the Cache's memside port and directly invokes sendTiming() on the downstream device. If the downstream device is a cache, this causes it to perform its whole timingAccess() call, which checks for a hit or a miss and allocates an MSHR. In other words, when the cache's memside port receives the return value "true" for its sendTiming call, it relies on the fact that at that time an MSHR must have already been allocated downstream if the request missed.

From studying the MSHR code, I understand that this is done in order to maintain the downstreamPending flags across the MSHR chain. A cache has no way of knowing whether its downstream device is going to be another cache or main memory, so it also has no way of knowing whether the MSHR request will receive a response at this level (because it might miss in another downstream cache). I also understand that for this reason, MSHRs are passed down in senderState, and that upon allocation, the downstreamPending flag of the "parent" MSHR is set.
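In code, the propagation is essentially the following (my paraphrase of the allocation path, not a verbatim copy of the M5 source):

// When a downstream cache misses on a request that carries an upstream
// MSHR in senderState, it flags that "parent" MSHR as downstreamPending
// while allocating its own MSHR.
MSHR *parent = dynamic_cast<MSHR *>(pkt->senderState);
if (parent != NULL) {
    // the upstream cache will not get its response until this new,
    // lower-level MSHR has been satisfied
    parent->downstreamPending = true;
}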

In this way, the mere fact of an MSHR getting allocated in a downstream device will cause the downstreamPending flag on the current-level MSHR to be set. A regular Cache relies on this behaviour to determine whether it is going to receive a response to the MSHR request it just sent out at this level; it can simply check whether the MSHR's downstreamPending flag was set, because if the request missed at the downstream device, the downstream device must have been a cache which must have allocated an MSHR, which must in turn have caused the downstreamPending flag in /this/ cache's MSHR to be set:


from Cache::MemSidePort::sendPacket:

            MSHR *mshr = dynamic_cast<MSHR*>(pkt->senderState);

            // this assumes instant request processing by the peer
            bool success = sendTiming(pkt);

            waitingOnRetry = !success;
            if (waitingOnRetry) {
                DPRINTF(CachePort, "now waiting on a retry\n");
                if (!mshr->isForwardNoResponse()) {
                    delete pkt;
                }
            } else {
                // this assumes the mshr->downstreamPending flag to have
                // been correctly set (or correctly remained untouched)
                // by the downstream device at this point
                myCache()->markInService(mshr, pkt);
            }


It's the markInService() call that checks whether the downstreamPending flag is set. If it isn't set, then no MSHR was allocated downstream, signifying that this cache will receive a response.
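Put differently, the assumption amounts to this (my paraphrase of the relevant bit of MSHR::markInService, not the literal source):

if (!downstreamPending) {
    // by the time sendTiming() returned, no downstream cache had
    // allocated an MSHR for this request, so it didn't miss further
    // down: a response is on its way back to this cache
}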

However, this only works if the call is indeed immediately processed by the downstream device to check for a hit or a miss. In order to avoid blocking the entire cache, my S-NUCA implementation might return success while the packet is still queued in an internal port, waiting for departure. The upper-level MSHR then checks its downstreamPending flag and may incorrectly conclude that the request won't miss, even though it still might once the packet is eventually sent from the internal port queue.

So at this point, I'm a bit at a loss as to what my options are. I tried to find out what the downstreamPending flag is used for, to see if there's anything I can do about this problem, and I found the comments below, but I don't understand a word of what is going on here:

from MSHR::handleSnoop:

    if (!inService || (pkt->isExpressSnoop() && downstreamPending)) {
        // Request has not been issued yet, or it's been issued
        // locally but is buffered unissued at some downstream cache
        // which is forwarding us this snoop.  Either way, the packet
        // we're snooping logically precedes this MSHR's request, so
        // the snoop has no impact on the MSHR, but must be processed
        // in the standard way by the cache.  The only exception is
        // that if we're an L2+ cache buffering an UpgradeReq from a
        // higher-level cache, and the snoop is invalidating, then our
        // buffered upgrades must be converted to read exclusives,
        // since the upper-level cache no longer has a valid copy.
        // That is, even though the upper-level cache got out on its
        // local bus first, some other invalidating transaction
        // reached the global bus before the upgrade did.
        if (pkt->needsExclusive()) {
            targets->replaceUpgrades();
            deferredTargets->replaceUpgrades();
        }

        return false;
    }

This is obviously related to cache coherence and handling snoops at the upper-level cache, which I guess may occur at any time while a packet is pending in an internal port downstream in the S-NUCA, so I suspect things may get hairy moving forward.

Another option I've considered is modifying the retry mechanism so that it is no longer an opaque "yes, keep sending me stuff"/"no, I'm blocked, wait for my retry", but instead issues retries for particular address ranges. This would allow my S-NUCA to selectively issue retries for the address ranges whose internal bank is blocked, but then SimpleTimingPort would also have to be modified to not just push the failed packet back to the front of the list and wait for an opaque retry, but to continue searching down the transmitList for any ready packets to address ranges it hasn't been told are blocked. I think it's an interesting idea, but I imagine there's a whole slew of fairness issues with that approach.
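To sketch the SimpleTimingPort half of that idea (entirely hypothetical: isBlockedRange() and the range-based retry protocol don't exist, and I'm assuming transmitList holds (tick, packet) pairs):

// Hypothetical: rather than stalling on the head of the transmitList after
// a failed send, skip packets whose addresses fall in ranges the peer has
// reported as blocked and return the first ready packet to an unblocked
// range.  transmitList is kept sorted by tick, so we can stop at the first
// entry that isn't ready yet.
PacketPtr
BankPort::nextSendablePacket()
{
    std::list<std::pair<Tick, PacketPtr> >::iterator i;
    for (i = transmitList.begin(); i != transmitList.end(); ++i) {
        if (i->first > curTick)
            break;  // nothing further down the list is ready yet
        if (!isBlockedRange(i->second->getAddr()))
            return i->second;
    }
    return NULL;
}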

On a side note, I've extensively documented all of the behaviour discussed above in code, and would be more than willing to contribute this back to the community. These timing issues turned out to be very important for my purposes, but were hidden away behind three levels of calls and an at-first-sight innocuous, entirely uncommented if (!downstreamPending) check buried somewhere in MSHR::markInService, so some comments in there about all these underlying assumptions definitely wouldn't hurt.

Cheers,
-- Jeroen