Hi,
I'm currently implementing S-NUCA, and I've run into an issue with the
way M5's MSHR and blocking mechanisms work while attempting to
distribute incoming packets to several distinct UCA caches.
I've modelled the S-NUCA as a container of multiple individual regular
UCA caches that serve as the banks, each with its own (smaller) hit
latency plus an interconnect latency that depends on which CPU is accessing
the bank. Since each port must be connected to one and only one peer
port, I've created a bunch of internal SimpleTimingPorts to serve as the
peers of the individual banks' cpuSide and memSide ports.
The idea is that upon receiving a packet on the main CPU-side port, we
examine which bank to send the request to (based on the low-order bits
of the set index) and schedule it for departure from the associated
internal port. Because each bank has its own interconnect latency, the
overall access time for nearby banks may be lower than that for banks
farther away.
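The mapping itself is just the obvious power-of-two one; here's a
standalone sketch of what I mean, where the 64-byte block size, the
16-bank count and the bankForAddr name are made-up examples, not
anything taken from M5:

    // Standalone illustration only -- block size and bank count are
    // made-up example parameters, and bankForAddr is my own name.
    #include <stdint.h>

    static const int blkOffsetBits = 6;   // log2 of a 64-byte block (assumed)
    static const int numBanks      = 16;  // power-of-two bank count (assumed)

    // Pick a bank from the low-order bits of the set index.
    int
    bankForAddr(uint64_t addr)
    {
        uint64_t setIndex = addr >> blkOffsetBits;
        return (int)(setIndex & (numBanks - 1));
    }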
An advantage of S-NUCA is that the entire cache needn't block if a
single bank blocks. This is supported by means of the internal ports, as
any packet sent to a blocked bank may remain queued in the internal port
until it can be serviced by the bank. Meanwhile, the main CPU-side port
can continue to accept packets for other banks. To implement this, I
have the main CPU-side port distribute the packets to the internal ports
and always signify success to its peer (unless of course all banks are
blocked).
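In rough pseudo-M5 terms, the main CPU-side port's receive path then
looks something like the sketch below. Note that SNUCACache, bankForAddr,
allBanksBlocked, scheduleToBank and interconnectLatency are placeholder
names for my own code, not existing M5 interfaces:

    // Rough paraphrase of my code -- the names above are placeholders,
    // not existing M5 interfaces.
    bool
    SNUCACache::CpuSidePort::recvTiming(PacketPtr pkt)
    {
        if (snuca->allBanksBlocked())
            return false;             // only NACK when no bank can accept

        int bank = snuca->bankForAddr(pkt->getAddr());

        // Queue the packet on that bank's internal peer port; it departs
        // after the bank's interconnect latency, or sits in the internal
        // port's transmitList if the bank is currently blocked.
        snuca->scheduleToBank(bank, pkt,
                              curTick + snuca->interconnectLatency(bank));
        return true;                  // accept even if this one bank is blocked
    }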
In the interest of validating my S-NUCA implementation with a single
bank against a regular UCA cache with the same parameters, I've
temporarily set the interconnect latencies to 0 and modified the
internal ports to accept scheduling calls at curTick, as these normally
only allow for scheduled packets to be sent at the next cycle. This
basically works by inserting the packet at the right position in the
transmitList in the exact same way it normally happens, and then
immediately calling sendEvent->process() if the packet got inserted at
the front of the queue. This works well.
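Concretely, the tweak boils down to something like this (paraphrasing my
modified SimpleTimingPort; schedSendTimingNow and insertIntoTransmitList
are my own names, and I'm glossing over the bookkeeping the real
transmitList entries carry):

    // Paraphrase of my modification -- schedSendTimingNow and
    // insertIntoTransmitList are my own additions, not stock M5.
    void
    SimpleTimingPort::schedSendTimingNow(PacketPtr pkt)
    {
        // Insert into transmitList at the proper position, exactly as the
        // normal scheduling path does, but with a departure time of curTick.
        bool atHead = insertIntoTransmitList(pkt, curTick);

        // If the packet landed at the head of the queue, fire the send
        // event right away instead of waiting for the next cycle.
        if (atHead)
            sendEvent->process();
    }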
While digging through the codebase to find an explanation for some of
the remaining timing differences I encountered, I found that the way a
Cache's memside port sends packets to the bus, and how that interacts
with the MSHR chain, poses a problem for how I'd like my S-NUCA to work.
It basically comes down to the fact that a regular Cache's memside port,
when it successfully sends an MSHR request downstream, relies on the
downstream device to already have processed the request and to have
allocated an MSHR in case of a miss. This is supported by the bus, which
basically takes the packet sent by the Cache's memside port and directly
invokes sendTiming() on the downstream device. If the downstream device
is a cache, this causes it to perform its whole timingAccess() call,
which checks for a hit or a miss and allocates an MSHR. In other words,
when the cache's memside port receives the return value "true" for its
sendTiming call, it relies on the fact that at that time an MSHR must
have already been allocated downstream if the request missed.
From studying the MSHR code, I understand that this is done in order to
maintain the downstreamPending flags across the MSHR chain. A cache has
no way of knowing whether its downstream device is going to be another
cache or main memory, so it also has no way of knowing whether the MSHR
request will receive a response at this level (because it might miss in
another downstream cache). I also understand that for this reason, MSHRs
are passed down in senderState, and that upon allocation, the
downstreamPending flag of the "parent" MSHR is set.
In this way, the mere fact of an MSHR getting allocated in a downstream
device will cause the downstreamPending flag on the current-level MSHR
to be set. A regular Cache relies on this behaviour to determine whether
it is going to receive a response to the MSHR request it just sent out
at this level; it can simply check whether the MSHR's downstreamPending
flag was set, because if the request missed at the downstream device,
the downstream device must have been a cache which must have allocated
an MSHR, which must in turn have caused the downstreamPending flag in
/this/ cache's MSHR to be set:
from Cache::MemSidePort::sendPacket:
    MSHR *mshr = dynamic_cast<MSHR*>(pkt->senderState);
    bool success = sendTiming(pkt);  // this assumes instant request
                                     // processing by the peer
    waitingOnRetry = !success;
    if (waitingOnRetry) {
        DPRINTF(CachePort, "now waiting on a retry\n");
        if (!mshr->isForwardNoResponse()) {
            delete pkt;
        }
    } else {
        myCache()->markInService(mshr, pkt);  // this assumes the
            // mshr->downstreamPending flag to have been correctly set (or
            // correctly remained untouched) by the downstream device at
            // this point
    }
It's the markInService() call that checks whether the downstreamPending
flag is set. If it isn't set, then no MSHR was allocated downstream,
signifying that this cache will receive a response at this level.
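For completeness, my mental model of how the flag gets set on the
downstream side is roughly the following; this is a paraphrase of the
behaviour described above, not the literal M5 source:

    // Paraphrase, not literal M5 code.  When a lower-level cache misses
    // and allocates its own MSHR for a request that carries the upstream
    // MSHR in senderState, roughly:
    MSHR *parent = dynamic_cast<MSHR *>(pkt->senderState);
    if (parent != NULL) {
        // Record in the upstream MSHR that its request is now also
        // buffered in a downstream MSHR; the upstream markInService()
        // reads exactly this flag.
        parent->downstreamPending = true;
    }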
However, this whole mechanism only works if the sendTiming() call is
indeed immediately processed by the downstream device, so that the
hit/miss determination has already happened. To avoid blocking the
entire cache, my S-NUCA implementation might return success while the
packet is still queued in an internal port, waiting for departure. The
upper-level cache then checks its MSHR's downstreamPending flag and may
incorrectly conclude that the request won't miss, even though it still
might once the packet is eventually sent from the internal port queue.
So at this point, I'm a bit at a loss as to what my options are. I tried
to find out what the downstreamPending flag is used for, to see if
there's anything I can do about this problem, and I found the comments
below, but I don't understand a word of what is going on here:
from MSHR::handleSnoop:
    if (!inService || (pkt->isExpressSnoop() && downstreamPending)) {
        // Request has not been issued yet, or it's been issued
        // locally but is buffered unissued at some downstream cache
        // which is forwarding us this snoop.  Either way, the packet
        // we're snooping logically precedes this MSHR's request, so
        // the snoop has no impact on the MSHR, but must be processed
        // in the standard way by the cache.  The only exception is
        // that if we're an L2+ cache buffering an UpgradeReq from a
        // higher-level cache, and the snoop is invalidating, then our
        // buffered upgrades must be converted to read exclusives,
        // since the upper-level cache no longer has a valid copy.
        // That is, even though the upper-level cache got out on its
        // local bus first, some other invalidating transaction
        // reached the global bus before the upgrade did.
        if (pkt->needsExclusive()) {
            targets->replaceUpgrades();
            deferredTargets->replaceUpgrades();
        }
        return false;
    }
This is obviously related to cache coherence and handling snoops at the
upper-level cache, which I guess may occur at any time while a packet is
pending in an internal port downstream in the S-NUCA, so I suspect
things may get hairy moving forward.
Another option I've considered is modifying the retry mechanism to no
longer be an opaque "yes, keep sending me stuff"/"no I'm blocked, wait
for my retry" but instead issue retries for particular address ranges.
This would allow my S-NUCA to restrict its retries to the address ranges
whose internal bank is blocked. SimpleTimingPort would then also have to
be modified to not just push the failed packet back to the front of the
list and wait for an opaque retry, but to continue searching down the
transmitList for any ready packets to address ranges it hasn't been told
are blocked yet. I think it's an interesting
idea, but I imagine there's a whole slew of fairness issues with that
approach.
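To make that idea a bit more concrete, the kind of helper I have in mind
for the transmitList side would look something like the sketch below;
everything here is hypothetical, none of it exists in M5 today:

    // Entirely hypothetical sketch -- none of this exists in M5.  The idea:
    // the blocked/retry signal carries an address range, and on a failed
    // send the port scans past packets to blocked ranges instead of
    // parking at the head of transmitList.
    #include <stdint.h>
    #include <vector>

    typedef uint64_t Addr;

    struct AddrRange
    {
        Addr start, end;
        bool contains(Addr a) const { return a >= start && a <= end; }
    };

    struct QueuedPkt { Addr addr; /* Tick when, PacketPtr pkt, ... */ };

    // Return the index of the first queued packet whose address does not
    // fall in a range the peer has reported as blocked, or -1 if none.
    int
    firstSendable(const std::vector<QueuedPkt> &transmitList,
                  const std::vector<AddrRange> &blockedRanges)
    {
        for (size_t i = 0; i < transmitList.size(); ++i) {
            bool blocked = false;
            for (size_t j = 0; j < blockedRanges.size(); ++j) {
                if (blockedRanges[j].contains(transmitList[i].addr)) {
                    blocked = true;
                    break;
                }
            }
            if (!blocked)
                return (int) i;
        }
        return -1;
    }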
On a side note, I've extensively documented all the behaviour discussed
above in code comments, and would be more than willing to contribute
this back to the community. These timing issues turned out to be very
important for my purposes, but they were hidden behind three levels of
calls and a seemingly innocuous, entirely uncommented
if(!downstreamPending) check buried in MSHR::markInService, so some
comments in there about all these underlying assumptions definitely
wouldn't hurt.
Cheers,
-- Jeroen