Thanks for the detailed analysis... this code is complex, I agree; I wrote
it, and when I have to get back into it to fix a bug it always takes a while
to recall all the detailed interactions.

I don't completely follow why downstreamPending is causing problems for you,
but I can elaborate a little on its purpose, which I hope may help.  The
protocol assumes an arbitrarily deep hierarchy of busses, and conflicting
accesses are ordered according to which access is the first to reach the
nearest bus that's common to both requesters.  Also, as an invalidating
request hits each level of bus, the invalidation is atomically and
instantaneously broadcast to all caches above that bus (using the
expressSnoop feature).

The basic problem is that once a cache has sent out a request and is
awaiting the response, it can snoop an invalidation that's been propagated
upward, and it absolutely needs to know whether that invalidation belongs to
a request that precedes or succeeds its own request.  However, that isn't
easily determined, since it takes time both for the request to propagate
downward until it is satisfied, and for the response to propagate back
upwards.  So the invalidation could have arrived at a lower-level bus before
the cache's request got there (meaning the invalidation comes first), or the
cache's request could have come first but the invalidation could have passed
the response on the way back up (since the invalidations use the magic
expressSnoop path).  The downstreamPending flags and the
clearDownstreamPending() mechanism solve this problem by providing an
instantaneous mechanism to notify all the upstream caches when a request is
satisfied.  If an invalidation is snooped while downstreamPending is true,
the invalidation is ordered before the request; if downstreamPending is
false, the request has already been satisfied at some cache level and hence
is ordered before the invalidation.
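
In code terms, the rule amounts to roughly this (a condensed paraphrase for
illustration, not the actual M5 source; the member names follow the real
MSHR, but the helper function is made up and whether those members are
directly accessible doesn't matter for the point):

    // Sketch: when an invalidating snoop arrives, decide whether it is
    // ordered before this MSHR's own outstanding request.
    bool snoopPrecedesRequest(const MSHR &mshr)
    {
        // Not issued yet, or issued but not yet satisfied at any level:
        // the invalidation reached the common bus first, so it wins.
        if (!mshr.inService || mshr.downstreamPending)
            return true;

        // Otherwise some cache level has already satisfied the request
        // (clearDownstreamPending() has propagated back up), so the
        // request is ordered before the invalidation.
        return false;
    }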

Now that I've gone through all that, it seems like the problem is that the
requesting cache's downstreamPending is false by default, and is only set to
true if the downstream cache misses; you kind of want the opposite, which is
that downstreamPending is true by default and only cleared if the downstream
cache hits.  You could get that effect by setting downstreamPending in the
requesting cache's MSHR when you buffer the request, then explicitly calling
clearDownstreamPending() if it hits in the downstream cache, or perhaps you
could actually change the default setting in the cache to avoid these
contortions in your code.
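
Concretely, the first option would look roughly like this at the point where
your S-NUCA buffers a request into an internal port (a sketch only;
bufferOnInternalPort() and the bank-hit hook are stand-ins for whatever your
code actually does there, and setting the flag directly is just for
illustration):

    // Pessimistically assume a downstream miss as soon as the request is
    // buffered, before reporting success upstream.
    MSHR *mshr = dynamic_cast<MSHR*>(pkt->senderState);
    if (mshr)
        mshr->downstreamPending = true;
    bufferOnInternalPort(pkt);   // stand-in: queue on the internal port

    // ...later, when the bank finally sees the packet and it hits, cancel
    // the assumption; clearDownstreamPending() also propagates the
    // notification to any caches further upstream.
    if (mshr)
        mshr->clearDownstreamPending();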

The whole flow control/retry interface is one we've gone back and forth on
quite a bit, but despite the limitations of the current setup we've never
come up with a better replacement.  As you point out, doing something like
using address ranges would possibly be a big change in a number of places
(though most devices do derive from SimpleTimingPort, so maybe it's more
localized there than it seems).

Incidentally, the complexity and fragility of the coherence protocol is one
of the reasons we're integrating the GEMS Ruby memory model, to provide a
more flexible memory system.  Unfortunately that's still in progress, and
right now Ruby is also quite a bit slower, but we're working on that.

Steve

On Sat, Nov 27, 2010 at 11:45 AM, Jeroen DR <[email protected]> wrote:

>  Hi,
>
> I'm currently implementing S-NUCA, and I've run into an issue with the way
> M5's MSHR and blocking mechanisms work while attempting to distribute
> incoming packets to several distinct UCA caches.
>
> I've modelled the S-NUCA as a container of multiple individual regular UCA
> caches that serve as the banks, each with their own (smaller) hit latency
> plus interconnect latency depending on which CPU is accessing the bank.
> Since each port must be connected to one and only one peer port, I've
> created a bunch of internal SimpleTimingPorts to serve as the peers of the
> individual banks' cpuSide and memSide ports.
>
> The idea is that upon receiving a packet on the main CPU-side port, we
> examine which bank to send the request to (based on the low-order bits of
> the set index) and schedule it for departure from the associated internal
> port. Because each bank has its own interconnect latency, the overall access
> time for nearby banks may be lower than that for banks that are farther
> away.
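>
> For reference, the bank selection itself is simple; roughly (a sketch, with
> numBanks and setShift standing in for my actual parameters, and numBanks
> assumed to be a power of two):
>
>     // Pick a bank from the low-order bits of the set index.
>     int bankFor(Addr addr) const
>     {
>         Addr setIndex = addr >> setShift;   // setShift = log2(block size)
>         return static_cast<int>(setIndex & (numBanks - 1));
>     }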
>
> An advantage of S-NUCA is that the entire cache needn't block if a single
> bank blocks. This is supported by means of the internal ports, as any packet
> sent to a blocked bank may remain queued in the internal port until it can
> be serviced by the bank. Meanwhile, the main CPU-side port can continue to
> accept packets for other banks. To implement this, I have the main CPU-side
> port distribute the packets to the internal ports and always signify success
> to its peer (unless of course all banks are blocked).
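>
> In rough outline, the main CPU-side port's receive handler does something
> like this (a sketch, not the literal code; SNUCA, snuca, bankFor(),
> allBanksBlocked() and latencyTo[] are just names for my own plumbing, and
> internalPort[] are my own SimpleTimingPort subclasses that expose the
> scheduling call):
>
>     bool SNUCA::CpuSidePort::recvTiming(PacketPtr pkt)
>     {
>         if (snuca->allBanksBlocked())
>             return false;     // only NACK when nothing can make progress
>
>         int bank = snuca->bankFor(pkt->getAddr());
>         // Queue on the internal port facing that bank; if the bank is
>         // blocked, the packet just waits there.
>         snuca->internalPort[bank]->schedSendTiming(
>             pkt, curTick + snuca->latencyTo[bank]);
>         return true;          // report success to the peer regardless
>     }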
>
> In the interest of validating my S-NUCA implementation with a single bank
> against a regular UCA cache with the same parameters, I've temporarily set
> the interconnect latencies to 0 and modified the internal ports to accept
> scheduling calls at curTick, as these normally only allow for scheduled
> packets to be sent at the next cycle. This basically works by inserting the
> packet at the right position in the transmitList in the exact same way it
> normally happens, and then immediately calling sendEvent->process() if the
> packet got inserted at the front of the queue. This works well.
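>
> The modification itself is small; in essence (a sketch of the idea rather
> than the exact diff against SimpleTimingPort; InternalPort and
> insertSorted() are stand-ins):
>
>     // Like the normal scheduling path, but allowed at curTick: insert
>     // into transmitList in order as usual, and if the packet ended up at
>     // the head, fire the send event immediately instead of next cycle.
>     void InternalPort::schedSendTimingNow(PacketPtr pkt)
>     {
>         bool now_at_head = insertSorted(transmitList, pkt, curTick);
>         if (now_at_head)
>             sendEvent->process();
>     }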
>
> While digging through the codebase to find an explanation for some of the
> remaining timing differences I encountered, I found that the way a Cache's
> memside port sends packets to the bus, and how that interacts with the MSHR
> chain, poses a problem for the way I'd like my S-NUCA to work.
>
> It basically comes down to the fact that a regular Cache's memside port,
> when it successfully sends an MSHR request downstream, relies on the
> downstream device to already have processed the request and to have
> allocated an MSHR in case of a miss. This is supported by the bus, which
> basically takes the packet sent by the Cache's memside port and directly
> invokes sendTiming() on the downstream device. If the downstream device is a
> cache, this causes it to perform its whole timingAccess() call, which checks
> for a hit or a miss and allocates an MSHR. In other words, when the cache's
> memside port receives the return value "true" for its sendTiming call, it
> relies on the fact that at that time an MSHR must have already been
> allocated downstream if the request missed.
>
> From studying the MSHR code, I understand that this is done in order to
> maintain the downstreamPending flags across the MSHR chain. A cache has no
> way of knowing whether its downstream device is going to be another cache or
> main memory, so it also has no way of knowing whether the MSHR request will
> receive a response at this level (because it might miss in another
> downstream cache). I also understand that for this reason, MSHRs are passed
> down in senderState, and that upon allocation, the downstreamPending flag of
> the "parent" MSHR is set.
>
> In this way, the mere fact of an MSHR getting allocated in a downstream
> device will cause the downstreamPending flag on the current-level MSHR to be
> set. A regular Cache relies on this behaviour to determine whether it is
> going to receive a response to the MSHR request it just sent out at this
> level; it can simply check whether the MSHR's downstreamPending flag was
> set, because if the request missed at the downstream device, the downstream
> device must have been a cache which must have allocated an MSHR, which must
> in turn have caused the downstreamPending flag in *this* cache's MSHR to
> be set:
>
>
> from Cache::MemSidePort::sendPacket:
>
>     MSHR *mshr = dynamic_cast<MSHR*>(pkt->senderState);
>
>     // this assumes instant request processing by the peer
>     bool success = sendTiming(pkt);
>
>     waitingOnRetry = !success;
>     if (waitingOnRetry) {
>         DPRINTF(CachePort, "now waiting on a retry\n");
>         if (!mshr->isForwardNoResponse()) {
>             delete pkt;
>         }
>     } else {
>         // this assumes the mshr->downstreamPending flag to have been
>         // correctly set (or correctly remained untouched) by the
>         // downstream device at this point
>         myCache()->markInService(mshr, pkt);
>     }
>
>
> It's the markInService() call that will check whether the downstreamPending
> flag is set. If it isn't set, then no MSHR was allocated downstream,
> signifying that it will receive a response.
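>
> Paraphrased (not the literal source), the relevant part of markInService()
> amounts to:
>
>     // If nothing downstream is still working on this request, the
>     // response will be produced at the level we just sent to, so let all
>     // upstream MSHRs know their requests have been satisfied.
>     if (!downstreamPending) {
>         targets->clearDownstreamPending();
>     }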
>
> However, this only works if the call is indeed immediately processed by the
> downstream device to check for a hit or a miss. In order to avoid blocking
> the entire cache, my S-NUCA implementation might return success but still
> have the packet queued in an internal port, waiting for departure. The
> upper-level cache then checks its MSHR's downstreamPending flag, and may
> incorrectly conclude that the request won't miss, even though it still might
> when the packet is eventually sent from the internal port queue.
>
> So at this point, I'm a bit at a loss as to what my options are. I tried to
> find out what the downstreamPending flag is used for, to see if there's
> anything I can do about this problem, and I found the comments below, but I don't
> understand a word of what is going on here:
>
> from MSHR::handleSnoop:
>
>     if (!inService || (pkt->isExpressSnoop() && downstreamPending)) {
>         // Request has not been issued yet, or it's been issued
>         // locally but is buffered unissued at some downstream cache
>         // which is forwarding us this snoop.  Either way, the packet
>         // we're snooping logically precedes this MSHR's request, so
>         // the snoop has no impact on the MSHR, but must be processed
>         // in the standard way by the cache.  The only exception is
>         // that if we're an L2+ cache buffering an UpgradeReq from a
>         // higher-level cache, and the snoop is invalidating, then our
>         // buffered upgrades must be converted to read exclusives,
>         // since the upper-level cache no longer has a valid copy.
>         // That is, even though the upper-level cache got out on its
>         // local bus first, some other invalidating transaction
>         // reached the global bus before the upgrade did.
>         if (pkt->needsExclusive()) {
>             targets->replaceUpgrades();
>             deferredTargets->replaceUpgrades();
>         }
>
>         return false;
>     }
>
> This is obviously related to cache coherence and handling snoops at the
> upper-level cache, which I guess may occur at any time while a packet is
> pending in an internal port downstream in the S-NUCA, so I suspect things
> may get hairy moving forward.
>
> Another option I've considered is modifying the retry mechanism to no
> longer be an opaque "yes, keep sending me stuff"/"no I'm blocked, wait for
> my retry" but instead issue retries for particular address ranges. This
> would allow my S-NUCA to selectively issue retries to address ranges for
> which the internal bank is blocked, but then SimpleTimingPort would also
> have to be modified to not just push the failed packet back to the front of
> the list and wait for an opaque retry, but continue searching down the
> transmitList to find any ready packets to address ranges it hasn't been told
> are blocked yet. I think it's an interesting idea, but I imagine there's a
> whole slew of fairness issues with that approach.
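>
> Just to make the idea concrete, on the SimpleTimingPort side it might look
> something like this (entirely hypothetical; recvRangeRetry() and
> isBlockedRange() don't exist, they're just names for the interface I have
> in mind):
>
>     // On a per-range retry, don't just retry the head of transmitList:
>     // walk the list and send the first ready packet whose target range
>     // hasn't been reported as blocked.
>     void SimpleTimingPort::recvRangeRetry(const AddrRange &unblocked)
>     {
>         std::list<DeferredPacket>::iterator i = transmitList.begin();
>         for (; i != transmitList.end(); ++i) {
>             if (i->tick <= curTick && !isBlockedRange(i->pkt->getAddr())) {
>                 if (sendTiming(i->pkt))
>                     transmitList.erase(i);   // sent; drop it from the queue
>                 // else: peer rejected it after all; leave it queued and
>                 // wait for the next (range) retry
>                 return;
>             }
>         }
>     }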
>
> On a side note, I've extensively documented all the behaviour I discussed
> previously in code, and would be more than willing to contribute this back
> to the community. These timing issues turned out to be very important for my
> purposes, but were hidden away behind three levels of calls and an
> innocuous-looking, entirely uncommented if(!downstreamPending) check buried
> somewhere in MSHR::markInService, so some comments in there about all these
> underlying assumptions definitely wouldn't hurt.
>
> Cheers,
> -- Jeroen
>
