> On May 29, 2015, 3:01 p.m., Joel Hestness wrote:
> > I was buffering these comments, because this change has large implications and we still haven't discussed the differences between Ruby and classic memory interfaces. The patch submitter should be able to address questions like the ones I've raised so far (I don't feel any clarification has been provided yet).
> >
> > Once again, I did more digging: it appears that classic caches completely model MSHR queuing and write buffering, and they block requests from the core when these are full. This indicates that they accept multiple concurrent accesses to a single line (it would be nice to get confirmation of that from a classic user). Thus, it makes sense to allow the Ruby sequencer to accept multiple outstanding accesses to a single line concurrently, to be consistent with classic.
> >
> > Overall, I have significant complaints about this patch, because it introduces buffering into the sequencer, which has the effect of implementing MSHR queuing that should probably be in the caches to be consistent with classic. This breaks the sequencer's thin shim abstraction between cores and cache controllers, appears to break no-RFO L1 caches, and side-steps cache resource constraints. Detailed comments are below.
>
> Steve Reinhardt wrote:
> To follow up on Joel's high-level question: the classic cache model does indeed combine requests to the same cache block using a finite set of MSHRs, and blocks when the MSHRs are full. Note that this is useful at all levels of the hierarchy; e.g., in a shared L2 you can end up combining requests from different L1s that happen to target the same cache block. The general rule in the classic hierarchy is that a single cache will have at most one outstanding request for a given cache block at any point in time. When a response arrives, the cache will satisfy as many of the buffered requests for the block as it can.
> Note that the timing is sketchy here on the classic side too; the latency of handling these (potentially multiple) requests is not modeled. It's not clear how large an effect that is.
>
> Note that requests to a given cache block are processed in order, both in the classic cache and (I believe) in this patch as well. (Binh wrote this code originally, but I spent a fair amount of time cleaning it up before it was posted the first time, though that was long enough ago that I don't remember all the details off the top of my head.) If stronger ordering guarantees are required by the consistency model, it's the responsibility of the CPU to hold off on issuing requests to the cache appropriately (as Joel saw with the O3 CPU).
Thanks for the confirmation on that, Steve.

I've thought more about this patch, and while I take issue with adding buffering in the Sequencer, I think this patch is headed in an appropriate direction if we figure out how to make it more flexible for protocol authors. Specifically, I feel that we should move the aliased-access buffering out to a separate Ruby component, and slim the sequencer to pass all requests directly through to the top-level cache controller. This would give a protocol author the following options:

A) Use the new buffering component to block aliased accesses, like the current mainline sequencer
B) Use the new buffering component to buffer them, like this patch (but fix the incorrect assumptions about line ownership)
C) Remove the new component and send all accesses straight through to the cache controller, where the protocol author can do MSHR handling/queuing/coalescing as desired

A major difference between the classic memory hierarchy and Ruby is that Ruby supports writing new coherence protocols rather than having a static protocol. To ease writing new coherence protocols, Ruby (and SLICC) provides significant and useful abstractions that reduce the effort a protocol author needs to get an apparently-working coherence protocol. Examples include disallowing multiple in-flight requests to one cache line in the L1 caches (which eliminates the need to reason about MSHR queuing in L1s immediately), and providing an (optional) functionally coherent backing store (which eliminates the need to make protocols 'data-correct' immediately). We need the mainline codebase to uphold and improve these sorts of simplifying Ruby abstractions. By adding a new component that parameterizes how to buffer aliased accesses, we would - again - be aiding the protocol author by introducing a path toward a working, high-performance protocol.
By blocking aliased requests, the author can choose to set up such buffering in the requesting core and keep the L1 cache controller simple as a first cut when writing a protocol. By using buffering like this patch, the author might be able to get the performance of relaxed consistency without a complicated L1 cache (i.e., by allowing fake, unconstrained buffering). However, an author should also be allowed to 'take the training wheels off' and handle multiple outstanding requests per line in the L1 controller. This aggressive protocol development is currently not even possible in Ruby.

I have realized that my gem5-gpu patches are, in effect, allowing multiple outstanding requests per line to the VI_hammer GPU L1s, but doing so in hacky ways that circumvent the sequencer callbacks. If I had the ability to send requests to the same line through the sequencer, I would fix the VI_hammer GPU L1 to do real MSHR queuing, since we know that is how NVIDIA parts work. Mainline gem5 needs to offer this to protocol authors.

Finally, I think it makes the most sense for this fake buffering to be in a separate component: aside from serving as 'training wheels' for a protocol author, buffering in a separate component enables flexible comparisons between hierarchies. Currently, there is no good way to compare the performance of a system using Ruby against the classic memory hierarchy, because they lack a common interface. For example, it would be interesting to know whether classic (MESI) performs similarly to MESI_Two_Level, but MESI_Two_Level assumes a single outstanding L1 access per line. However, MESI_Two_Level would probably perform terribly for single-element-strided heap accesses from the inorder or O3 CPUs. By having a separate buffering component that can be used with any core and any cache hierarchy, we can enforce the same interfaces on both hierarchies to reasonably compare them.
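As a rough illustration of the three options above, the proposed component could be parameterized by a policy: block aliased requests (A), buffer them and issue one per line at a time (B), or pass everything through to the controller (C). This is a hypothetical sketch under invented names (`AliasBuffer`, `AliasPolicy`); nothing here is an existing gem5/Ruby interface.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <unordered_map>

using Addr = uint64_t;

// Hypothetical policy knob for the proposed buffering component.
enum class AliasPolicy {
    Block,       // option A: reject aliased requests, like today's sequencer
    Buffer,      // option B: queue aliased requests, like this patch
    PassThrough  // option C: forward everything; controller does MSHR queuing
};

struct AliasBuffer {
    AliasPolicy policy;
    std::unordered_map<Addr, std::deque<int>> outstanding;  // line -> request ids

    explicit AliasBuffer(AliasPolicy p) : policy(p) {}

    // Returns true if the request proceeds (issued or queued), false if the
    // core must retry, analogous to the sequencer's "aliased" rejection.
    bool makeRequest(Addr lineAddr, int reqId) {
        auto it = outstanding.find(lineAddr);
        bool aliased = (it != outstanding.end() && !it->second.empty());
        if (aliased && policy == AliasPolicy::Block)
            return false;              // A: requester must back off and retry
        outstanding[lineAddr].push_back(reqId);
        // Under Buffer, only the head request per line is issued and the
        // rest wait here; under PassThrough the controller would see every
        // request immediately and do its own MSHR handling.
        return true;
    }

    // Completion of the head request for a line; returns the next queued
    // request id to issue, or -1 if none remain.
    int complete(Addr lineAddr) {
        auto &q = outstanding[lineAddr];
        q.pop_front();
        if (q.empty()) {
            outstanding.erase(lineAddr);
            return -1;
        }
        return q.front();
    }
};
```

With such a knob, the same core and cache configuration could be run under all three policies, which is also what would make the Ruby-vs-classic comparisons described above possible.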
Would this new buffering component (or at least sequencer buffer parameterization) be a reasonable way to proceed with this patch? If so, I'll offer to help make it happen.

- Joel


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://reviews.gem5.org/r/2787/#review6383
-----------------------------------------------------------


On May 26, 2015, 7:42 p.m., Tony Gutierrez wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> http://reviews.gem5.org/r/2787/
> -----------------------------------------------------------
>
> (Updated May 26, 2015, 7:42 p.m.)
>
> Review request for Default.
>
> Repository: gem5
>
> Description
> -------
>
> Changeset 10817:d010e6a8e783
> ---------------------------
> ruby: Fixed pipeline squashes caused by aliased requests
>
> This patch was created by Binh Pham during his internship at AMD.
>
> This patch fixes a very significant performance bug when using the O3 CPU model and Ruby. The issue was that Ruby returned false when it received a request to an address that already had an outstanding request, or when the memory was blocked. As a result, O3 unnecessarily squashed the pipeline and re-executed instructions. This fix merges readRequestTable and writeRequestTable in the Sequencer into a single request table that keeps track of all requests and allows multiple outstanding requests to the same address. This prevents O3 from squashing the pipeline.
>
> Diffs
> -----
>
> src/mem/ruby/system/Sequencer.hh df2aa91dba5b0f0baa351039f0802baad9ed8f1d
> src/mem/ruby/system/Sequencer.cc df2aa91dba5b0f0baa351039f0802baad9ed8f1d
>
> Diff: http://reviews.gem5.org/r/2787/diff/
>
> Testing
> -------
>
> Thanks,
>
> Tony Gutierrez
>

_______________________________________________
gem5-dev mailing list
[email protected]
http://m5sim.org/mailman/listinfo/gem5-dev
