> On May 29, 2015, 3:01 p.m., Joel Hestness wrote:
> > I was buffering these comments, because this change has large implications 
> > and we still haven't discussed the differences between Ruby and classic 
> > memory interfaces. The patch submitter should be able to address questions 
> > like the ones I've raised so far (I don't feel any clarification has been 
> > provided yet).
> > 
> > Once again, I did more digging: It appears that classic caches completely 
> > model MSHR queuing and write buffering, and they block requests from the 
> > core when these are full. This indicates that they accept multiple 
> > concurrent accesses to a single line (it would be nice to get confirmation 
> > on that from a classic user). Thus, it makes sense to allow the Ruby 
> > sequencer to accept multiple outstanding accesses to a single line 
> > concurrently to be consistent with classic.
> > 
> > Overall, I have significant complaints about this patch, because it 
> > introduces buffering into the sequencer, which has the effect of 
> > implementing MSHR queuing that should probably be in the caches to be 
> > consistent with classic. This breaks the sequencer's thin shim abstraction 
> > between cores and cache controllers, appears to break no-RFO L1 caches, and 
> > side-steps cache resource constraints. Detailed comments are below.
> 
> Steve Reinhardt wrote:
>     To follow up on Joel's high-level question: the classic cache model does 
> indeed combine requests to the same cache block using a finite set of MSHRs, 
> and blocks when the MSHRs are full.  Note that this is useful at all levels 
> of the hierarchy, e.g., in a shared L2 you can end up combining requests from 
> different L1s that happen to target the same cache block.  The general rule 
> in the classic hierarchy is that a single cache will have at most one 
> outstanding request for a given cache block at any point in time.  When a 
> response arrives, the cache will satisfy as many of the buffered requests for 
> the block as it can.  Note that the timing is sketchy here on the classic 
> side too; the latency of handling these (potentially multiple) requests is 
> not modeled.  It's not clear how large an effect that is.
>     
>     Note that the set of requests to a given cache block are processed in 
> order, both in the classic cache and (I believe) in this patch as well.  
> (Binh wrote this code originally, but I spent a fair amount of time cleaning 
> it up before it was posted the first time, though that was long enough ago I 
> don't remember all the details off the top of my head.)  If stronger ordering 
> guarantees are required by the consistency model, it's the responsibility of 
> the CPU to hold off on issuing requests to the cache appropriately (as Joel 
> saw with the O3 CPU).

Thanks for the confirmation on that, Steve.

I've thought more about this patch, and while I still take issue with adding 
buffering in the Sequencer, I think it is headed in the right direction if we 
can make it more flexible for protocol authors. 
Specifically, I feel that we should move the aliased access buffering out to a 
separate Ruby component, and slim the sequencer to pass all requests directly 
through to the top-level cache controller. This would allow a protocol author 
the following options:
 A) Use the new buffering component to block accesses like the current mainline 
sequencer
 B) Use the new buffering component to buffer them like this patch (but fix 
incorrect assumptions about line ownership)
 C) Remove the new component and send all accesses straight through to the 
cache controller, where the protocol author can do MSHR 
handling/queuing/coalescing as desired
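To make the three options concrete, here is a minimal, self-contained sketch of what such a component could look like. None of these names (AliasPolicy, AliasedRequestBuffer, makeRequest, complete) exist in gem5; this is only an illustration of the three policies above, assuming 64-byte cache lines:

```cpp
#include <cstdint>
#include <deque>
#include <unordered_map>

// Hypothetical sketch only -- these names do not exist in gem5; they
// illustrate the proposed separate aliased-access buffering component.
using Addr = uint64_t;
static const Addr kLineSize = 64;

enum class AliasPolicy {
    Block,       // option A: reject aliased requests (current mainline)
    Buffer,      // option B: queue aliased requests (this patch's behavior)
    PassThrough  // option C: forward everything to the cache controller
};

struct AliasedRequestBuffer {
    AliasPolicy policy;
    // Outstanding requests, grouped by cache-line address (int = request id).
    std::unordered_map<Addr, std::deque<int>> table;

    explicit AliasedRequestBuffer(AliasPolicy p) : policy(p) {}

    static Addr lineOf(Addr a) { return a & ~(kLineSize - 1); }

    // Returns true if the request was accepted (issued or buffered),
    // false if the core must retry (a pipeline squash in O3's case).
    bool makeRequest(int id, Addr addr) {
        if (policy == AliasPolicy::PassThrough)
            return true;  // the cache controller does MSHR queuing itself
        auto &q = table[lineOf(addr)];
        if (!q.empty() && policy == AliasPolicy::Block)
            return false;  // alias to an outstanding line: block the core
        q.push_back(id);   // first request to the line, or Buffer policy
        return true;
    }

    // On a response, retire the oldest request for the line; aliased
    // requests buffered behind it stay queued for later service.
    void complete(Addr addr) {
        auto it = table.find(lineOf(addr));
        if (it == table.end())
            return;
        it->second.pop_front();
        if (it->second.empty())
            table.erase(it);
    }
};
```

Option A reproduces the mainline sequencer's blocking, option B this patch's buffering, and option C leaves all MSHR handling to the protocol author's cache controller.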

A major difference between the classic memory hierarchy and Ruby is that Ruby 
supports writing new coherence protocols rather than shipping a single static 
protocol. To ease protocol development, Ruby (and SLICC) provides significant 
and useful abstractions that reduce the effort a protocol author needs to get 
an apparently-working coherence protocol. Examples include disallowing multiple 
in-flight requests to a single cache line in the L1 caches (which defers the 
need to reason about MSHR queuing in the L1s) and providing an optional, 
functionally coherent backing store (which defers the need to make protocols 
'data-correct'). We need the mainline codebase to uphold and improve these 
sorts of simplifying Ruby abstractions.

By adding a new component that parameterizes how aliased accesses are buffered, 
we would again be aiding the protocol author by providing a path toward a 
working, high-performance protocol. By blocking aliased requests, the author 
can set up such buffering in the requesting core and keep the L1 cache 
controller simple as a first cut when writing a protocol. By buffering them as 
this patch does, the author might get the performance of relaxed consistency 
without a complicated L1 cache (i.e. via fake, unconstrained buffering). 
However, an author should also be allowed to 'take the training wheels off' and 
handle multiple outstanding requests per line in the L1 controller. Ruby 
currently does not allow this kind of aggressive protocol development at all.

I have realized that my gem5-gpu patches are, in effect, already allowing 
multiple outstanding requests per line to the VI_hammer GPU L1s, but they do so 
in hacky ways that circumvent the sequencer callbacks. If I could send requests 
to the same line through the sequencer, I would fix the VI_hammer GPU L1 to do 
real MSHR queuing, since we know that is how NVIDIA parts work. Mainline gem5 
needs to offer this to protocol authors.

Finally, I think it makes the most sense for this fake buffering to live in a 
separate component: aside from serving as 'training wheels' for a protocol 
author, a separate buffering component enables flexible comparisons between 
hierarchies. Currently, there is no good way to compare the performance of a 
system using Ruby against one using the classic memory hierarchy, because the 
two lack a common interface. For example, it would be interesting to know 
whether the classic (MESI) caches perform similarly to MESI_Two_Level, but 
MESI_Two_Level assumes a single outstanding L1 access per line, so it would 
probably perform terribly for single-element-strided heap accesses from the 
inorder or O3 CPUs. With a separate buffering component that can be used with 
any core and any cache hierarchy, we could enforce the same interface on both 
hierarchies and compare them reasonably.

Would this new buffering component (or at least sequencer buffer 
parameterization) be a reasonable way to proceed with this patch? If so, I'll 
offer to help make it happen.


- Joel


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://reviews.gem5.org/r/2787/#review6383
-----------------------------------------------------------


On May 26, 2015, 7:42 p.m., Tony Gutierrez wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> http://reviews.gem5.org/r/2787/
> -----------------------------------------------------------
> 
> (Updated May 26, 2015, 7:42 p.m.)
> 
> 
> Review request for Default.
> 
> 
> Repository: gem5
> 
> 
> Description
> -------
> 
> Changeset 10817:d010e6a8e783
> ---------------------------
> ruby: Fixed pipeline squashes caused by aliased requests
> 
> This patch was created by Binh Pham during his internship at AMD.
> 
> This patch fixes a very significant performance bug when using the O3 CPU
> model and Ruby. The issue was that Ruby returned false when it received a
> request to an address that already had an outstanding request, or when memory
> was blocked. As a result, O3 unnecessarily squashed the pipeline and
> re-executed instructions. This fix merges readRequestTable and
> writeRequestTable in the Sequencer into a single request table that tracks
> all requests and allows multiple outstanding requests to the same address,
> which prevents O3 from squashing the pipeline.
> 
> 
> Diffs
> -----
> 
>   src/mem/ruby/system/Sequencer.hh df2aa91dba5b0f0baa351039f0802baad9ed8f1d 
>   src/mem/ruby/system/Sequencer.cc df2aa91dba5b0f0baa351039f0802baad9ed8f1d 
> 
> Diff: http://reviews.gem5.org/r/2787/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Tony Gutierrez
> 
>

_______________________________________________
gem5-dev mailing list
[email protected]
http://m5sim.org/mailman/listinfo/gem5-dev
