> On Aug. 30, 2015, 6:01 p.m., Joel Hestness wrote:
> > Hi guys,
> >
> > I'm not sure of the status of or plans for this patch, but I wanted to test it out and provide some feedback. I've merged and tested this with the current gem5 head (11061). First, there are a number of things broken with this patch, and if we're still interested in checking it in, it needs plenty of work. It took a fair amount of debugging before I was able to run with it.
> >
> > Second, after merging, I microbenchmarked a few common memory access patterns. The O3 CPU with Ruby certainly performs better than older versions of gem5 (I was testing changeset 10238). It appears that prior to this patch, the O3 CPU was modified to fix the memory access squashing caused by Ruby sequencer blocking (I'm not sure which changes fixed that), so the execution time of the microbenchmarks is now comparable between Ruby and classic without this patch.
> >
> > Further, I found this patch actually introduces many confusing issues and can reduce performance by up to 60%. It was very difficult to track down why performance suffered: by coalescing requests in the sequencer, the number of cache accesses changes, so the first problem was figuring out what an 'appropriate' change in the number of cache accesses might be. After coming to an ok conclusion on that, I then found that the sequencer max_outstanding_requests needs to be configured appropriately to manage sequencer coalescing well. Specifically, if max_outstanding_requests is less than the LSQ depth, the sequencer won't block the LSQ from issuing accesses to the same line, but will block when it is full. This reduces the MLP exposed to the caches compared to when the LSQ is blocked on outstanding lines and forced to expose accesses to separate lines. Setting max_outstanding_requests greater than the LSQ depth fixes this, but this performance bug indicates that the coalescing in the sequencer introduces more non-trivial configuration to get reasonable MLP.
> >
> > Overall, I feel that this patch needs to be dropped: it does not appear to be necessary for performance, and it actually introduces problems with performance and debugging due to the cache access effects.
>
> Brad Beckmann wrote:
> We still hope to check it in. The last time we ran our tests, this patch proved to be very important for O3 performance. Even if the squashing behavior is fixed, I think we saw benefits from avoiding unnecessary mandatory queue latency for aliased requests. We will rerun our tests to confirm.
>
> I'm surprised you encountered issues with the latest version of the patch. I personally put a lot of effort into adding the disable feature that you requested, and I tested the enable/disable bit with the CPU models and Ruby. It worked for me.
>
> The issue you point out with max_outstanding_requests seems to be a problem with its prior default value. Why do you think increasing the value is a "non-trivial configuration"? Shouldn't we have max_outstanding_requests equal to the LSQ size?
>
> Nilay Vaish wrote:
> I am actually in favor of the idea of coalescing requests. This is something that the classic memory system does as well. And I have read some papers on MSHRs maintaining primary and secondary requests. I would like that we ultimately merge this patch. I think we should split this patch into two parts:
>
> a. The first patch will combine the read and write request tables maintained by the sequencer. I think Joel has no objections to that part, so it should be possible to merge that patch any time.
>
> b. The second part will provide each entry in the combined request table with a list of packets instead of a single packet. This part of the patch is what we need to discuss.
> -- I think we should drop the max outstanding requests variable altogether. As both of you agree that the LSQ size should decide how many requests can be pending, there is no need for the sequencer to maintain its own limit.
>
> -- We also need to decide on how the statistical variables are updated. With this patch, the requests buffered in the sequencer will not update the stats maintained by the cache controller. This becomes a problem when one tries to compare the loads and stores issued by the CPU against the loads and stores seen by the cache controller.
>
> Brad Beckmann wrote:
> I appreciate the concerns with statistics. At one point, this patch actually included statistics for merged requests so that one could do the math to verify that no requests were missing. Those stats were removed when we dropped ruby.stats and they were never added back. Would you be ok with adding counts in the Sequencer/RubyPort rather than trying to update the controller stats?
>
> I will try to add those stats, but can we keep this patch as one rather than split it up? It has been a real pain to maintain and there are a lot of interdependent parts to it. Splitting it up would essentially require re-implementing it. We'll be lucky if we can find someone who has the time to add the stats as you requested.
>
> Joel Hestness wrote:
> From Brad: "I'm surprised you encountered issues with the latest version of the patch. I personally put a lot of effort into adding the disable feature that you requested and I tested the enable/disable bit with the CPU models and Ruby. It worked for me."
>
> Most problems with this patch as-is are merging issues with the current head, where a handful of sequencer things have changed (e.g. LLSC and L1/sequencer latency). Other problems include list iteration, dead variables that should be stripped out, and the coalesce disable feature. I could open these as issues for revision, but there are already a ton of open issues on this patch.
>
> From Brad: "The issue you point out with max_outstanding_requests seems to be a problem with its prior default value. Why do you think increasing the value is a 'non-trivial configuration'? Shouldn't we have max_outstanding_requests equal to the LSQ size?"
>
> The configuration is non-trivial in that there is nothing connecting these parameters (LSQ depth and sequencer max_outstanding_requests), and performance debugging this issue requires substantial understanding of both the O3 CPU and Ruby. It took me roughly 2 hours of digging to understand it, and I know both of these pieces of code.
>
> From Nilay: "I think we should drop the max outstanding requests variable altogether. As both of you agree the LSQ size should decide how many requests should be pending, there is no need for the sequencer to maintain its own limit."
>
> I disagree with removing max_outstanding_requests: currently there are a number of places in Ruby without flow control, so removing max_outstanding_requests could bloat various unbounded buffers and cause plenty of other problems.
>
> While Nilay accurately captures my feelings about the specific changes, my overall opinion is that this patch introduces too many hard questions and too much uncertainty about performance correctness due to the CPU<->sequencer interface change. I apologize for being blunt, but given that AMD may be too busy to add stats (i.e. just a few lines of code), I don't trust that this patch will be adequately performance validated before commit. Further, I don't feel it should be the community's responsibility to fix any issues that it may introduce.
>
> If we're going to decide to push these changes, can we at least figure out a way for the patch to be shepherded to commit (e.g.
> someone volunteer to pick up the ball and carry it through adequate revision and testing)?
>
> Nilay Vaish wrote:
> I have posted a set of five patches that break up this patch. The last patch is different from what AMD is doing in this patch. Instead of fulfilling all the secondary requests in one go, the first secondary request (which then becomes the new primary request) is issued to the cache controller after the original primary request has completed.
>
> On max_outstanding_requests: I think we should not do flow control in the Sequencer. Instead, we should abort() when the number of requests in the sequencer exceeds max_outstanding_requests. This would force protocols / networks / testers / cores to do flow control.
>
> Brad Beckmann wrote:
> Joel, I thought we took care of the prior open issues with this patch. I understand your concern with merging a patch that touches a lot of performance-critical code. We've had too many of those types of merges as well. I feel like I've personally put a lot of effort into trying to address your concerns. So far, once I address your concerns, you seem to just have more concerns. Please list out your current "hard questions" and "open issues". I will try again to address them as much as I can. I just ask you not to add more items to the list, and please keep in mind that we have very specific, cycle-by-cycle tests that compare the Ruby+O3 interface to real hardware and strongly justify this patch.
>
> Nilay, you mention 5 patches, but I only noticed 3 patches that seem to be part of this patch: http://reviews.gem5.org/r/3096/, http://reviews.gem5.org/r/3098/, http://reviews.gem5.org/r/3099/. What are the other 2 patches? Do they implement the disable coalescing feature?
>
> I believe that an important part of this patch is ensuring that merged requests are completed in back-to-back cycles. If we require that all requests go across the mandatory queue to the L1 controller, I believe that adds unnecessary delay. We will rerun our tests to confirm.
>
> We absolutely should not force protocols to implement flow control. In particular, the tester is designed to stress the system in ways that are not practical with realistic flow control. Flow control is, and should be, an option for those who want more accuracy.
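The coalescing scheme being debated above (a single request table whose entries hold a list of packets, with the first request to a line acting as the primary sent to the L1 controller and later aliased requests held as secondaries) can be sketched as a toy model. This is purely illustrative, not gem5 code; all names are hypothetical, and the max_outstanding_requests back-pressure is modeled as a simple 'retry' return rather than a pipeline squash:

```python
# Toy model of sequencer request coalescing: one request table keyed by
# cache-line address. The first request to a line is the "primary" and
# is issued to the (modeled) L1 controller; later aliased requests are
# coalesced behind it as "secondaries" and complete with it.

LINE_SIZE = 64

def line_addr(addr):
    # Mask off the byte offset to get the cache-line address.
    return addr & ~(LINE_SIZE - 1)

class ToySequencer:
    def __init__(self, max_outstanding_requests=16):
        self.max_outstanding = max_outstanding_requests
        self.table = {}     # line address -> list of pending request addrs
        self.issued = []    # primaries sent to the modeled L1 controller

    def make_request(self, addr):
        """Return 'issued' for a new primary, 'coalesced' for a secondary
        to an already-outstanding line, or 'retry' when full."""
        total = sum(len(reqs) for reqs in self.table.values())
        if total >= self.max_outstanding:
            return 'retry'           # back-pressure instead of a nack/squash
        line = line_addr(addr)
        if line in self.table:
            self.table[line].append(addr)
            return 'coalesced'       # aliased request: no nack, no squash
        self.table[line] = [addr]
        self.issued.append(line)     # only the primary reaches the L1
        return 'issued'

    def complete(self, line):
        """L1 response: all coalesced requests to the line finish together."""
        return self.table.pop(line, [])

seq = ToySequencer(max_outstanding_requests=4)
print(seq.make_request(0x100))   # new line: primary
print(seq.make_request(0x104))   # same 64B line: coalesced, not nacked
print(seq.make_request(0x200))   # different line: primary
print(len(seq.complete(0x100)))  # both requests to line 0x100 complete
```

Note how this model captures both sides of the argument: aliased requests are no longer nacked (so no squashes), but once the table holds max_outstanding_requests entries the sequencer stalls regardless of how many distinct lines the LSQ could otherwise expose.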
> Nilay, you mention 5 patches, but I only noticed 3 patches that seem to be part of this patch: http://reviews.gem5.org/r/3096/, http://reviews.gem5.org/r/3098/, http://reviews.gem5.org/r/3099/. What are the other 2 patches? Do they implement the disable coalescing feature?

394 and 3095. 3099 implements the disable feature. But now I think we should not have the disable feature at all, since the controller is going to receive all the requests.

> I believe that an important part of this patch is ensuring that merged requests are completed in back-to-back cycles. If we require that all requests must go across the mandatory queue to the L1 controller, I believe that adds unnecessary delay. We will rerun our tests to confirm.

The main problem, as I understand it, is that aliased requests were nacked at the sequencer. That would not happen any longer, so you should not see pipeline squashes. I do not think we can assume that the L1 controller would be ok with not seeing some of the requests. So I am not in favor of fulfilling multiple requests in one go from the sequencer. I would be fine with fulfilling multiple requests together in the L1 controller.

- Nilay


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://reviews.gem5.org/r/2787/#review7077
-----------------------------------------------------------


On July 22, 2015, 6:15 p.m., Tony Gutierrez wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> http://reviews.gem5.org/r/2787/
> -----------------------------------------------------------
>
> (Updated July 22, 2015, 6:15 p.m.)
>
>
> Review request for Default.
>
>
> Repository: gem5
>
>
> Description
> -------
>
> Changeset 10887:1e05089bc991
> ---------------------------
> ruby: Fixed pipeline squashes caused by aliased requests
>
> This patch was created by Bihn Pham during his internship at AMD.
>
> This patch fixes a very significant performance bug when using the O3 CPU model and Ruby. The issue was that Ruby returned false when it received a request to an address that already had an outstanding request, or when the memory was blocked. As a result, O3 unnecessarily squashed the pipeline and re-executed instructions. This fix merges readRequestTable and writeRequestTable in the Sequencer into a single request table that keeps track of all requests and allows multiple outstanding requests to the same address. This prevents O3 from squashing the pipeline.
>
>
> Diffs
> -----
>
> src/mem/ruby/system/Sequencer.hh ebb3d0737aa72ec4fa24b6af9cf9a6b2a1109d18
> src/mem/ruby/system/Sequencer.cc ebb3d0737aa72ec4fa24b6af9cf9a6b2a1109d18
> src/mem/ruby/system/Sequencer.py ebb3d0737aa72ec4fa24b6af9cf9a6b2a1109d18
>
> Diff: http://reviews.gem5.org/r/2787/diff/
>
>
> Testing
> -------
>
>
> Thanks,
>
> Tony Gutierrez
>

_______________________________________________
gem5-dev mailing list
[email protected]
http://m5sim.org/mailman/listinfo/gem5-dev
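One practical takeaway from Joel's performance debugging above: with coalescing in the sequencer, max_outstanding_requests should be at least as large as the LSQ depth, or the sequencer fills before the LSQ can expose accesses to independent lines and MLP drops. In a config script this coupling can be made explicit with a small guard. The helper below is a hypothetical sketch (the parameter names LQEntries, SQEntries, and max_outstanding_requests are the gem5 O3 CPU and RubySequencer parameters of that era, but this function itself is not part of gem5):

```python
def size_sequencer(lq_entries, sq_entries, current_max_outstanding):
    """Pick a sequencer max_outstanding_requests value that will not
    throttle the LSQ: at least the total load+store queue depth."""
    lsq_depth = lq_entries + sq_entries
    return max(current_max_outstanding, lsq_depth)

# With common O3 defaults of 32 load and 32 store queue entries and a
# sequencer default of 16 outstanding requests, the LSQ depth wins:
print(size_sequencer(32, 32, 16))   # 64
```

In an actual configuration this would be applied before instantiation, roughly as `seq.max_outstanding_requests = size_sequencer(cpu.LQEntries, cpu.SQEntries, seq.max_outstanding_requests)`, so the two parameters can never silently drift apart.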
