> On Aug. 30, 2015, 6:01 p.m., Joel Hestness wrote:
> > Hi guys,
> >
> > I'm not sure of the status of or plans for this patch, but I wanted to test it out and provide some feedback. I've merged and tested this with the current gem5 head (11061). First, there are a number of things broken with this patch, and if we're still interested in checking it in, it needs plenty of work. It took a fair amount of debugging before I was able to run with it.
> >
> > Second, after merging, I microbenchmarked a few common memory access patterns. The O3 CPU with Ruby certainly performs better than older versions of gem5 (I was testing changeset 10238). It appears that prior to this patch, the O3 CPU was modified to fix the memory access squashing caused by Ruby sequencer blocking (I'm not sure which changes fixed that), so the execution time of the microbenchmarks is now comparable between Ruby and classic without this patch.
> >
> > Further, I found this patch actually introduces many confusing issues and can reduce performance by up to 60%. It was very difficult to track down why performance suffered: by coalescing requests in the sequencer, the number of cache accesses changes, so the first problem was figuring out what an 'appropriate' change in the number of cache accesses might be. After coming to an ok conclusion on that, I then found that the sequencer max_outstanding_requests needs to be configured appropriately to manage sequencer coalescing well. Specifically, if max_outstanding_requests is less than the LSQ depth, the sequencer won't block the LSQ from issuing accesses to the same line, but will block when it is full. This reduces the MLP exposed to the caches compared to when the LSQ is blocked on outstanding lines and forced to expose accesses to separate lines. Setting max_outstanding_requests greater than the LSQ depth fixes this, but this performance bug indicates that the coalescing in the sequencer introduces more non-trivial configuration to get reasonable MLP.
> >
> > Overall, I feel that this patch needs to be dropped: it does not appear to be necessary for performance, and it actually introduces problems with performance and debugging due to the cache access effects.
>
> Brad Beckmann wrote:
> We still hope to check it in. The last time we ran our tests, this patch proved to be very important for O3 performance. Even if the squashing behavior is fixed, I think we saw benefits from avoiding unnecessary mandatory queue latency for aliased requests. We will rerun our tests to confirm.
>
> I'm surprised you encountered issues with the latest version of the patch. I personally put a lot of effort into adding the disable feature that you requested, and I tested the enable/disable bit with the CPU models and Ruby. It worked for me.
>
> The issue you point out with max_outstanding_requests seems to be a problem with its prior default value. Why do you think increasing the value is a "non-trivial configuration"? Shouldn't we have max_outstanding_requests equal to the LSQ size?
>
> Nilay Vaish wrote:
> I am actually in favor of the idea of coalescing requests. This is something that the classic memory system does as well. And I have read some papers on MSHRs maintaining primary and secondary requests. I would like that we ultimately merge this patch. I think we should split this patch into two parts:
>
> a. The first patch will combine the read and write request tables maintained by the sequencer. I think Joel has no objections to that part, so it should be possible to merge that patch any time.
>
> b. The second part will provide each entry in the combined request table with a list of packets instead of a single packet. This part of the patch is what we need to discuss.
> -- I think we should drop the max outstanding requests variable altogether. As both of you agree that the LSQ size should decide how many requests can be pending, there is no need for the sequencer to maintain its own limit.
>
> -- We also need to decide on how the statistical variables are updated. With this patch, the requests buffered in the sequencer will not update the stats maintained by the cache controller. This becomes a problem when one tries to compare the loads and stores issued by the CPU against the loads and stores seen by the cache controller.
>
> Brad Beckmann wrote:
> I appreciate the concerns with statistics. At one point, this patch actually included statistics for merged requests so that one could do the math to verify that no requests were missing. Those stats were removed when we dropped ruby.stats and they were never added back. Would you be ok with adding counts in the Sequencer/RubyPort rather than trying to update the controller stats?
>
> I will try to add those stats, but can we keep this patch as one rather than split it up? It has been a real pain to maintain and there are a lot of interdependent parts to it. Splitting it up would essentially require re-implementing it. We'll be lucky if we can find someone who has the time to add the stats as you requested.
>
> Joel Hestness wrote:
> From Brad: "I'm surprised you encountered issues with the latest version of the patch. I personally put a lot of effort into adding the disable feature that you requested and I tested the enable/disable bit with the CPU models and Ruby. It worked for me."
>
> Most problems with this patch as-is are merging issues with the current head, where a handful of sequencer things have changed (e.g. LLSC and L1/sequencer latency). Other problems include list iteration, dead variables that should be stripped out, and the coalesce disable feature. I could open these as issues for revision, but there are already a ton of open issues on this patch.
>
> From Brad: "The issue you point out with max_outstanding_requests seems to be a problem with its prior default value. Why do you think increasing the value is a 'non-trivial configuration'? Shouldn't we have max_outstanding_requests equal to the LSQ size?"
>
> The configuration is non-trivial in that there is nothing connecting these parameters (LSQ depth and sequencer max_outstanding_requests), and performance debugging this issue requires substantial understanding of both the O3 CPU and Ruby. It took me roughly 2 hours of digging to understand it, and I know both of these pieces of code.
>
> From Nilay: "I think we should drop the max outstanding requests variable altogether. As both of you agree the LSQ size should decide how many requests should be pending, there is no need for the sequencer to maintain its own limit."
>
> I disagree with removing max_outstanding_requests: currently there are a number of places in Ruby without flow control, so removing max_outstanding_requests could bloat various unbounded buffers and cause plenty of other problems.
>
> While Nilay accurately captures my feelings about the specific changes, my overall opinion is that this patch introduces too many hard questions and too much uncertainty about performance correctness due to the CPU<->sequencer interface change. I apologize for being blunt, but given that AMD may be too busy to add stats (i.e. just a few lines of code), I don't trust that this patch will be adequately performance validated before commit. Further, I don't feel it should be the community's responsibility to fix any issues that it may introduce.
>
> If we're going to decide to push these changes, can we at least figure out a way for the patch to be shepherded to commit (e.g.
> someone volunteer to pick up the ball and carry it through adequate revision and testing)?
>
> Nilay Vaish wrote:
> I have posted a set of five patches that break up this patch. The last patch is different from what AMD is doing in this patch. Instead of fulfilling all the secondary requests in one go, the first secondary request (which then becomes the new primary request) is issued to the cache controller after the original primary request has completed.
>
> On max_outstanding_requests: I think we should not do flow control in the Sequencer. Instead, we should abort() when the number of requests in the sequencer exceeds max_outstanding_requests. This would force protocols / networks / testers / cores to do flow control.
>
> Brad Beckmann wrote:
> Joel, I thought we took care of the prior open issues with this patch. I understand your concern with merging a patch that touches a lot of performance-critical code. We've had too many of those types of merges as well. I feel like I've personally put a lot of effort into trying to address your concerns. So far, once I address your concerns, you seem to just have more concerns. Please list out your current "hard questions" and "open issues". I will try again to address them as much as I can. I just ask you not to add more items to the list, and please keep in mind that we have very specific, cycle-by-cycle tests that compare the Ruby+O3 interface to real hardware and strongly justify this patch.
>
> Nilay, you mention 5 patches, but I only noticed 3 patches that seem to be part of this patch: http://reviews.gem5.org/r/3096/, http://reviews.gem5.org/r/3098/, http://reviews.gem5.org/r/3099/. What are the other 2 patches? Do they implement the disable coalescing feature?
>
> I believe that an important part of this patch is ensuring that merged requests are completed in back-to-back cycles. If we require that all requests go across the mandatory queue to the L1 controller, I believe that adds unnecessary delay. We will rerun our tests to confirm.
>
> We absolutely should not force protocols to implement flow control. In particular, the tester is designed to stress the system in ways that are not practical with realistic flow control. Flow control is, and should be, an option for those who want more accuracy.
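The coalescing scheme being debated above (a single request table whose entries hold a list of packets, with the first request to a line acting as the primary sent to the L1 controller and later aliased requests held as secondaries) can be sketched as a toy model. This is purely illustrative, not gem5 code; all names are hypothetical, and the max_outstanding_requests back-pressure is modeled as a simple 'retry' return rather than a pipeline squash:

```python
# Toy model of sequencer request coalescing: one request table keyed by
# cache-line address. The first request to a line is the "primary" and
# is issued to the (modeled) L1 controller; later aliased requests are
# coalesced behind it as "secondaries" and complete with it.

LINE_SIZE = 64

def line_addr(addr):
    # Mask off the byte offset to get the cache-line address.
    return addr & ~(LINE_SIZE - 1)

class ToySequencer:
    def __init__(self, max_outstanding_requests=16):
        self.max_outstanding = max_outstanding_requests
        self.table = {}     # line address -> list of pending request addrs
        self.issued = []    # primaries sent to the modeled L1 controller

    def make_request(self, addr):
        """Return 'issued' for a new primary, 'coalesced' for a secondary
        to an already-outstanding line, or 'retry' when full."""
        total = sum(len(reqs) for reqs in self.table.values())
        if total >= self.max_outstanding:
            return 'retry'           # back-pressure instead of a nack/squash
        line = line_addr(addr)
        if line in self.table:
            self.table[line].append(addr)
            return 'coalesced'       # aliased request: no nack, no squash
        self.table[line] = [addr]
        self.issued.append(line)     # only the primary reaches the L1
        return 'issued'

    def complete(self, line):
        """L1 response: all coalesced requests to the line finish together."""
        return self.table.pop(line, [])

seq = ToySequencer(max_outstanding_requests=4)
print(seq.make_request(0x100))   # new line: primary
print(seq.make_request(0x104))   # same 64B line: coalesced, not nacked
print(seq.make_request(0x200))   # different line: primary
print(len(seq.complete(0x100)))  # both requests to line 0x100 complete
```

Note how this model captures both sides of the argument: aliased requests are no longer nacked (so no squashes), but once the table holds max_outstanding_requests entries the sequencer stalls regardless of how many distinct lines the LSQ could otherwise expose.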
> Nilay, you mention 5 patches, but I only noticed 3 patches that seem to be part of this patch: http://reviews.gem5.org/r/3096/, http://reviews.gem5.org/r/3098/, http://reviews.gem5.org/r/3099/. What are the other 2 patches? Do they implement the disable coalescing feature?

394 and 3095. 3099 implements the disable feature. But now I think we should not have the disable feature at all, since the controller is going to receive all the requests.

> I believe that an important part of this patch is ensuring that merged requests are completed in back-to-back cycles. If we require that all requests must go across the mandatory queue to the L1 controller, I believe that adds unnecessary delay. We will rerun our tests to confirm.

The main problem, as I understand it, is that aliased requests were nacked at the sequencer. That would not happen any longer, so you should not see pipeline squashes. I do not think we can assume that the L1 controller would be ok with not seeing some of the requests. So I am not in favor of fulfilling multiple requests in one go from the sequencer. I would be fine with fulfilling multiple requests together in the L1 controller.

- Nilay


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://reviews.gem5.org/r/2787/#review7077
-----------------------------------------------------------


On July 22, 2015, 6:15 p.m., Tony Gutierrez wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> http://reviews.gem5.org/r/2787/
> -----------------------------------------------------------
>
> (Updated July 22, 2015, 6:15 p.m.)
>
>
> Review request for Default.
>
>
> Repository: gem5
>
>
> Description
> -------
>
> Changeset 10887:1e05089bc991
> ---------------------------
> ruby: Fixed pipeline squashes caused by aliased requests
>
> This patch was created by Bihn Pham during his internship at AMD.
>
> This patch fixes a very significant performance bug when using the O3 CPU model and Ruby. The issue was that Ruby returned false when it received a request to an address that already had an outstanding request, or when the memory was blocked. As a result, O3 unnecessarily squashed the pipeline and re-executed instructions. This fix merges readRequestTable and writeRequestTable in the Sequencer into a single request table that keeps track of all requests and allows multiple outstanding requests to the same address. This prevents O3 from squashing the pipeline.
>
>
> Diffs
> -----
>
> src/mem/ruby/system/Sequencer.hh ebb3d0737aa72ec4fa24b6af9cf9a6b2a1109d18
> src/mem/ruby/system/Sequencer.cc ebb3d0737aa72ec4fa24b6af9cf9a6b2a1109d18
> src/mem/ruby/system/Sequencer.py ebb3d0737aa72ec4fa24b6af9cf9a6b2a1109d18
>
> Diff: http://reviews.gem5.org/r/2787/diff/
>
>
> Testing
> -------
>
>
> Thanks,
>
> Tony Gutierrez
>

_______________________________________________
gem5-dev mailing list
[email protected]
http://m5sim.org/mailman/listinfo/gem5-dev
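One practical takeaway from Joel's performance debugging above: with coalescing in the sequencer, max_outstanding_requests should be at least as large as the LSQ depth, or the sequencer fills before the LSQ can expose accesses to independent lines and MLP drops. In a config script this coupling can be made explicit with a small guard. The helper below is a hypothetical sketch (the parameter names LQEntries, SQEntries, and max_outstanding_requests are the gem5 O3 CPU and RubySequencer parameters of that era, but this function itself is not part of gem5):

```python
def size_sequencer(lq_entries, sq_entries, current_max_outstanding):
    """Pick a sequencer max_outstanding_requests value that will not
    throttle the LSQ: at least the total load+store queue depth."""
    lsq_depth = lq_entries + sq_entries
    return max(current_max_outstanding, lsq_depth)

# With common O3 defaults of 32 load and 32 store queue entries and a
# sequencer default of 16 outstanding requests, the LSQ depth wins:
print(size_sequencer(32, 32, 16))   # 64
```

In an actual configuration this would be applied before instantiation, roughly as `seq.max_outstanding_requests = size_sequencer(cpu.LQEntries, cpu.SQEntries, seq.max_outstanding_requests)`, so the two parameters can never silently drift apart.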
