Hello,

To my understanding, wbDepth represents a kind of "average and effective execution-stage depth": wbWidth * wbDepth is the maximum number of in-flight instructions allowed in the EXE stage, i.e., instructions that have issued but have not yet written back.

Given that, I agree that such buffers would be better modeled if they were distributed among the FUs, because each would then represent its FU's effective depth. I call it "effective depth" because in real hardware an FP functional unit (FU) may have 4 stages (a depth of 4), yet in gem5 we would set up an FMAC with a latency greater than 4, say 8 cycles, because the real hardware may have more than one pipe inside a single FU (e.g., FP ADD, FP MUL, etc.). We should then configure the FP FU with an effective EXE depth of 8 to simulate them correctly.

For now, I always set wbDepth = max(opLat / issueCycles), which means that issueWidth, issueCycles, and opLat generally dictate the maximum number of in-flight instructions in the EXE stage (wbDepth should have no major influence).
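To illustrate that rule, here is a minimal config sketch of how one could derive wbDepth from the FU pool. It assumes a 2014-era gem5 where DerivO3CPU still exposes wbWidth/wbDepth and OpDesc exposes opLat/issueLat (these parameter names have changed across versions); the helper effective_wb_depth() is hypothetical, not part of gem5:

```python
# Hypothetical helper: compute the "effective EXE depth" of the machine as
# wbDepth = max(opLat / issueLat) over every op class in the FU pool.
from m5.objects import DefaultFUPool, DerivO3CPU

def effective_wb_depth(fu_pool):
    depth = 1
    for fu in fu_pool.FUList:          # each FUDesc in the pool
        for op in fu.opList:           # each OpDesc the FU implements
            lat, iss = int(op.opLat), int(op.issueLat)
            depth = max(depth, (lat + iss - 1) // iss)  # ceil(lat / iss)
    return depth

cpu = DerivO3CPU()
cpu.fuPool = DefaultFUPool()
# With this setting, issueWidth, issueLat, and opLat dictate the number of
# in-flight EXE instructions, and wbMax = wbWidth * wbDepth never binds.
cpu.wbDepth = effective_wb_depth(cpu.fuPool)
```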
Thanks for this discussion.

Regards,

2014-05-13 16:32 GMT+02:00, Mitch Hayenga via gem5-users <gem5-users@gem5.org>:

I actually wrote a patch a while back (apparently Feb 20) that fixed the load-squash issue. I had more or less abandoned it, but it was able to run a few benchmarks (I never ran the regression tests on it). I'll revive it and see if it passes the regression tests.

All it did was force the load to be repeatedly replayed until it was successfully unblocked, rather than squashing the entire pipeline. I remember incrWb() and decrWb() were the most annoying part of writing it.

As a side note, I've generally found that increasing tgts_per_mshr to something unlikely to get hit largely eliminates the issue (this is why I abandoned the patch). You are still limiting the number of outstanding cache lines via the number of MSHRs, but you don't squash just because a bunch of loads all accessed the same line. This is probably a good temporary solution.
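For reference, that workaround is a one-line change in a cache config. A minimal sketch, assuming a gem5 tree of this vintage where the Python cache class is BaseCache (the sizes are illustrative, and the remaining required parameters, e.g. the latencies, are elided):

```python
# Workaround sketch: raise tgts_per_mshr so that many loads to the same
# cache line do not exhaust the per-MSHR target list and force a replay
# or squash. The number of outstanding lines is still bounded by 'mshrs'.
from m5.objects import BaseCache

class L1DCache(BaseCache):
    size = '32kB'        # illustrative
    assoc = 2            # illustrative
    mshrs = 16           # outstanding cache lines remain limited here
    tgts_per_mshr = 128  # large enough to be "unlikely to get hit"
```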
On Tue, May 13, 2014 at 3:09 AM, Vamsi Krishna via gem5-users <gem5-users@gem5.org> wrote:

Hello All,

As Paul mentioned, I tried to put together a small analysis of how the number of writeback buffers affects the performance of the PARSEC benchmarks when it is increased to 5x the default size. I found that the 2-wide processor improved by 22%, the 4-wide processor by 7%, and the 8-wide processor by 0.6% on average. This is mainly because of the increased effective issue width that comes with the increased availability of buffers. Clearly, only the effective writeback width should be affected, not the effective issue width, if this were modeled correctly. Long-latency instructions such as load misses decrease the issue width until the load completes, and narrower processors seem to suffer significantly because of this.

Regarding the issue where multiple accesses to the same block cause pipeline flushes, I posted this question earlier (http://comments.gmane.org/gmane.comp.emulators.m5.users/16657), but unfortunately the thread did not proceed further. It has a huge performance impact on the PARSEC benchmarks: up to 40% on the 8-wide processor, 29% on the 4-wide, and 13% on the 2-wide on average. It would be great to have a fix for this in gem5, because it causes unusually high flushing activity in the pipeline and hurts speculation.

Thanks,
Vamsi Krishna

On Mon, May 12, 2014 at 9:39 PM, Steve Reinhardt via gem5-users <gem5-users@gem5.org> wrote:

Paul,

Are you talking about the issue where multiple accesses to the same block cause Ruby to tell the core to retry, which in turn causes a pipeline flush? We've seen that too and have a patch that we've been intending to post... this discussion (and the earlier one about store prefetching) has inspired me to try to get that process started again.

Thanks for speaking up. I'd much rather have people point out problems, or better yet post patches for them, than stockpile them for a WDDD paper ;-).

Steve

On Mon, May 12, 2014 at 7:07 PM, Paul V. Gratz via gem5-users <gem5-users@gem5.org> wrote:

Hi All,

Agreed, thanks for confirming we were not missing something. As a follow-up, my student has some data on the performance impact he sees from this issue, which he'll post here shortly; it is quite large for a 2-wide OOO core. I was thinking it might be something along those lines (or something about the bypass-network width), but it seems like grabbing the buffers at issue time is probably too conservative (as opposed to grabbing them at completion and stalling the functional unit if you can't get one).

I believe Karu Sankaralingam at Wisconsin also found this and a few other issues; they have a related paper at WDDD this year.

We also found a problem where multiple outstanding loads to the same address cause heavy flushing in O3 with Ruby, with a similarly large performance impact; we'll start another thread on that shortly.

Thanks!
Paul

On Mon, May 12, 2014 at 3:51 PM, Mitch Hayenga via gem5-users <gem5-users@gem5.org> wrote:

*"Realistically, to me, it seems like those buffers would be distributed among the function units anyway, not a global resource, so having a global limit doesn't make a lot of sense. Does anyone else out there agree or disagree?"*

I believe that's more or less correct, with wbWidth probably meant to be the number of write ports on the register file and wbDepth the number of pipe stages for a multi-cycle writeback.

I don't fully agree that it should be distributed at the functional-unit level, as you could imagine designs with a higher issue width and more functional units than register-file write ports, essentially allowing more instructions to issue in a given cycle as long as they do not all complete in the same cycle.

Going back to Paul's issue (loads holding writeback slots on misses): the "proper" way to do it would probably be to reserve a slot assuming an L1 cache-hit latency, give up the slot on a miss, and have an early signal that a load miss is coming back from the cache, so that you could reserve a writeback slot in parallel with doing all the other necessary work for a load (CAMing against the store queue, etc.). But this would likely be annoying to implement.

*In general, though, yes, this seems like something not worth modeling in gem5, as the potential negative impacts of its current implementation outweigh the benefits. And the benefits of fully modeling it are likely small.*
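As a toy illustration of that slot policy (plain Python, not gem5 code; every name here is hypothetical), the bookkeeping would look roughly like this: reserve a slot optimistically at issue, release it on a miss, and re-reserve it when the cache's early fill signal arrives:

```python
# Toy sketch of the policy described above, not an actual implementation.
class WbSlots:
    def __init__(self, wb_max):
        self.free = wb_max          # wb_max = wbWidth * wbDepth

    def reserve(self):
        if self.free == 0:
            return False
        self.free -= 1
        return True

    def release(self):
        self.free += 1

slots = WbSlots(wb_max=2)

def issue_load(hits_in_l1):
    if not slots.reserve():         # optimistic: assume an L1 hit latency
        return "stall"
    if not hits_in_l1:
        slots.release()             # miss: don't hold the slot for the
        return "miss-pending"       # whole memory latency
    return "hit"

def fill_arrives():
    # Early signal from the cache: re-reserve a slot in parallel with the
    # other work for the returning load (store-queue CAM, etc.).
    return slots.reserve()

print(issue_load(True))    # hit: slot held until writeback
print(issue_load(False))   # miss: slot given back immediately
print(fill_arrives())      # True once the fill's early signal shows up
```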
On Mon, May 12, 2014 at 2:08 PM, Arthur Perais via gem5-users <gem5-users@gem5.org> wrote:

Hi all,

I have no specific knowledge of what the buffers model or what they should model, but I too ran into this issue some time ago. Setting a high wbDepth is how I work around it (actually, 3 is sufficient for me), because performance does suffer quite a lot in some cases (and even more so for narrow-issue cores if wbWidth == issueWidth, I would expect).

--
Arthur Perais
INRIA Bretagne Atlantique
Bâtiment 12E, Bureau E303, Campus de Beaulieu
35042 Rennes, France

On 12/05/2014 19:39, Steve Reinhardt via gem5-users wrote:

Hi Paul,

I assume you're talking about the 'wbMax' variable? I don't recall it specifically myself, but after looking at the code a bit, the best I can come up with is that there's assumed to be a finite number of buffers somewhere that hold results from the function units before they are written back to the reg file. Realistically, to me, it seems like those buffers would be distributed among the function units anyway, not a global resource, so having a global limit doesn't make a lot of sense. Does anyone else out there agree or disagree?

It doesn't seem to relate to any structure that's directly modeled in the code; i.e., I think you could rip the whole thing out (incrWb(), decrWb(), wbOutstanding, wbMax) without breaking anything in the model... which would be a good thing if in fact everyone else is either suffering unaware or just working around it by setting a large value for wbDepth.

That said, we've done some internal performance-correlation work, and I don't recall this being an issue, for whatever that's worth. I know ARM has done some correlation work too; have you run into this?

Steve

On Fri, May 9, 2014 at 7:45 AM, Paul V. Gratz via gem5-users <gem5-users@gem5.org> wrote:

Hi All,

Doing some digging into performance issues in the O3 model, we and others have found that allocation of the writeback buffers has a big performance impact. Basically, a writeback buffer is grabbed at issue time and held through completion. With the default assumptions about the number of available writeback buffers (x * issue width, where x is 1 by default), the buffers often end up bottlenecking the effective issue width (particularly in the face of long-latency loads grabbing all the WB buffers). What are these structures trying to model? I can see limiting the number of instructions allowed to complete and writeback/bypass in a cycle, but if that is the intent, this ends up being much more conservative than that. If not, why does it do this? We can easily make the number of WB buffers high, but we want to understand what is going on here first...

Thanks!
Paul

--
-----------------------------------------
Paul V. Gratz
Assistant Professor
ECE Dept, Texas A&M University
Office: 333M WERC
Phone: 979-488-4551
http://cesg.tamu.edu/faculty/paul-gratz/
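For anyone who just wants the workaround discussed in this thread: making the writeback buffering deep enough that it never binds is a two-line config change. A minimal sketch, assuming a 2014-era DerivO3CPU that still exposes wbWidth/wbDepth (the widths below are illustrative):

```python
# Workaround sketch: oversize the writeback buffering so that
# wbMax = wbWidth * wbDepth no longer throttles the effective issue width.
from m5.objects import DerivO3CPU

cpu = DerivO3CPU()
cpu.issueWidth = 2
cpu.wbWidth = 2     # match the issue width: one WB slot per issue slot
cpu.wbDepth = 3     # default is 1; Arthur reports 3 already suffices
```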
--
Fernando A. Endo, PhD student and researcher

Université de Grenoble, UJF
France

_______________________________________________
gem5-users mailing list
gem5-users@gem5.org
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users