Hello,

To my understanding, wbDepth represents a kind of "average, effective
execution-stage depth": wbWidth*wbDepth is the maximum number of
in-flight instructions allowed in the EXE stage, i.e. instructions
that have issued but have not yet written back.
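
If it helps, here is a minimal sketch of how I understand that
relationship (parameter names from the current O3 model; please
double-check them against src/cpu/o3/O3CPU.py and the IEW code, since
this is how I read it, not an authoritative statement):

    # Hypothetical O3 configuration illustrating wbMax = wbWidth * wbDepth,
    # i.e. the cap on issued-but-not-yet-written-back instructions.
    from m5.objects import DerivO3CPU

    cpu = DerivO3CPU()
    cpu.issueWidth = 4
    cpu.wbWidth = 4   # register-file write ports usable per cycle
    cpu.wbDepth = 1   # "effective EXE depth"; here wbMax = 4 * 1 = 4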

Given that, I agree that such buffers would be better used if they were
distributed among the FUs, because then they would represent each FU's
effective depth.

I call it "effective depth" because in real hardware the FP functional
unit (FU) may have 4 stages (then its depth is 4), but in gem5 we
would setup a FMAC with a latency greater than 4, say 8 cycles,
because in real hardware there may be more than one pipe inside a FU
(e.g FP ADD, FP MUL, etc). Then, we should set up the FP FU with an
effective EXE depth of 8 to correctly simulate them.
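
As a concrete (hypothetical) sketch of what I mean, in the FUDesc/OpDesc
style of src/cpu/o3/FuncUnitConfig.py (the opClass names and latencies
below are purely illustrative, not taken from any real core):

    from m5.objects import FUDesc, OpDesc

    # One FP "unit" standing in for several real pipes (ADD, MUL, ...):
    # opLat = 8 gives it the effective EXE depth of 8 discussed above.
    class FP_Unit(FUDesc):
        opList = [ OpDesc(opClass='FloatAdd',  opLat=8),
                   OpDesc(opClass='FloatMult', opLat=8) ]
        count = 2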

For the time being, I always set wbDepth = max(opLat/issueCycles), which
means that issueWidth, issueCycles, and opLat generally dictate the
maximum number of in-flight instructions in the EXE stage (wbDepth
itself should have no major influence).
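
In practice I just compute that by hand, e.g. something like the
following plain-Python sketch (the opLat/issueCycles numbers here are
made up for illustration):

    # (opLat, issueCycles) pairs for the ops in the FU pool, invented here:
    op_timings = [(1, 1),     # simple integer ALU
                  (3, 1),     # integer multiply
                  (8, 1),     # FP ops, pipelined
                  (20, 19)]   # integer divide, mostly unpipelined
    # wbDepth = max over ops of ceil(opLat / issueCycles)
    wb_depth = max(-(-op_lat // issue) for op_lat, issue in op_timings)
    print(wb_depth)  # 8 with these numbers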

Thanks for this discussion.

Regards,

2014-05-13 16:32 GMT+02:00, Mitch Hayenga via gem5-users <gem5-users@gem5.org>:
> I actually wrote a patch a while back (apparently Feb 20) that fixed the
> load squash issue.  I kind of abandoned it, but it was able to run a few
> benchmarks (never ran the regression tests on it).  I'll revive that and
> see if it passes the regression tests.
>
> All it did was force the load to be repeatedly replayed until it was no
> longer blocked, rather than squashing the entire pipeline. I remember
> incrWb() and decrWb() were the most annoying part of writing it.
>
> As a side note, I've found generally that increasing tgts_per_mshr to
> something unlikely to get hit largely eliminates the issue (this is why I
> abandoned the patch).  You are still limiting the number of outstanding
> cache lines to a specific number via the number of MSHRs, but don't squash
> just because a bunch of loads all accessed the same line.   This is
> probably a good temporary solution.
>
>
>
> On Tue, May 13, 2014 at 3:09 AM, Vamsi Krishna via gem5-users <
> gem5-users@gem5.org> wrote:
>
>> Hello All,
>>
>> As Paul was mentioning, I tried to come up with a small analysis of how
>> the number of writeback buffers affects the performance of the PARSEC
>> benchmarks when increased to 5x the default size. I found that the 2-wide
>> processor improved by 22%, the 4-wide by 7%, and the 8-wide by 0.6% in
>> performance on average. This is mainly because of the increased effective
>> issue width that results from the greater availability of buffers.
>> Clearly, if modeled correctly, only the effective writeback width should
>> be affected, not the effective issue width. Long-latency instructions
>> such as load misses reduce the effective issue width until the load
>> completes, and narrower processors seem to suffer significantly because
>> of this.
>>
>> Regarding the issue where multiple accesses to the same block cause
>> pipeline flushes, I posted this question earlier (
>> http://comments.gmane.org/gmane.comp.emulators.m5.users/16657), but
>> unfortunately the thread did not proceed further. It has a huge
>> performance impact in the PARSEC benchmarks: up to 40% on the 8-wide
>> processor, 29% on the 4-wide, and 13% on the 2-wide on average. It would
>> be great to have a fix for this in gem5 because it causes unusually high
>> flushing activity in the pipeline and affects speculation.
>>
>> Thanks,
>> Vamsi Krishna
>>
>>
>> On Mon, May 12, 2014 at 9:39 PM, Steve Reinhardt via gem5-users <
>> gem5-users@gem5.org> wrote:
>>
>>> Paul,
>>>
>>> Are you talking about the issue where multiple accesses to the same
>>> block
>>> cause Ruby to tell the core to retry, which in turn causes a pipeline
>>> flush?  We've seen that too and have a patch that we've been intending
>>> to
>>> post... this discussion (and the earlier one about store prefetching)
>>> have
>>> inspired me to try and get that process started again.
>>>
>>> Thanks for speaking up.  I'd much rather have people point out problems,
>>> or better yet post patches for them, than stockpile them for a WDDD
>>> paper
>>> ;-).
>>>
>>> Steve
>>>
>>>
>>>
>>> On Mon, May 12, 2014 at 7:07 PM, Paul V. Gratz via gem5-users <
>>> gem5-users@gem5.org> wrote:
>>>
>>>> Hi All,
>>>> Agreed, thanks for confirming we were not missing something.  Just some
>>>> followup: my student has some data about this that he'll post here
>>>> shortly, with the performance impact he sees for this issue, but it is
>>>> quite large for 2-wide OOO.  I was thinking it might be something along
>>>> those lines (or something about the bypass network width), but it seems
>>>> like grabbing the buffers at issue time is probably too conservative
>>>> (as opposed to grabbing them at completion and stalling the functional
>>>> unit if you can't get one).
>>>>
>>>> I believe Karu Sankaralingham at Wisc also found this and a few other
>>>> issues, they have a related paper at WDDD this year.
>>>>
>>>> We also found a problem where multiple outstanding loads to the same
>>>> address cause heavy flushing in O3 w/ Ruby, with a similarly large
>>>> performance impact; we'll start another thread on that shortly.
>>>> Thanks!
>>>> Paul
>>>>
>>>>
>>>>
>>>> On Mon, May 12, 2014 at 3:51 PM, Mitch Hayenga via gem5-users <
>>>> gem5-users@gem5.org> wrote:
>>>>
>>>>> *"Realistically, to me, it seems like those buffers would be
>>>>> distributed among the function units anyway, not a global resource, so
>>>>> having a global limit doesn't make a lot of sense.  Does anyone else
>>>>> out
>>>>> there agree or disagree?"*
>>>>>
>>>>> I believe that's more or less correct, with wbWidth probably meant to
>>>>> be the # of write ports on the register file and wbDepth the number of
>>>>> pipe stages for a multi-cycle writeback.
>>>>>
>>>>> I don't fully agree that it should be distributed at the functional
>>>>> unit level, as you could imagine designs with a higher issue width and
>>>>> more functional units than register-file write ports, essentially
>>>>> allowing more instructions to be issued in a given cycle as long as
>>>>> they do not all complete in the same cycle.
>>>>>
>>>>> Going back to Paul's issue (loads holding write back slots on misses).
>>>>>  The "proper" way to do it would probably be to reserve a slot assuming
>>>>> an
>>>>> L1 cache hit latency.  Give up the slot on a miss.  Have an early
>>>>> signal
>>>>> that a load-miss is coming back from the cache so that you could
>>>>> reserve a
>>>>> write back slot in parallel with doing all the other necessary work for
>>>>> a
>>>>> load (CAMing vs the store queue, etc). But this would likely be
>>>>> annoying to
>>>>> implement.
>>>>>
>>>>>
>>>>> *In general though, yes this seems like something not worth modeling
>>>>> in
>>>>> gem5 as the potential negative impacts of its current implementation
>>>>> outweigh the benefits.  And the benefits of fully modeling it are
>>>>> likely
>>>>> small.*
>>>>>
>>>>>
>>>>>
>>>>> On Mon, May 12, 2014 at 2:08 PM, Arthur Perais via gem5-users <
>>>>> gem5-users@gem5.org> wrote:
>>>>>
>>>>>>  Hi all,
>>>>>>
>>>>>> I have no specific knowledge of what the buffers are modeling or what
>>>>>> they should be modeling, but I too encountered this issue some time
>>>>>> ago. Setting a high wbDepth is how I work around it (actually, 3 is
>>>>>> sufficient for me), because performance does suffer quite a lot in
>>>>>> some cases (and even more, I would expect, for narrow-issue cores
>>>>>> where wbWidth == issueWidth).
>>>>>>
>>>>>> On 12/05/2014 19:39, Steve Reinhardt via gem5-users wrote:
>>>>>>
>>>>>> Hi Paul,
>>>>>>
>>>>>>  I assume you're talking about the 'wbMax' variable?  I don't recall
>>>>>> it specifically myself, but after looking at the code a bit, the best
>>>>>> I can
>>>>>> come up with is that there's assumed to be a finite number of buffers
>>>>>> somewhere that hold results from the function units before they write
>>>>>> back
>>>>>> to the reg file.  Realistically, to me, it seems like those buffers
>>>>>> would
>>>>>> be distributed among the function units anyway, not a global resource,
>>>>>> so
>>>>>> having a global limit doesn't make a lot of sense.  Does anyone else
>>>>>> out
>>>>>> there agree or disagree?
>>>>>>
>>>>>>  It doesn't seem to relate to any structure that's directly modeled
>>>>>> in the code, i.e., I think you could rip the whole thing out
>>>>>> (incrWb(),
>>>>>> decrWb(), wbOutstanding, wbMax) without breaking anything in the
>>>>>> model...
>>>>>> which would be a good thing if in fact everyone else is either
>>>>>> suffering
>>>>>> unaware or just working around it by setting a large value for
>>>>>> wbDepth.
>>>>>>
>>>>>>  That said, we've done some internal performance correlation work,
>>>>>> and I don't recall this being an issue, for whatever that's worth.  I
>>>>>> know
>>>>>> ARM has done some correlation work too; have you run into this?
>>>>>>
>>>>>>  Steve
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, May 9, 2014 at 7:45 AM, Paul V. Gratz via gem5-users <
>>>>>> gem5-users@gem5.org> wrote:
>>>>>>
>>>>>>> Hi All,
>>>>>>> Doing some digging on performance issues in the O3 model, we and
>>>>>>> others have run into the allocation of the writeback buffers having
>>>>>>> a big performance impact.  Basically, a writeback buffer is grabbed
>>>>>>> at issue time and held through until completion.  With the default
>>>>>>> assumptions about the number of available writeback buffers
>>>>>>> (x * issue width, where x is 1 by default), the buffers often end up
>>>>>>> bottlenecking the effective issue width (particularly in the face of
>>>>>>> long-latency loads grabbing up all the WB buffers).  What are these
>>>>>>> structures trying to model?  I can see limiting the number of
>>>>>>> instructions allowed to complete and writeback/bypass in a cycle,
>>>>>>> but this ends up being much more conservative than that if that is
>>>>>>> the intent.  If not, why does it do this?  We can easily make the
>>>>>>> number of WB buffers high, but we want to understand what is going
>>>>>>> on here first...
>>>>>>> Thanks!
>>>>>>>  Paul
>>>>>>>
>>>>>>>  --
>>>>>>> -----------------------------------------
>>>>>>> Paul V. Gratz
>>>>>>> Assistant Professor
>>>>>>> ECE Dept, Texas A&M University
>>>>>>> Office: 333M WERC
>>>>>>> Phone: 979-488-4551
>>>>>>> http://cesg.tamu.edu/faculty/paul-gratz/
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Arthur Perais
>>>>>> INRIA Bretagne Atlantique
>>>>>> Bâtiment 12E, Bureau E303, Campus de Beaulieu
>>>>>> 35042 Rennes, France
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> -----------------------------------------
>>>> Paul V. Gratz
>>>> Assistant Professor
>>>> ECE Dept, Texas A&M University
>>>> Office: 333M WERC
>>>> Phone: 979-488-4551
>>>> http://cesg.tamu.edu/faculty/paul-gratz/
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>>
>> --
>> Regards,
>> Vamsi Krishna
>>
>>
>


-- 
--
Fernando A. Endo, PhD student and researcher

Université de Grenoble, UJF
France
