Hello Marcelo,
Thanks for taking the time to dig into this.
> The only problem with the current implementation, as I understand it, is that
> there should be a check for loadFUs >= cachePorts. Therefore, since loads
> are executed first, there is no need to test for available cache ports. I
> understand that the arbitration mechanism you mentioned is already
> implemented.
In a sense, yes, but the arbitration is basically: "do loads, then, if ports
are left at the end of the cycle, do stores". This follows the spirit of the
SQ, which buffers stores until the cache becomes ready, but as soon as you set
the number of cache ports to less than (loads + stores)/cycle, this becomes
detrimental in some cases. Say I have two ports, and I can do two loads and
one store per cycle. If the program sustains two loads per cycle, then the SQ
will often become full and stall everything until no loads are ready to issue.
In that case, I would imagine the processor would attempt to be smarter by
prioritizing stores once in a while, but as it is modelled, it does not. My
patch makes the split more clear-cut, but I agree that we lose the ability to
share a cache port between loads and stores.
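To make the pathological case concrete, here is a minimal sketch of the
"loads first, stores get the leftover ports" arbitration described above.
This is illustrative code, not gem5's actual LSQ; the names mirror the
variables discussed but everything else is assumed:

```cpp
#include <algorithm>
#include <deque>

// Toy model of the shared-port arbitration (hypothetical sketch).
struct SharedPortLSQ {
    int cachePorts;          // D-cache ports shared by loads and stores
    int loadFUs;             // max loads issued per cycle
    int sqCapacity;          // store queue size
    std::deque<int> storeQ;  // pending stores awaiting writeback

    // One cycle: loads consume ports first, stores get what is left.
    // Returns the number of stores written back this cycle.
    int tick(int loadsReady) {
        int usedPorts = std::min(loadsReady, loadFUs);  // loads go first
        int storesWB = 0;
        while (!storeQ.empty() && usedPorts < cachePorts) {
            storeQ.pop_front();  // store writes back on a leftover port
            ++usedPorts;
            ++storesWB;
        }
        return storesWB;
    }

    bool sqFull() const { return (int)storeQ.size() >= sqCapacity; }
};
```

With cachePorts = 2, loadFUs = 2, and two loads ready every cycle, tick()
never hands a port to the stores, so any store-producing program eventually
stalls on a full SQ, exactly as in the scenario above.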
Essentially, solution 1 is to add a check for loadFUs > cachePorts and trigger
an abort or a warning if it is true (I don't see the point of having more load
FUs than ports, unless you crack loads into address and data and want to
compute more addresses than you can access the cache with), and to accept the
occasional misbehaving benchmark when loadFUs + storeWB_per_cycle >
cachePorts. Note that this will probably hurt x86 more, since in gem5's
implementation x86 can only have a single store in flight at any given time
(because of x86's memory model).
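A minimal sketch of that solution-1 check (the function name and the
return-a-bool shape are my own, not from any patch):

```cpp
#include <cstdio>

// Solution 1, sketched: with shared ports and loads served first, a
// configuration with more load FUs than cache ports can never be honored.
// Returns false (after warning) when the configuration is suspect; a
// caller could choose to abort instead.
bool validateCachePorts(int loadFUs, int cachePorts) {
    if (loadFUs > cachePorts) {
        std::fprintf(stderr,
                     "warning: loadFUs (%d) > cachePorts (%d); loads alone "
                     "can exhaust every cache port each cycle\n",
                     loadFUs, cachePorts);
        return false;
    }
    return true;
}
```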
Solution 2 is what my patch does (differentiating between load ports and store
ports).
I prefer the second one (obviously :)) because if I set cachePorts to 3 to
sustain 2 loads and 1 store per cycle, then during a cycle where I don't do
any load, I will be able to WB three stores, which is not realistic (once
again, this will not happen on x86).
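Solution 2 can be sketched like this (assumed from the description above;
"cacheStorePorts" is the parameter name quoted later in the thread, but the
code itself is illustrative, not the patch):

```cpp
#include <algorithm>
#include <deque>
#include <utility>

// Split-port model: loads are bounded only by load FUs, stores only by
// cacheStorePorts, so neither can steal the other's bandwidth.
struct SplitPortLSQ {
    int loadFUs;             // per-cycle load limit (load-side "ports")
    int cacheStorePorts;     // per-cycle store writeback limit
    std::deque<int> storeQ;  // pending stores

    // One cycle: returns {loads issued, stores written back}.
    std::pair<int, int> tick(int loadsReady) {
        int loads = std::min(loadsReady, loadFUs);  // unaffected by stores
        int storesWB = 0;
        while (!storeQ.empty() && storesWB < cacheStorePorts) {
            storeQ.pop_front();  // stores never exceed their own budget
            ++storesWB;
        }
        return {loads, storesWB};
    }
};
```

Even on a cycle with zero loads, at most cacheStorePorts stores write back,
so a 2-load/1-store budget can never turn into a three-store burst.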
> Your patch is correct, but it does change the model a little bit. Previously,
> cache ports could be used by both loads and stores (with preference for
> loads) and now we have dedicated ports for loads (= the amount of load FUs)
> and dedicated ports for stores (= cacheStorePorts). I'm not sure whether
> modern superscalar processors implement dedicated or shared cache ports
> (maybe someone working more closely with caches would know that?).
I think Intel reports being able to read 2x32B and write 32B per cycle to
their L1 cache (from http://www.realworldtech.com/haswell-cpu/5/ ). I don't
know if this is steady-state throughput, but if it is, I suppose assuming 2
loads and 1 store per cycle is reasonable.
> Regards,
> On Tue, Apr 26, 2016 at 8:32 AM, Arthur Perais < [email protected] >
> wrote:
> > Alright, I was also waiting for someone else to comment, but I will try to
> > submit a patch this week.
>
> > Best,
>
> > Arthur.
>
> > Le 25/04/2016 17:52, Andreas Hansson a écrit :
>
> > > Hi Arthur,
> >
>
> > > I just wanted to re-iterate that the solution you suggest sounds good.
> > > Could
> > > you also make sure that the comments (and possibly variable names) are
> > > updated to reflect the change?
> >
>
> > > Thanks,
> >
>
> > > Andreas
> >
>
> > > From: gem5-users < [email protected] > on behalf of Andreas
> > > Hansson
> > > < [email protected] >
> >
>
> > > Reply-To: gem5 users mailing list < [email protected] >
> >
>
> > > Date: Saturday, 23 April 2016 at 12:18
> >
>
> > > To: gem5 users mailing list < [email protected] >
> >
>
> > > Subject: Re: [gem5-users] o3cpu: cache ports
> >
>
> > > Hi Arthur,
> >
>
> > > I agree with your observations, but it would be good if someone more
> > > familiar
> > > with the o3 model could chime in.
> >
>
> > > Andreas
> >
>
> > > From: gem5-users < [email protected] > on behalf of Arthur
> > > Perais
> > > <
> > > [email protected] >
> >
>
> > > Reply-To: gem5 users mailing list < [email protected] >
> >
>
> > > Date: Tuesday, 19 April 2016 at 10:41
> >
>
> > > To: " [email protected] " < [email protected] >
> >
>
> > > Subject: [gem5-users] o3cpu: cache ports
> >
>
> > > Hi all,
> >
>
> > > In the O3 LSQ there is a variable called "cachePorts" which controls the
> > > number of stores that can be made each cycle (see lines 790-795 in
> > > lsq_unit_impl.hh).
> >
>
> > > cachePorts defaults to 200 (see O3CPU.py), so in practice, there is no
> > > limit
> > > on the number of stores that are written back to the D-Cache, and
> > > everything
> > > works out fine.
> >
>
> > > Now, silly me wanted to be a bit more realistic and set cachePorts to
> > > one,
> > > so
> > > that I could issue one store per cycle to the D-Cache only.
> >
>
> > > In a few SPEC programs, this caused the SQFullEvent to get very high,
> > > which
> > > I
> > > assumed was reasonable because, well, fewer stores per cycle. However,
> > > after
> > > looking into it, I found that the variable "usedPorts" (which allows
> > > stores
> > > to WB only if it is less than "cachePorts") is increased by stores when they
> > > WB
> > > (which is fine), but also by loads when they access the D-Cache (see
> > > lines
> > > 768 and 814 in lsq_unit.hh). However, the number of loads that can access
> > > the D-Cache each cycle is controlled by the number of load functional
> > > units,
> > > and not at all by "cachePorts".
> >
>
> > > This means that if I set cachePorts to 1, and I have 2 load FUs, I can do
> > > 2
> > > loads per cycle, but as soon as I do one load, then I cannot writeback
> > > any
> > > store this cycle (because "usedPorts" will already be 1 or 2 when gem5
> > > enters
> > > writebackStores() in lsq_unit_impl.hh). On the other hand, if I set
> > > cachePorts to 3 I can do 2 loads and one store per cycle, but I can also
> > > WB
> > > three stores in a single cycle, which is not what I wanted to be able to
> > > do.
> >
>
> > > This should be addressed by not increasing "usedPorts" when loads access
> > > the
> > > D-Cache and being explicit about what variable constrains what (i.e.,
> > > loads
> > > are constrained by load FUs and stores by "cachePorts"), or by also
> > > constraining loads on "cachePorts" (which will be hard since arbitration
> > > would potentially be needed between loads and stores, and since store WBs
> > > happen after load accesses in gem5, this can get messy). As of now, this
> > > is
> > > a bit of both, and performance looks fine at first, but it's really not.
> >
>
> > > I can write a small patch for the first solution (don't increase
> > > "usedPorts"
> > > on load accesses), but I am not sure this corresponds to the philosophy
> > > of
> > > the code. What do you think would be the best course of action?
> >
>
> > > Best,
> >
>
> > > Arthur.
> >
>
> > > --
> >
>
> > > Arthur Perais
> >
>
> > > INRIA Bretagne Atlantique
> >
>
> > > Bâtiment 12E, Bureau E303, Campus de Beaulieu
> >
>
> > > 35042 Rennes, France
> >
>
> > > IMPORTANT NOTICE: The contents of this email and any attachments are
> > > confidential and may also be privileged. If you are not the intended
> > > recipient, please notify the sender immediately and do not disclose the
> > > contents to any other person, use it for any purpose, or store or copy
> > > the
> > > information in any medium. Thank you.
> >
>
> > --
>
> > Arthur Perais
>
> > INRIA Bretagne Atlantique
>
> > Bâtiment 12E, Bureau E303, Campus de Beaulieu
>
> > 35042 Rennes, France
>
> --
> Marcelo Brandalero
> PhD student
> Programa de Pós Graduação em Computação
> Universidade Federal do Rio Grande do Sul
_______________________________________________
gem5-users mailing list
[email protected]
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users