Hi Arthur,

Thanks for the explanation. I wasn't aware that X86's memory model in gem5
constrained the CPU to 1 store/cycle, and I agree that 3 stores/cycle would
be unrealistic.

I think your patch looks good. It certainly makes it clearer what is going
on and what constrains what. In the case of X86, it also makes sense
considering that (apparently) recent processors have a dedicated cache
write port.

Regards,


On Thu, Apr 28, 2016 at 2:08 PM, Arthur Perais <arthur.per...@inria.fr>
wrote:

> Hello Marcelo,
>
> Thanks for taking the time to dig into this.
>
> The only problem with the current implementation, as I understand it, is
> that there should be a check for loadFUs >= cachePorts. Therefore, since
> loads are executed first, there is no need to test for available cache
> ports. I understand that the arbitration mechanism you mentioned is already
> implemented.
>
>
> In a sense, yes, but the arbitration is basically: "do loads, then if
> ports are left at the end of the cycle, do stores". This follows the
> spirit of the SQ, which buffers stores until the cache becomes ready,
> but as soon as you set the number of cache ports to less than (loads +
> stores)/cycle, this is going to be detrimental in some cases. Say I have
> two ports, and I can do two loads and one store per cycle. If the
> program sustains two loads per cycle, then the SQ will often become full
> and stall everything until no loads are ready to issue. In that case, I
> would imagine a real processor would try to be smarter by prioritizing
> stores once in a while, but as modelled, it does not. My patch makes it
> more clear cut, but I agree that we lose the ability to share a cache
> port between loads and stores.
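A toy Python sketch (not gem5 code) of the loads-first arbitration described above: with two shared cache ports, a program sustaining two loads per cycle leaves no port for store writeback, so the store queue only grows until it is full. The capacity and port counts are illustrative assumptions.

```python
SQ_CAPACITY = 8   # illustrative store queue size
CACHE_PORTS = 2   # shared load/store cache ports

def simulate(cycles, loads_per_cycle, stores_per_cycle):
    """Return SQ occupancy after each cycle under loads-first arbitration."""
    sq = 0
    occupancy = []
    for _ in range(cycles):
        ports = CACHE_PORTS
        ports -= min(loads_per_cycle, ports)           # loads grab ports first
        sq = min(sq + stores_per_cycle, SQ_CAPACITY)   # new stores enter the SQ
        sq -= min(sq, ports)                           # leftover ports drain stores
        occupancy.append(sq)
    return occupancy

# Sustained 2 loads + 1 store per cycle: no port is ever left, the SQ fills.
print(simulate(10, loads_per_cycle=2, stores_per_cycle=1))
# One load per cycle leaves a port free, so stores drain and the SQ stays empty.
print(simulate(10, loads_per_cycle=1, stores_per_cycle=1))
```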
>
> Essentially, solution 1 is to add a check loadFUs > cachePorts and
> trigger an abort or a warning if it is true (I don't see the point of
> having more load FUs than ports, unless you crack loads into address and
> data and want to compute more addresses than you can access the cache
> with), and to accept the occasional misbehaving benchmark when loadFUs +
> storeWB_per_cycle > cachePorts. Note that this will probably hurt x86
> more, since x86 can only have a single store in flight at any given time
> in gem5's implementation (because of x86's memory model).
> Solution 2 is what my patch does (differentiating between load ports and
> store ports).
>
> I prefer the second one (obviously :)) because if I set cachePorts to 3
> to sustain 2 loads and 1 store per cycle, then during a cycle where I
> don't do any loads, I will be able to WB three stores, which is not
> realistic (note that, once again, this will not happen in x86).
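The contrast between the two solutions can be sketched as per-cycle store writeback budgets; the port counts below (a shared pool of 3 vs. dedicated 2 load + 1 store ports) are the hypothetical numbers from the example above, not gem5 defaults.

```python
def store_wb_budget_shared(loads_this_cycle, cache_ports=3):
    # Shared pool: loads consume ports first, stores get whatever remains.
    return max(cache_ports - loads_this_cycle, 0)

def store_wb_budget_dedicated(loads_this_cycle, store_ports=1):
    # Dedicated ports: stores keep their own port(s) regardless of load activity.
    return store_ports

# With no loads this cycle, the shared model can write back 3 stores at once,
# while the dedicated model caps store writeback at 1 per cycle.
print(store_wb_budget_shared(0), store_wb_budget_dedicated(0))
# With 2 loads this cycle, both models allow exactly 1 store writeback.
print(store_wb_budget_shared(2), store_wb_budget_dedicated(2))
```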
>
>
> Your patch is correct, but it does change the model a little bit.
> Previously, cache ports could be used by both loads and stores (with
> preference for loads) and now we have dedicated ports for loads (= the
> amount of load FUs) and dedicated ports for stores (= cacheStorePorts). I'm
> not sure whether modern superscalar processors implement dedicated or
> shared cache ports (maybe someone working more closely with caches would
> know that?).
>
>
> I think Intel reports being able to read 2x32B and write 32B per cycle to
> their L1Cache (from http://www.realworldtech.com/haswell-cpu/5/). I don't
> know if this is steady throughput, but if it is, I suppose assuming 2 loads
> and 1 store per cycle is reasonable.
>
>
>
> Regards,
>
>
> On Tue, Apr 26, 2016 at 8:32 AM, Arthur Perais <arthur.per...@inria.fr>
> wrote:
>
>> Alright, I was also waiting for someone else to comment, but I will try
>> to submit a patch this week.
>>
>> Best,
>>
>> Arthur.
>>
>>
>> Le 25/04/2016 17:52, Andreas Hansson a écrit :
>>
>> Hi Arthur,
>>
>> I just wanted to re-iterate that the solution you suggest sounds good.
>> Could you also make sure that the comments (and possibly variable names)
>> are updated to reflect the change?
>>
>> Thanks,
>>
>> Andreas
>>
>> From: gem5-users <gem5-users-boun...@gem5.org> on behalf of Andreas
>> Hansson <andreas.hans...@arm.com>
>> Reply-To: gem5 users mailing list <gem5-users@gem5.org>
>> Date: Saturday, 23 April 2016 at 12:18
>> To: gem5 users mailing list <gem5-users@gem5.org>
>> Subject: Re: [gem5-users] o3cpu: cache ports
>>
>> Hi Arthur,
>>
>> I agree with your observations, but it would be good if someone more
>> familiar with the o3 model could chime in.
>>
>> Andreas
>>
>> From: gem5-users <gem5-users-boun...@gem5.org> on behalf of Arthur
>> Perais <arthur.per...@inria.fr>
>> Reply-To: gem5 users mailing list <gem5-users@gem5.org>
>> Date: Tuesday, 19 April 2016 at 10:41
>> To: "gem5-users@gem5.org" <gem5-users@gem5.org>
>> Subject: [gem5-users] o3cpu: cache ports
>>
>> Hi all,
>>
>> In the O3 LSQ there is a variable called "cachePorts" which controls the
>> number of stores that can be made each cycle (see lines 790-795 in
>> lsq_unit_impl.hh).
>> cachePorts defaults to 200 (see O3CPU.py), so in practice, there is no
>> limit on the number of stores that are written back to the D-Cache, and
>> everything works out fine.
>>
>> Now, silly me wanted to be a bit more realistic and set cachePorts to
>> one, so that I could issue only one store per cycle to the D-Cache.
>> In a few SPEC programs, this caused the SQFullEvent count to get very
>> high, which I assumed was reasonable because, well, fewer stores per
>> cycle.
>> However, after looking into it, I found that the variable "usedPorts"
>> (which allows stores to WB only while it is less than "cachePorts") is
>> increased by stores when they WB (which is fine), but also by *loads*
>> when they access the D-Cache (see lines 768 and 814 in lsq_unit.hh).
>> The number of loads that can access the D-Cache each cycle, however,
>> is controlled by the number of load functional units, and not at all
>> by "cachePorts".
>>
>> This means that if I set cachePorts to 1 and I have 2 load FUs, I can
>> do 2 loads per cycle, but as soon as I do one load, I cannot write back
>> any store that cycle (because "usedPorts" will already be 1 or 2 when
>> gem5 enters writebackStores() in lsq_unit_impl.hh). On the other hand,
>> if I set cachePorts to 3, I can do 2 loads and one store per cycle, but
>> I can also WB three stores in a single cycle, which is not what I
>> wanted to be able to do.
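A small Python model (a sketch of the accounting described above, not actual gem5 code) reproduces both symptoms: loads are gated only by load FUs yet still bump usedPorts, and store writeback then checks usedPorts against cachePorts.

```python
def stores_written_back(cache_ports, load_fus, loads_issued, stores_ready):
    """Stores that can WB in one cycle under the current accounting."""
    # Loads are limited by load FUs, but each one still consumes a "cache port".
    used_ports = min(loads_issued, load_fus)
    # writebackStores(): each store WB requires usedPorts < cachePorts.
    wb = 0
    while wb < stores_ready and used_ports < cache_ports:
        used_ports += 1
        wb += 1
    return wb

# cachePorts=1, 2 load FUs: a single load already blocks all store writeback.
print(stores_written_back(1, 2, loads_issued=1, stores_ready=1))  # 0
# cachePorts=3, no loads this cycle: three stores WB in a single cycle.
print(stores_written_back(3, 2, loads_issued=0, stores_ready=3))  # 3
```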
>>
>> This should be addressed either by not increasing "usedPorts" when
>> loads access the D-Cache and being explicit about which variable
>> constrains what (i.e., loads are constrained by load FUs and stores by
>> "cachePorts"), or by also constraining loads on "cachePorts" (which
>> would be harder, since arbitration would potentially be needed between
>> loads and stores, and since store WBs happen after load accesses in
>> gem5, this can get messy). As of now, it is a bit of both, and
>> performance looks fine at first glance, but it's really not.
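The first solution can be sketched as follows (again a toy model, not a gem5 patch): loads no longer touch usedPorts, so the per-cycle store writeback budget is exactly cachePorts, independent of load activity.

```python
def cycle_fixed(cache_ports, load_fus, loads_issued, stores_ready):
    """(loads done, stores WB) in one cycle with the proposed fix applied."""
    loads_done = min(loads_issued, load_fus)  # loads gated by load FUs only
    used_ports = 0                            # stores alone consume cachePorts
    wb = 0
    while wb < stores_ready and used_ports < cache_ports:
        used_ports += 1
        wb += 1
    return loads_done, wb

# cachePorts=1, 2 load FUs: 2 loads and 1 store can now coexist in a cycle,
# and no more than 1 store can ever WB per cycle.
print(cycle_fixed(1, 2, loads_issued=2, stores_ready=3))  # (2, 1)
```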
>>
>> I can write a small patch for the first solution (don't increase
>> "usedPorts" on load accesses), but I am not sure this corresponds to the
>> philosophy of the code. What do you think would be the best course of
>> action?
>>
>> Best,
>>
>> Arthur.
>>
>> --
>> Arthur Perais
>> INRIA Bretagne Atlantique
>> Bâtiment 12E, Bureau E303, Campus de Beaulieu
>> 35042 Rennes, France
>>
>> IMPORTANT NOTICE: The contents of this email and any attachments are
>> confidential and may also be privileged. If you are not the intended
>> recipient, please notify the sender immediately and do not disclose the
>> contents to any other person, use it for any purpose, or store or copy the
>> information in any medium. Thank you.
>>
>> _______________________________________________
>> gem5-users mailing list
>> gem5-users@gem5.org
>> http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
>>
>>
>>
>> --
>> Arthur Perais
>> INRIA Bretagne Atlantique
>> Bâtiment 12E, Bureau E303, Campus de Beaulieu
>> 35042 Rennes, France
>>
>>
>>
>
>
>
> --
> Marcelo Brandalero
> PhD student
> Programa de Pós Graduação em Computação
> Universidade Federal do Rio Grande do Sul
>
>
>
>
>



-- 
Marcelo Brandalero
PhD student
Programa de Pós Graduação em Computação
Universidade Federal do Rio Grande do Sul
