Majid,

These are all great suggestions! Do you have a configuration file that you
would be willing to share? It would be a huge benefit to the community if
we had some better default configurations among the gem5 example
configuration files.

We're also trying to use the new standard library for these kinds of "good"
configurations. We can work with you to create a "prebuilt board" with all
of these parameters and even run nightly/weekly tests to make sure there
are no performance regressions.
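
To give a concrete idea of what such a prebuilt board could look like, here
is a rough, untested sketch built from standard library components. The
cache/memory sizes are placeholders, and the exact module paths and method
names may differ slightly between gem5 releases:

from gem5.components.boards.simple_board import SimpleBoard
from gem5.components.cachehierarchies.classic.private_l1_private_l2_cache_hierarchy import (
    PrivateL1PrivateL2CacheHierarchy,
)
from gem5.components.memory.single_channel import SingleChannelDDR3_1600
from gem5.components.processors.simple_processor import SimpleProcessor
from gem5.components.processors.cpu_types import CPUTypes
from gem5.resources.resource import CustomResource
from gem5.simulate.simulator import Simulator

# Placeholder "bandwidth-friendly" parameters: an O3 core with large
# private caches in front of a single DDR3-1600 channel.
cache_hierarchy = PrivateL1PrivateL2CacheHierarchy(
    l1d_size="256kB", l1i_size="256kB", l2_size="8MB"
)
memory = SingleChannelDDR3_1600(size="2GiB")
processor = SimpleProcessor(cpu_type=CPUTypes.O3, num_cores=1)

board = SimpleBoard(
    clk_freq="3GHz",
    processor=processor,
    memory=memory,
    cache_hierarchy=cache_hierarchy,
)

# Run a locally compiled STREAM binary in SE mode.
board.set_se_binary_workload(CustomResource("../stream/stream"))

Simulator(board=board).run()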

Thanks!
Jason

On Fri, Apr 22, 2022 at 7:52 PM Majid Jalili <majid...@gmail.com> wrote:

> I think it is hard to get to a real machine's level in terms of BW. But by
> looking at your stats, I found that lsqFullEvents is high.
> You can go after the CPU to make it more aggressive; increasing the
> Load/Store queue size and the ROB depth are the minimal changes you can
> make. I usually use ROB sizes of at least 256 or 320. With that, you may
> set the LSQ size to at least 1/4 of the ROB size (see the example
> parameters below).
> For MSHRs, your numbers are good now; 10 is too little even on Intel
> machines, and I found recently that they increased that to 16-20.
> The other thing you can try to tune is the cache latencies; make sure
> that they are reasonable.
> For the prefetcher, you can use IMPPrefetcher in addition to DCPT; it has
> a pretty aggressive stream prefetcher inside.
> Also, DRAM memory mapping is important; I do not remember what the
> default is for the mem type you are using.
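>
> For example (untested, and please double-check these O3CPU parameter names
> against your gem5 version), adding something along these lines to your
> se.py command would bump the ROB to 320 and the LSQ to 1/4 of that:
>
> --param="system.cpu[0].numROBEntries=320;system.cpu[0].LQEntries=80;system.cpu[0].SQEntries=80"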
>
> Majid
>
>
>
> On Sat, Apr 16, 2022 at 2:12 AM 王子聪 <wangzic...@nudt.edu.cn> wrote:
>
>> Hi Majid,
>>
>> Thanks for your suggestion! I checked the default number of MSHRs (in
>> configs/common/Caches.py) and found that the default #MSHRs for L1/L2 are
>> 4 and 20 respectively.
>>
>> The PACT'18 paper "Cimple: Instruction and Memory Level Parallelism: A
>> DSL for Uncovering ILP and MLP" says that "Modern processors typically
>> have 6-10 L1 cache MSHRs" and that "Intel's Haswell microarchitecture
>> uses 10 L1 MSHRs (Line Fill Buffers) for handling outstanding L1 misses".
>> So I changed the L1 #MSHRs to 16 and the L2 #MSHRs to 32 (which I think
>> is enough to handle outstanding misses), and then changed the L1/L2
>> prefetcher type to DCPT. I then got the STREAM output shown below:
>>
>> ./build/X86/gem5.opt configs/example/se.py --cpu-type=O3CPU --caches
>> --l1d_size=256kB --l1i_size=256kB
>> --param="system.cpu[0].dcache.mshrs=16;system.cpu[0].icache.mshrs=16;system.l2.mshrs=32"
>> --l2cache --l2_size=8MB --l1i-hwp-type=DCPTPrefetcher
>> --l1d-hwp-type=DCPTPrefetcher --l2-hwp-type=DCPTPrefetcher
>> --mem-type=DDR3_1600_8x8 -c ../stream/stream
>> -------------------------------------------------------------
>> Function    Best Rate MB/s  Avg time     Min time     Max time
>> Copy:            3479.8     0.004598     0.004598     0.004598
>> Scale:           3554.0     0.004502     0.004502     0.004502
>> Add:             4595.0     0.005223     0.005223     0.005223
>> Triad:           4705.9     0.005100     0.005100     0.005100
>> -------------------------------------------------------------
>>
>> The DRAM bus utilization (busUtil) also improved:
>> -------------------------------------------------------------
>> system.mem_ctrls.dram.bytesRead          239947840  # Total bytes read
>> (Byte)
>> system.mem_ctrls.dram.bytesWritten       121160640  # Total bytes written
>> (Byte)
>> system.mem_ctrls.dram.avgRdBW          1611.266685  # Average DRAM read
>> bandwidth in MiBytes/s ((Byte/Second))
>> system.mem_ctrls.dram.avgWrBW           813.602251  # Average DRAM write
>> bandwidth in MiBytes/s ((Byte/Second))
>> system.mem_ctrls.dram.peakBW              12800.00  # Theoretical peak
>> bandwidth in MiByte/s ((Byte/Second))
>> system.mem_ctrls.dram.busUtil                18.94  # Data bus
>> utilization in percentage (Ratio)
>> system.mem_ctrls.dram.busUtilRead            12.59  # Data bus
>> utilization in percentage for reads (Ratio)
>> system.mem_ctrls.dram.busUtilWrite            6.36  # Data bus
>> utilization in percentage for writes (Ratio)
>> system.mem_ctrls.dram.pageHitRate            89.16  # Row buffer hit
>> rate, read and write combined (Ratio)
>> -------------------------------------------------------------
>>
>> This indeed improves the achieved bandwidth, but it is still quite far
>> from the peak bandwidth of DDR3_1600 (12800 MiB/s). stats.txt is uploaded
>> for reference (
>> https://gist.github.com/wzc314/cf29275f853ee0b2fcd865f9b492c355).
>>
>> Any idea is appreciated!
>> Thank you in advance!
>>
>> Bests,
>> Zicong
>>
>>
>>
>> On Apr 16, 2022, at 00:08, Majid Jalili <majid...@gmail.com> wrote:
>>
>> Hi,
>> Make sure your system has enough MSHRs; out of the box, L1 and L2 are
>> set to have only a few MSHR entries.
>> Also, the stride prefetcher is not the best; you may try something
>> better: DCPT gives me better numbers.
>>
>> On Fri, Apr 15, 2022 at 4:57 AM Zicong Wang via gem5-users <
>> gem5-users@gem5.org> wrote:
>> Hi Jason,
>>
>>   We are testing the memory bandwidth program STREAM (
>> https://www.cs.virginia.edu/stream/), but the results show that the CPU
>> cannot fully utilize the DDR bandwidth: the achieved bandwidth is quite
>> low, about 1/10 of the peak bandwidth (peakBW in stats.txt). I tested
>> the STREAM binary on my x86 computer and got near-peak bandwidth, so I
>> believe the program is OK.
>>
>>   I've seen the mailing list thread
>> https://www.mail-archive.com/gem5-users@gem5.org/msg12965.html, and
>> I think I've run into a similar problem. So I tried the suggestions
>> proposed by Andreas, including enabling the L1/L2 prefetchers and using
>> the ARM detailed CPU. Although these methods improve the bandwidth, the
>> results show they have limited effect. Besides, I've also tested the
>> STREAM program in FS mode with the x86 O3/Minor/TimingSimple CPUs, and
>> tested it in SE mode with the Ruby option, but all the results are
>> similar and there is no essential difference.
>>
>>   I guess this is a general problem in simulation with gem5. I'm
>> wondering whether this result is expected or whether something is wrong
>> with the system model.
>>
>>   Two of the experimental results are attached for reference:
>>
>> 1. X86 O3CPU, SE-mode, w/o l2 prefetcher:
>>
>> ./build/X86/gem5.opt --outdir=m5out-stream configs/example/se.py
>> --cpu-type=O3CPU --caches --l1d_size=256kB --l1i_size=256kB --l2cache
>> --l2_size=8MB --mem-type=DDR3_1600_8x8 -c ../stream/stream
>>
>> STREAM output:
>>
>> -------------------------------------------------------------
>> Function    Best Rate MB/s     Avg time     Min time     Max time
>> Copy:            1099.0     0.014559     0.014559     0.014559
>> Scale:           1089.7     0.014683     0.014683     0.014683
>> Add:             1213.0     0.019786     0.019786     0.019786
>> Triad:           1222.1     0.019639     0.019639     0.019639
>> -------------------------------------------------------------
>>
>> stats.txt (dram related):
>>
>> system.mem_ctrls.dram.bytesRead          238807808   # Total bytes read
>> (Byte)
>> system.mem_ctrls.dram.bytesWritten       121179776   # Total bytes
>> written (Byte)
>> system.mem_ctrls.dram.avgRdBW           718.689026   # Average DRAM read
>> bandwidth in MiBytes/s ((Byte/Second))
>> system.mem_ctrls.dram.avgWrBW           364.688977   # Average DRAM write
>> bandwidth in MiBytes/s ((Byte/Second))
>> system.mem_ctrls.dram.peakBW              12800.00   # Theoretical peak
>> bandwidth in MiByte/s ((Byte/Second))
>> system.mem_ctrls.dram.busUtil                 8.46   # Data bus
>> utilization in percentage (Ratio)
>> system.mem_ctrls.dram.busUtilRead             5.61   # Data bus
>> utilization in percentage for reads (Ratio)
>> system.mem_ctrls.dram.busUtilWrite            2.85   # Data bus
>> utilization in percentage for writes (Ratio)
>> system.mem_ctrls.dram.pageHitRate            40.57   # Row buffer hit
>> rate, read and write combined (Ratio)
>>
>>
>>
>> 2. X86 O3CPU, SE-mode, w/ l2 prefetcher:
>>
>> ./build/X86/gem5.opt --outdir=m5out-stream-l2hwp configs/example/se.py
>> --cpu-type=O3CPU --caches --l1d_size=256kB --l1i_size=256kB --l2cache
>> --l2_size=8MB --l2-hwp-type=StridePrefetcher --mem-type=DDR3_1600_8x8 -c
>> ../stream/stream
>>
>> STREAM output:
>>
>> -------------------------------------------------------------
>> Function    Best Rate MB/s     Avg time     Min time     Max time
>> Copy:            1703.9     0.009390     0.009390     0.009390
>> Scale:           1718.6     0.009310     0.009310     0.009310
>> Add:             2087.3     0.011498     0.011498     0.011498
>> Triad:           2227.2     0.010776     0.010776     0.010776
>> -------------------------------------------------------------
>> stats.txt (dram related):
>>
>> system.mem_ctrls.dram.bytesRead          238811712   # Total bytes read
>> (Byte)
>> system.mem_ctrls.dram.bytesWritten       121179840   # Total bytes
>> written (Byte)
>> system.mem_ctrls.dram.avgRdBW          1014.129912   # Average DRAM read
>> bandwidth in MiBytes/s ((Byte/Second))
>> system.mem_ctrls.dram.avgWrBW           514.598298   # Average DRAM write
>> bandwidth in MiBytes/s ((Byte/Second))
>> system.mem_ctrls.dram.peakBW              12800.00   # Theoretical peak
>> bandwidth in MiByte/s ((Byte/Second))
>> system.mem_ctrls.dram.busUtil                11.94   # Data bus
>> utilization in percentage (Ratio)
>> system.mem_ctrls.dram.busUtilRead             7.92   # Data bus
>> utilization in percentage for reads (Ratio)
>> system.mem_ctrls.dram.busUtilWrite            4.02   # Data bus
>> utilization in percentage for writes (Ratio)
>> system.mem_ctrls.dram.pageHitRate            75.37   # Row buffer hit
>> rate, read and write combined (Ratio)
>>
>>
>>
>> STREAM compiling options:
>>
>> gcc -O2 -static -DSTREAM_ARRAY_SIZE=1000000 -DNTIMES=2 stream.c -o stream
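>>
>> (With STREAM_ARRAY_SIZE=1000000, each of the three arrays is 1,000,000 x
>> 8 bytes ≈ 7.6 MiB, so the roughly 22.9 MiB working set does not fit in
>> the 8 MB L2.)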
>>
>> All the experiments were performed on the latest stable
>> version (141cc37c2d4b93959d4c249b8f7e6a8b2ef75338, v21.2.1).
>>
>>   Thank you very much!
>>
>>
>>
>> Best Regards,
>>
>> Zicong
>>
>>
>>