[gem5-users] Re: Low memory bandwidth achieved with STREAM benchmark

王子聪 via gem5-users Sat, 16 Apr 2022 00:14:41 -0700

Hi Majid,

Thanks for your suggestion! I checked the default number of MSHRs (in 
configs/common/Caches.py) and found that the defaults for L1 and L2 are 4 and 
20, respectively.


According to the PACT'18 paper "Cimple: Instruction and Memory Level 
Parallelism: A DSL for Uncovering ILP and MLP", "Modern processors typically 
have 6–10 L1 cache MSHRs", and "Intel's Haswell microarchitecture uses 10 L1 
MSHRs (Line Fill Buffers) for handling outstanding L1 misses". So I changed 
the number of L1 MSHRs to 16 and L2 MSHRs to 32 (which I think is enough to 
handle the outstanding misses), and then changed the L1/L2 prefetcher type to 
DCPT. The command and the resulting STREAM output are shown below:

./build/X86/gem5.opt configs/example/se.py --cpu-type=O3CPU --caches \
    --l1d_size=256kB --l1i_size=256kB \
    --param="system.cpu[0].dcache.mshrs=16;system.cpu[0].icache.mshrs=16;system.l2.mshrs=32" \
    --l2cache --l2_size=8MB --l1i-hwp-type=DCPTPrefetcher \
    --l1d-hwp-type=DCPTPrefetcher --l2-hwp-type=DCPTPrefetcher \
    --mem-type=DDR3_1600_8x8 -c ../stream/stream
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            3479.8     0.004598     0.004598     0.004598
Scale:           3554.0     0.004502     0.004502     0.004502
Add:             4595.0     0.005223     0.005223     0.005223
Triad:           4705.9     0.005100     0.005100     0.005100
-------------------------------------------------------------
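
As a sanity check on how the "Best Rate" column relates to the "Min time" column, here is a minimal sketch. It assumes STREAM's usual accounting (8-byte double elements; Copy and Scale touch two arrays per iteration, Add and Triad touch three; 1 MB = 10^6 bytes) and the array size from the compile options quoted later in this thread (-DSTREAM_ARRAY_SIZE=1000000). The function and variable names are only illustrative.

```python
# Recompute STREAM "Best Rate" from "Min time", under the assumptions above.
ARRAY_SIZE = 1_000_000   # from -DSTREAM_ARRAY_SIZE=1000000
BYTES_PER_ELEM = 8       # assumed: STREAM's default element type is double

# Number of arrays each kernel reads or writes per iteration.
ARRAYS_TOUCHED = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}

# "Min time" column from the output above, in seconds.
MIN_TIME_S = {
    "Copy": 0.004598,
    "Scale": 0.004502,
    "Add": 0.005223,
    "Triad": 0.005100,
}


def stream_rate_mb_s(kernel, seconds, n=ARRAY_SIZE):
    """Rate in MB/s (10^6 bytes/s) for one kernel, as STREAM counts bytes."""
    bytes_moved = ARRAYS_TOUCHED[kernel] * BYTES_PER_ELEM * n
    return 1.0e-6 * bytes_moved / seconds


rates = {k: stream_rate_mb_s(k, t) for k, t in MIN_TIME_S.items()}
for kernel, rate in rates.items():
    print(f"{kernel:6s} {rate:8.1f} MB/s")
```

The recomputed rates agree with the reported column to within the rounding of the printed times. Note that this accounting does not include any extra read traffic caused by write-allocate on the destination array, so the DRAM-side byte counts can be higher than what STREAM reports.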

The DRAM bus utilization (busUtil) also improved:
-------------------------------------------------------------
system.mem_ctrls.dram.bytesRead          239947840  # Total bytes read (Byte)
system.mem_ctrls.dram.bytesWritten       121160640  # Total bytes written (Byte)
system.mem_ctrls.dram.avgRdBW          1611.266685  # Average DRAM read bandwidth in MiBytes/s ((Byte/Second))
system.mem_ctrls.dram.avgWrBW           813.602251  # Average DRAM write bandwidth in MiBytes/s ((Byte/Second))
system.mem_ctrls.dram.peakBW              12800.00  # Theoretical peak bandwidth in MiByte/s ((Byte/Second))
system.mem_ctrls.dram.busUtil                18.94  # Data bus utilization in percentage (Ratio)
system.mem_ctrls.dram.busUtilRead            12.59  # Data bus utilization in percentage for reads (Ratio)
system.mem_ctrls.dram.busUtilWrite            6.36  # Data bus utilization in percentage for writes (Ratio)
system.mem_ctrls.dram.pageHitRate            89.16  # Row buffer hit rate, read and write combined (Ratio)
-------------------------------------------------------------
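
The utilization figures above appear consistent with busUtil being the sum of the average read and write bandwidth divided by peakBW. A minimal sketch that parses a few stats.txt lines and checks this; the parsing helper is hypothetical and only handles the simple "name value # comment" layout shown here:

```python
import re

# A few lines copied from the stats above (comments omitted).
STATS_TEXT = """\
system.mem_ctrls.dram.avgRdBW          1611.266685
system.mem_ctrls.dram.avgWrBW           813.602251
system.mem_ctrls.dram.peakBW              12800.00
system.mem_ctrls.dram.busUtil                18.94
"""


def parse_stats(text):
    """Map the last component of each stat name to its numeric value."""
    stats = {}
    for line in text.splitlines():
        match = re.match(r"^(\S+)\s+([-+0-9.eE]+)", line)
        if match:
            short_name = match.group(1).rsplit(".", 1)[-1]
            stats[short_name] = float(match.group(2))
    return stats


def bus_util_percent(stats):
    """Utilization implied by the bandwidth stats, in percent."""
    return 100.0 * (stats["avgRdBW"] + stats["avgWrBW"]) / stats["peakBW"]


stats = parse_stats(STATS_TEXT)
implied = bus_util_percent(stats)
print(f"implied busUtil = {implied:.2f}%, reported = {stats['busUtil']:.2f}%")
```

The same check can be pointed at the earlier runs quoted below to compare how much each configuration change moves the utilization.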

This does improve the achieved bandwidth, but it is still far from the peak 
bandwidth of DDR3_1600 (12800 MiB/s). stats.txt is uploaded for reference 
(https://gist.github.com/wzc314/cf29275f853ee0b2fcd865f9b492c355).
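
One rough way to reason about why the MSHR count matters is Little's law: the bandwidth a cache can sustain is bounded by (outstanding misses x line size) / miss latency. The sketch below assumes 64-byte cache lines and a purely hypothetical average miss latency of 100 ns; the latency is not measured from these runs, so the numbers are illustrative only.

```python
# Little's-law upper bound on miss bandwidth for one cache level.
LINE_BYTES = 64            # assumed cache line size
ASSUMED_LATENCY_S = 100e-9 # hypothetical average miss latency, NOT measured


def max_bandwidth_mb_s(mshrs, line_bytes=LINE_BYTES, latency_s=ASSUMED_LATENCY_S):
    """Upper bound in MB/s (10^6 bytes/s) with `mshrs` misses in flight."""
    return mshrs * line_bytes / latency_s / 1.0e6


def lines_in_flight_needed(target_mb_s, line_bytes=LINE_BYTES,
                           latency_s=ASSUMED_LATENCY_S):
    """Average number of in-flight lines needed to sustain a target rate."""
    return target_mb_s * 1.0e6 * latency_s / line_bytes


for mshrs in (4, 16, 20, 32):
    print(f"{mshrs:2d} MSHRs -> at most {max_bandwidth_mb_s(mshrs):8.1f} MB/s")

print(f"lines in flight for 12800 MB/s: {lines_in_flight_needed(12800.0):.1f}")
```

This is only a bound: prefetch requests also occupy MSHRs, and other limits (for example load/store queue depth or the memory controller queues) may bind first. Plugging in a latency taken from the actual stats.txt would make the estimate more meaningful.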

Any idea is appreciated!
Thank you in advance!

Best,
Zicong



> On Apr 16, 2022, at 00:08, Majid Jalili <majid...@gmail.com> wrote:
> 
> Hi,
> Make sure your system has enough MSHRs. Out of the box, L1 and L2 are set to 
> have only a few MSHR entries. 
> Also, the stride prefetcher is not the best; you may try something better. 
> DCPT gives me better numbers.
> 
> On Fri, Apr 15, 2022 at 4:57 AM Zicong Wang via gem5-users 
> <gem5-users@gem5.org> wrote:
> Hi Jason,
> 
>   We are testing the memory bandwidth benchmark STREAM 
> (https://www.cs.virginia.edu/stream/), but the results show that the CPU 
> cannot fully utilize the DDR bandwidth: the achieved bandwidth is quite 
> low, about 1/10 of the peak bandwidth (peakBW in stats.txt). I tested the 
> STREAM binary on my x86 computer and got near-peak bandwidth, so I 
> believe the program is OK.
> 
>   I've seen the mailing-list thread 
> https://www.mail-archive.com/gem5-users@gem5.org/msg12965.html, and I think 
> I've hit a similar problem. So I tried the suggestions proposed by 
> Andreas, including enabling the L1/L2 prefetchers and using the ARM detailed 
> CPU. Although these methods improve the bandwidth, the effect is 
> limited. Besides, I've also tested the STREAM program in FS mode 
> with the x86 O3/Minor/TimingSimple CPUs, and in SE mode with the Ruby 
> option, but all the results are similar, with no essential difference.
> 
>   I guess it is a general problem in simulation with gem5. I'm wondering 
> whether the result is expected or whether something is wrong with the 
> system model.
> 
>   Two of the experimental results are attached for reference:
> 
> 1. X86 O3CPU, SE-mode, w/o l2 prefetcher:
> 
> ./build/X86/gem5.opt --outdir=m5out-stream configs/example/se.py 
> --cpu-type=O3CPU --caches --l1d_size=256kB --l1i_size=256kB --l2cache 
> --l2_size=8MB --mem-type=DDR3_1600_8x8 -c ../stream/stream
> 
> STREAM output:
> 
> -------------------------------------------------------------
> Function    Best Rate MB/s     Avg time     Min time     Max time
> Copy:            1099.0     0.014559     0.014559     0.014559
> Scale:           1089.7     0.014683     0.014683     0.014683
> Add:             1213.0     0.019786     0.019786     0.019786
> Triad:           1222.1     0.019639     0.019639     0.019639
> -------------------------------------------------------------
> 
> stats.txt (dram related):
> 
> system.mem_ctrls.dram.bytesRead          238807808   # Total bytes read (Byte)
> system.mem_ctrls.dram.bytesWritten       121179776   # Total bytes written (Byte)
> system.mem_ctrls.dram.avgRdBW           718.689026   # Average DRAM read bandwidth in MiBytes/s ((Byte/Second))
> system.mem_ctrls.dram.avgWrBW           364.688977   # Average DRAM write bandwidth in MiBytes/s ((Byte/Second))
> system.mem_ctrls.dram.peakBW              12800.00   # Theoretical peak bandwidth in MiByte/s ((Byte/Second))
> system.mem_ctrls.dram.busUtil                 8.46   # Data bus utilization in percentage (Ratio)
> system.mem_ctrls.dram.busUtilRead             5.61   # Data bus utilization in percentage for reads (Ratio)
> system.mem_ctrls.dram.busUtilWrite            2.85   # Data bus utilization in percentage for writes (Ratio)
> system.mem_ctrls.dram.pageHitRate            40.57   # Row buffer hit rate, read and write combined (Ratio)
> 
> 
> 
> 2. X86 O3CPU, SE-mode, w/ l2 prefetcher:
> 
> ./build/X86/gem5.opt --outdir=m5out-stream-l2hwp configs/example/se.py 
> --cpu-type=O3CPU --caches --l1d_size=256kB --l1i_size=256kB --l2cache 
> --l2_size=8MB --l2-hwp-type=StridePrefetcher --mem-type=DDR3_1600_8x8 -c 
> ../stream/stream 
> 
> STREAM output:
> 
> -------------------------------------------------------------
> Function    Best Rate MB/s     Avg time     Min time     Max time
> Copy:            1703.9     0.009390     0.009390     0.009390
> Scale:           1718.6     0.009310     0.009310     0.009310
> Add:             2087.3     0.011498     0.011498     0.011498
> Triad:           2227.2     0.010776     0.010776     0.010776
> -------------------------------------------------------------
> stats.txt (dram related):
> 
> system.mem_ctrls.dram.bytesRead          238811712   # Total bytes read (Byte)
> system.mem_ctrls.dram.bytesWritten       121179840   # Total bytes written (Byte)
> system.mem_ctrls.dram.avgRdBW          1014.129912   # Average DRAM read bandwidth in MiBytes/s ((Byte/Second))
> system.mem_ctrls.dram.avgWrBW           514.598298   # Average DRAM write bandwidth in MiBytes/s ((Byte/Second))
> system.mem_ctrls.dram.peakBW              12800.00   # Theoretical peak bandwidth in MiByte/s ((Byte/Second))
> system.mem_ctrls.dram.busUtil                11.94   # Data bus utilization in percentage (Ratio)
> system.mem_ctrls.dram.busUtilRead             7.92   # Data bus utilization in percentage for reads (Ratio)
> system.mem_ctrls.dram.busUtilWrite            4.02   # Data bus utilization in percentage for writes (Ratio)
> system.mem_ctrls.dram.pageHitRate            75.37   # Row buffer hit rate, read and write combined (Ratio)
> 
> 
> 
> STREAM compiling options:
> 
> gcc -O2 -static -DSTREAM_ARRAY_SIZE=1000000 -DNTIMES=2 stream.c -o stream
> 
> All the experiments are performed on the latest stable version 
> (141cc37c2d4b93959d4c249b8f7e6a8b2ef75338, v21.2.1).
> 
>   Thank you very much!
> 
> 
> 
> Best Regards,
> 
> Zicong
> 
> 
> 
> _______________________________________________
> gem5-users mailing list -- gem5-users@gem5.org
> To unsubscribe send an email to gem5-users-le...@gem5.org