I think it is hard to reach real-machine bandwidth in simulation. Looking at your stats, though, I see that lsqFullEvents is high. You can make the CPU more aggressive: increasing the load/store queue size and the ROB depth are the minimal changes you can make. I usually use a ROB size of at least 256 or 320, and with that you can set the LSQ size to at least 1/4 of the ROB size.

For MSHRs, your numbers are good now. 10 is too few even for Intel machines; I found that they recently increased that to 16-20. The other thing you can try to set is the cache latencies; make sure they are reasonable. For the prefetcher, you can use IMPPrefetcher in addition to DCPT; it has a pretty aggressive stream prefetcher inside. Also, the DRAM address mapping is important; I do not remember what the default is for the mem type you are using.
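As a rough sketch of the ROB/LSQ sizing rule above, here is a small Python helper that builds an override string for the same --param option used in the se.py command line in this thread. The parameter names numROBEntries, LQEntries and SQEntries are my assumption for the O3 CPU in this gem5 version, and the helper name o3_param_string is hypothetical; please check the names against the O3 CPU Python file in your source tree before relying on them.

```python
# Sketch only: build a --param override string for configs/example/se.py.
# ASSUMPTION: numROBEntries, LQEntries and SQEntries are the O3 CPU parameter
# names in your gem5 version. o3_param_string is a hypothetical helper name.

def o3_param_string(rob_entries=256, lsq_divisor=4, cpu="system.cpu[0]"):
    """Return a ';'-separated override string sizing the ROB and the LSQ."""
    if rob_entries <= 0 or lsq_divisor <= 0:
        raise ValueError("rob_entries and lsq_divisor must be positive")
    # LSQ size of at least 1/lsq_divisor of the ROB, rounded up.
    lsq_entries = (rob_entries + lsq_divisor - 1) // lsq_divisor
    overrides = [
        f"{cpu}.numROBEntries={rob_entries}",
        f"{cpu}.LQEntries={lsq_entries}",
        f"{cpu}.SQEntries={lsq_entries}",
    ]
    return ";".join(overrides)


if __name__ == "__main__":
    # Print the string for a 256-entry ROB; pass it as --param="<string>".
    print(o3_param_string(256))
```

The printed string can be appended to the existing MSHR overrides inside the same --param="..." argument, separated by a semicolon.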
Majid

On Sat, Apr 16, 2022 at 2:12 AM 王子聪 <wangzic...@nudt.edu.cn> wrote:

> Hi Majid,
>
> Thanks for your suggestion! I checked the default number of MSHRs (in
> configs/common/Caches.py), and found that the default #MSHRs of L1/L2 are
> 4 and 20 respectively.
>
> According to the PACT’18 paper “Cimple: Instruction and Memory Level
> Parallelism: A DSL for Uncovering ILP and MLP”, “Modern processors
> typically have 6–10 L1 cache MSHRs”, and “Intel’s Haswell
> microarchitecture uses 10 L1 MSHRs (Line Fill Buffers) for handling
> outstanding L1 misses”. So I changed the L1 #MSHRs to 16 and the L2
> #MSHRs to 32 (which I think is enough to handle outstanding misses),
> and then changed the L1/L2 prefetcher type to DCPT. Then I got the STREAM
> output shown below:
>
> ./build/X86/gem5.opt configs/example/se.py --cpu-type=O3CPU --caches
> --l1d_size=256kB --l1i_size=256kB
> --param="system.cpu[0].dcache.mshrs=16;system.cpu[0].icache.mshrs=16;system.l2.mshrs=32"
> --l2cache --l2_size=8MB --l1i-hwp-type=DCPTPrefetcher
> --l1d-hwp-type=DCPTPrefetcher --l2-hwp-type=DCPTPrefetcher
> --mem-type=DDR3_1600_8x8 -c ../stream/stream
> -------------------------------------------------------------
> Function    Best Rate MB/s  Avg time     Min time     Max time
> Copy:            3479.8     0.004598     0.004598     0.004598
> Scale:           3554.0     0.004502     0.004502     0.004502
> Add:             4595.0     0.005223     0.005223     0.005223
> Triad:           4705.9     0.005100     0.005100     0.005100
> -------------------------------------------------------------
>
> The busUtil of DRAM also improved:
> -------------------------------------------------------------
> system.mem_ctrls.dram.bytesRead      239947840    # Total bytes read (Byte)
> system.mem_ctrls.dram.bytesWritten   121160640    # Total bytes written (Byte)
> system.mem_ctrls.dram.avgRdBW        1611.266685  # Average DRAM read bandwidth in MiBytes/s ((Byte/Second))
> system.mem_ctrls.dram.avgWrBW        813.602251   # Average DRAM write bandwidth in MiBytes/s ((Byte/Second))
> system.mem_ctrls.dram.peakBW         12800.00     # Theoretical peak bandwidth in MiByte/s ((Byte/Second))
> system.mem_ctrls.dram.busUtil        18.94        # Data bus utilization in percentage (Ratio)
> system.mem_ctrls.dram.busUtilRead    12.59        # Data bus utilization in percentage for reads (Ratio)
> system.mem_ctrls.dram.busUtilWrite   6.36         # Data bus utilization in percentage for writes (Ratio)
> system.mem_ctrls.dram.pageHitRate    89.16        # Row buffer hit rate, read and write combined (Ratio)
> -------------------------------------------------------------
>
> It’s indeed improving the achieved bandwidth, but it is still far
> from the peak bandwidth of DDR3_1600 (12800 MiB/s). stats.txt is uploaded
> for reference (
> https://gist.github.com/wzc314/cf29275f853ee0b2fcd865f9b492c355)
>
> Any idea is appreciated!
> Thank you in advance!
>
> Best,
> Zicong
>
>
>
> On Apr 16, 2022, at 00:08, Majid Jalili <majid...@gmail.com> wrote:
>
> Hi,
> Make sure your system has enough MSHRs; out of the box, L1 and L2 are set
> to have only a few MSHR entries.
> Also, the stride prefetcher is not the best; you may try something better:
> DCPT gives me better numbers.
>
> On Fri, Apr 15, 2022 at 4:57 AM Zicong Wang via gem5-users <
> gem5-users@gem5.org> wrote:
> Hi Jason,
>
> We are testing the memory bandwidth program STREAM (
> https://www.cs.virginia.edu/stream/), but the results show that the CPU
> cannot fully utilize the DDR bandwidth: the achieved bandwidth is quite
> low, about 1/10 of the peak bandwidth (peakBW in stats.txt). I tested
> the STREAM binary on my x86 computer and got near-peak bandwidth, so I
> believe the program is OK.
>
> I've seen the mailing list discussion
> https://www.mail-archive.com/gem5-users@gem5.org/msg12965.html, and
> I think I've met a similar problem. So I tried the suggestions proposed
> by Andreas, including enabling the l1/l2 prefetchers and using the ARM
> detailed CPU. Although these methods can improve the bandwidth, the results
> show they have limited effect. Besides, I've also tested the STREAM program in
> FS mode with the x86 O3/Minor/TimingSimple CPUs, and tested it in SE mode with
> the ruby option, but all the results are similar and there is no essential
> difference.
>
> I guess it is a general problem in simulation with gem5. I'm wondering
> whether the result is expected or there is something wrong with the system model.
>
> Two of the experimental results are attached for reference:
>
> 1. X86 O3CPU, SE-mode, w/o l2 prefetcher:
>
> ./build/X86/gem5.opt --outdir=m5out-stream configs/example/se.py
> --cpu-type=O3CPU --caches --l1d_size=256kB --l1i_size=256kB --l2cache
> --l2_size=8MB --mem-type=DDR3_1600_8x8 -c ../stream/stream
>
> STREAM output:
>
> -------------------------------------------------------------
> Function    Best Rate MB/s  Avg time     Min time     Max time
> Copy:            1099.0     0.014559     0.014559     0.014559
> Scale:           1089.7     0.014683     0.014683     0.014683
> Add:             1213.0     0.019786     0.019786     0.019786
> Triad:           1222.1     0.019639     0.019639     0.019639
> -------------------------------------------------------------
>
> stats.txt (dram related):
>
> system.mem_ctrls.dram.bytesRead      238807808    # Total bytes read (Byte)
> system.mem_ctrls.dram.bytesWritten   121179776    # Total bytes written (Byte)
> system.mem_ctrls.dram.avgRdBW        718.689026   # Average DRAM read bandwidth in MiBytes/s ((Byte/Second))
> system.mem_ctrls.dram.avgWrBW        364.688977   # Average DRAM write bandwidth in MiBytes/s ((Byte/Second))
> system.mem_ctrls.dram.peakBW         12800.00     # Theoretical peak bandwidth in MiByte/s ((Byte/Second))
> system.mem_ctrls.dram.busUtil        8.46         # Data bus utilization in percentage (Ratio)
> system.mem_ctrls.dram.busUtilRead    5.61         # Data bus utilization in percentage for reads (Ratio)
> system.mem_ctrls.dram.busUtilWrite   2.85         # Data bus utilization in percentage for writes (Ratio)
> system.mem_ctrls.dram.pageHitRate    40.57        # Row buffer hit rate, read and write combined (Ratio)
>
>
>
> 2. X86 O3CPU, SE-mode, w/ l2 prefetcher:
>
> ./build/X86/gem5.opt --outdir=m5out-stream-l2hwp configs/example/se.py
> --cpu-type=O3CPU --caches --l1d_size=256kB --l1i_size=256kB --l2cache
> --l2_size=8MB --l2-hwp-typ=StridePrefetcher --mem-type=DDR3_1600_8x8 -c
> ../stream/stream
>
> STREAM output:
>
> -------------------------------------------------------------
> Function    Best Rate MB/s  Avg time     Min time     Max time
> Copy:            1703.9     0.009390     0.009390     0.009390
> Scale:           1718.6     0.009310     0.009310     0.009310
> Add:             2087.3     0.011498     0.011498     0.011498
> Triad:           2227.2     0.010776     0.010776     0.010776
> -------------------------------------------------------------
> stats.txt (dram related):
>
> system.mem_ctrls.dram.bytesRead      238811712    # Total bytes read (Byte)
> system.mem_ctrls.dram.bytesWritten   121179840    # Total bytes written (Byte)
> system.mem_ctrls.dram.avgRdBW        1014.129912  # Average DRAM read bandwidth in MiBytes/s ((Byte/Second))
> system.mem_ctrls.dram.avgWrBW        514.598298   # Average DRAM write bandwidth in MiBytes/s ((Byte/Second))
> system.mem_ctrls.dram.peakBW         12800.00     # Theoretical peak bandwidth in MiByte/s ((Byte/Second))
> system.mem_ctrls.dram.busUtil        11.94        # Data bus utilization in percentage (Ratio)
> system.mem_ctrls.dram.busUtilRead    7.92         # Data bus utilization in percentage for reads (Ratio)
> system.mem_ctrls.dram.busUtilWrite   4.02         # Data bus utilization in percentage for writes (Ratio)
> system.mem_ctrls.dram.pageHitRate    75.37        # Row buffer hit rate, read and write combined (Ratio)
>
>
>
> STREAM compiling options:
>
> gcc -O2 -static -DSTREAM_ARRAY_SIZE=1000000 -DNTIMES=2 stream.c -o stream
>
> All the experiments are performed on the latest stable
> version (141cc37c2d4b93959d4c249b8f7e6a8b2ef75338, v21.2.1).
>
> Thank you very much!
>
>
>
> Best Regards,
>
> Zicong
>
>
>
> _______________________________________________
> gem5-users mailing list -- gem5-users@gem5.org
> To unsubscribe send an email to gem5-users-le...@gem5.org
> %(web_page_url)slistinfo%(cgiext)s/%(_internal_name)s
>
>
>