I think it is hard to get to a real machine level in terms of BW. But by
looking at your stats, I found that lsqFullEvents is high.
You can make the CPU more aggressive; increasing the load/store queue size
and the ROB depth are the minimal changes you can make. I usually use ROB
sizes of at least 256 or 320. With that, you may set the LSQ size to at
least 1/4 of the ROB size.
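For example, with se.py you can override these through --param. This is
only a sketch that I have not run, but the stock O3CPU parameter names
should be numROBEntries, LQEntries, and SQEntries:

./build/X86/gem5.opt configs/example/se.py --cpu-type=O3CPU --caches \
    --l2cache --l2_size=8MB --mem-type=DDR3_1600_8x8 \
    --param="system.cpu[0].numROBEntries=320;system.cpu[0].LQEntries=80;system.cpu[0].SQEntries=80" \
    -c ../stream/stream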
For MSHRs, your numbers are good now; 10 is too few even in Intel
machines, and I found that they recently increased that to 16-20.
The other thing you can try to set is the cache latencies; make sure that
they are reasonable.
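If you want to experiment with them, the BaseCache latencies are exposed
as tag_latency, data_latency, and response_latency (in cycles), so you can
add something like the line below to your se.py command (or merge it into
an existing --param string). The values here only illustrate the syntax,
they are not a recommendation:

--param="system.cpu[0].dcache.tag_latency=2;system.cpu[0].dcache.data_latency=2;system.l2.tag_latency=10;system.l2.data_latency=10"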
For the prefetcher, you can use IMPPrefetcher in addition to DCPT; it has
a pretty aggressive stream prefetcher inside.
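If it is in your build, you select it the same way you selected DCPT; I
believe the class in the gem5 tree is called IndirectMemoryPrefetcher
(that is the IMP design), e.g.:

--l2-hwp-type=IndirectMemoryPrefetcher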
Also, the DRAM address mapping is important; I do not remember what the
default is for the mem type you are using.
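You can check and override it with --param as well. The parameter on the
DRAM interface is addr_mapping, and if I remember right the legal values
are RoRaBaChCo, RoRaBaCoCh, and RoCoRaBaCh; RoCoRaBaCh should spread
consecutive lines across channels and banks, which usually helps a
streaming workload:

--param="system.mem_ctrls.dram.addr_mapping='RoCoRaBaCh'"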

Majid



On Sat, Apr 16, 2022 at 2:12 AM 王子聪 <wangzic...@nudt.edu.cn> wrote:

> Hi Majid,
>
> Thanks for your suggestion! I checked the default number of MSHRs (in
> configs/common/Caches.py) and found that the default L1 and L2 MSHR
> counts are 4 and 20, respectively.
>
> The PACT'18 paper "Cimple: Instruction and Memory Level Parallelism: A
> DSL for Uncovering ILP and MLP" says that "Modern processors typically
> have 6–10 L1 cache MSHRs" and that "Intel's Haswell microarchitecture
> uses 10 L1 MSHRs (Line Fill Buffers) for handling outstanding L1
> misses". So I changed the L1 MSHRs to 16 and the L2 MSHRs to 32 (which I
> think is enough to handle the outstanding misses), and then changed the
> L1/L2 prefetcher type to DCPT. Then I got the STREAM output shown below:
>
> ./build/X86/gem5.opt configs/example/se.py --cpu-type=O3CPU --caches \
>     --l1d_size=256kB --l1i_size=256kB \
>     --param="system.cpu[0].dcache.mshrs=16;system.cpu[0].icache.mshrs=16;system.l2.mshrs=32" \
>     --l2cache --l2_size=8MB --l1i-hwp-type=DCPTPrefetcher \
>     --l1d-hwp-type=DCPTPrefetcher --l2-hwp-type=DCPTPrefetcher \
>     --mem-type=DDR3_1600_8x8 -c ../stream/stream
> -------------------------------------------------------------
> Function    Best Rate MB/s  Avg time     Min time     Max time
> Copy:            3479.8     0.004598     0.004598     0.004598
> Scale:           3554.0     0.004502     0.004502     0.004502
> Add:             4595.0     0.005223     0.005223     0.005223
> Triad:           4705.9     0.005100     0.005100     0.005100
> -------------------------------------------------------------
>
> The busutil of DRAM also improved:
> -------------------------------------------------------------
> system.mem_ctrls.dram.bytesRead          239947840  # Total bytes read (Byte)
> system.mem_ctrls.dram.bytesWritten       121160640  # Total bytes written (Byte)
> system.mem_ctrls.dram.avgRdBW          1611.266685  # Average DRAM read bandwidth in MiBytes/s ((Byte/Second))
> system.mem_ctrls.dram.avgWrBW           813.602251  # Average DRAM write bandwidth in MiBytes/s ((Byte/Second))
> system.mem_ctrls.dram.peakBW              12800.00  # Theoretical peak bandwidth in MiByte/s ((Byte/Second))
> system.mem_ctrls.dram.busUtil                18.94  # Data bus utilization in percentage (Ratio)
> system.mem_ctrls.dram.busUtilRead            12.59  # Data bus utilization in percentage for reads (Ratio)
> system.mem_ctrls.dram.busUtilWrite            6.36  # Data bus utilization in percentage for writes (Ratio)
> system.mem_ctrls.dram.pageHitRate            89.16  # Row buffer hit rate, read and write combined (Ratio)
> -------------------------------------------------------------
>
> It indeed improves the achieved bandwidth, but it is still rather far
> from the peak bandwidth of DDR3_1600 (12800 MiB/s). The stats.txt is
> uploaded for reference (
> https://gist.github.com/wzc314/cf29275f853ee0b2fcd865f9b492c355).
>
> Any idea is appreciated!
> Thank you in advance!
>
> Bests,
> Zicong
>
>
>
> On Apr 16, 2022, at 00:08, Majid Jalili <majid...@gmail.com> wrote:
>
> Hi,
> Make sure your system has enough MSHRs; out of the box, L1 and L2 are
> set to have only a few MSHR entries.
> Also, the stride prefetcher is not the best, so you may try something
> better: DCPT gives me better numbers.
>
> On Fri, Apr 15, 2022 at 4:57 AM Zicong Wang via gem5-users <
> gem5-users@gem5.org> wrote:
> Hi Jason,
>
>   We are testing the memory bandwidth program STREAM (
> https://www.cs.virginia.edu/stream/), but the results show that the CPU
> cannot fully utilize the DDR bandwidth: the achieved bandwidth is quite
> low, about 1/10 of the peak bandwidth (peakBW in stats.txt). I tested the
> STREAM binary on my x86 computer and got near-peak bandwidth, so I
> believe the program is OK.
>
>   I've seen the mailing list thread
> https://www.mail-archive.com/gem5-users@gem5.org/msg12965.html, and I
> think I've hit a similar problem. So I tried the suggestions proposed by
> Andreas, including enabling the L1/L2 prefetchers and using the detailed
> ARM CPU. Although these methods improve the bandwidth, the results show
> that the effect is limited. Besides, I've also tested the STREAM program
> in FS mode with the x86 O3/Minor/TimingSimple CPUs, and tested it in SE
> mode with the Ruby option, but all the results are similar, with no
> essential difference.
>
>   I guess this is a general problem of simulation with gem5. I'm
> wondering whether this result is expected or whether something is wrong
> with the system model?
>
>   Two of the experimental results are attached for reference:
>
> 1. X86 O3CPU, SE-mode, w/o l2 prefetcher:
>
> ./build/X86/gem5.opt --outdir=m5out-stream configs/example/se.py \
>     --cpu-type=O3CPU --caches --l1d_size=256kB --l1i_size=256kB --l2cache \
>     --l2_size=8MB --mem-type=DDR3_1600_8x8 -c ../stream/stream
>
> STREAM output:
>
> -------------------------------------------------------------
> Function    Best Rate MB/s     Avg time     Min time     Max time
> Copy:            1099.0     0.014559     0.014559     0.014559
> Scale:           1089.7     0.014683     0.014683     0.014683
> Add:             1213.0     0.019786     0.019786     0.019786
> Triad:           1222.1     0.019639     0.019639     0.019639
> -------------------------------------------------------------
>
> stats.txt (dram related):
>
> system.mem_ctrls.dram.bytesRead          238807808   # Total bytes read (Byte)
> system.mem_ctrls.dram.bytesWritten       121179776   # Total bytes written (Byte)
> system.mem_ctrls.dram.avgRdBW           718.689026   # Average DRAM read bandwidth in MiBytes/s ((Byte/Second))
> system.mem_ctrls.dram.avgWrBW           364.688977   # Average DRAM write bandwidth in MiBytes/s ((Byte/Second))
> system.mem_ctrls.dram.peakBW              12800.00   # Theoretical peak bandwidth in MiByte/s ((Byte/Second))
> system.mem_ctrls.dram.busUtil                 8.46   # Data bus utilization in percentage (Ratio)
> system.mem_ctrls.dram.busUtilRead             5.61   # Data bus utilization in percentage for reads (Ratio)
> system.mem_ctrls.dram.busUtilWrite            2.85   # Data bus utilization in percentage for writes (Ratio)
> system.mem_ctrls.dram.pageHitRate            40.57   # Row buffer hit rate, read and write combined (Ratio)
>
>
>
> 2. X86 O3CPU, SE-mode, w/ l2 prefetcher:
>
> ./build/X86/gem5.opt --outdir=m5out-stream-l2hwp configs/example/se.py \
>     --cpu-type=O3CPU --caches --l1d_size=256kB --l1i_size=256kB --l2cache \
>     --l2_size=8MB --l2-hwp-type=StridePrefetcher --mem-type=DDR3_1600_8x8 \
>     -c ../stream/stream
>
> STREAM output:
>
> -------------------------------------------------------------
> Function    Best Rate MB/s     Avg time     Min time     Max time
> Copy:            1703.9     0.009390     0.009390     0.009390
> Scale:           1718.6     0.009310     0.009310     0.009310
> Add:             2087.3     0.011498     0.011498     0.011498
> Triad:           2227.2     0.010776     0.010776     0.010776
> -------------------------------------------------------------
> stats.txt (dram related):
>
> system.mem_ctrls.dram.bytesRead          238811712   # Total bytes read (Byte)
> system.mem_ctrls.dram.bytesWritten       121179840   # Total bytes written (Byte)
> system.mem_ctrls.dram.avgRdBW          1014.129912   # Average DRAM read bandwidth in MiBytes/s ((Byte/Second))
> system.mem_ctrls.dram.avgWrBW           514.598298   # Average DRAM write bandwidth in MiBytes/s ((Byte/Second))
> system.mem_ctrls.dram.peakBW              12800.00   # Theoretical peak bandwidth in MiByte/s ((Byte/Second))
> system.mem_ctrls.dram.busUtil                11.94   # Data bus utilization in percentage (Ratio)
> system.mem_ctrls.dram.busUtilRead             7.92   # Data bus utilization in percentage for reads (Ratio)
> system.mem_ctrls.dram.busUtilWrite            4.02   # Data bus utilization in percentage for writes (Ratio)
> system.mem_ctrls.dram.pageHitRate            75.37   # Row buffer hit rate, read and write combined (Ratio)
>
>
>
> STREAM compiling options:
>
> gcc -O2 -static -DSTREAM_ARRAY_SIZE=1000000 -DNTIMES=2 stream.c -o stream
>
> All the experiments were performed on the latest stable
> version (141cc37c2d4b93959d4c249b8f7e6a8b2ef75338, v21.2.1).
>
>   Thank you very much!
>
>
>
> Best Regards,
>
> Zicong
>
>
>