[gem5-users] Re: Low memory bandwidth achieved with STREAM benchmark

2022-04-15 Thread Majid Jalili via gem5-users
Hi,
Make sure your system has enough MSHRs. Out of the box, the L1 and L2 caches
are configured with only a few MSHR entries.
Also, the stride prefetcher is not the best choice; you may want to try
something better. DCPT gives me better numbers.
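
A rough way to see why the MSHR count matters is a Little's-law bound: a cache with N MSHRs can have at most N line fills in flight, so sustained bandwidth cannot exceed N x line size / miss latency. The sketch below is only an estimate. The 64-byte line, the MSHR counts, and the 100 ns round-trip latency are assumptions chosen for illustration, not values measured from this run.

```python
# Little's-law upper bound on bandwidth through a cache with a fixed
# number of MSHRs. All inputs below are illustrative assumptions.

def mshr_bandwidth_bound(mshrs: int, line_bytes: int, latency_ns: float) -> float:
    """Return the bandwidth ceiling in MB/s (1 MB = 1e6 bytes)."""
    bytes_in_flight = mshrs * line_bytes
    # bytes per nanosecond equals GB/s; multiply by 1000 to get MB/s
    return bytes_in_flight / latency_ns * 1000.0

LINE_BYTES = 64       # assumed cache line size
LATENCY_NS = 100.0    # assumed miss round-trip latency

for mshrs in (4, 20, 64):
    bound = mshr_bandwidth_bound(mshrs, LINE_BYTES, LATENCY_NS)
    print(f"{mshrs:3d} MSHRs -> at most {bound:8.1f} MB/s")
```

Under these assumptions a cache with 4 MSHRs tops out at roughly 2.6 GB/s no matter how fast the DRAM is, so it is worth checking the actual MSHR counts and latencies in your configuration before blaming the memory model.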


[gem5-users] Low memory bandwidth achieved with STREAM benchmark

2022-04-15 Thread Zicong Wang via gem5-users
Hi Jason,

  We are testing the memory bandwidth benchmark STREAM
(https://www.cs.virginia.edu/stream/), but the results show that the CPU
cannot fully utilize the DDR bandwidth: the achieved bandwidth is only about
1/10 of the peak bandwidth (peakBW in stats.txt). I ran the same STREAM
binary on my x86 machine and got close to peak bandwidth, so I believe the
program itself is fine.

  I've seen the mailing list thread
https://www.mail-archive.com/gem5-users@gem5.org/msg12965.html, and I think
I have hit a similar problem. I tried the suggestions proposed by Andreas,
including enabling the L1/L2 prefetchers and using the ARM detailed CPU.
These changes do improve the bandwidth, but only to a limited extent. I have
also run STREAM in FS mode with the x86 O3, Minor, and TimingSimple CPUs, and
in SE mode with the Ruby option, but the results are all similar and there is
no essential difference.

  I suspect this is a general issue when simulating with gem5. Is this result
expected, or is something wrong with the system model?

  Two of the experimental results are attached for reference:

1. X86 O3CPU, SE-mode, w/o l2 prefetcher:

./build/X86/gem5.opt --outdir=m5out-stream configs/example/se.py 
--cpu-type=O3CPU --caches --l1d_size=256kB --l1i_size=256kB --l2cache 
--l2_size=8MB --mem-type=DDR3_1600_8x8 -c ../stream/stream

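STREAM output:

-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            1099.0     0.014559     0.014559     0.014559
Scale:           1089.7     0.014683     0.014683     0.014683
Add:             1213.0     0.019786     0.019786     0.019786
Triad:           1222.1     0.019639     0.019639     0.019639
-------------------------------------------------------------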
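
As a sanity check on the table, the amount of data each kernel moved can be recovered from the reported rate and time, since STREAM computes the rate as 1e-6 x bytes / min time and counts 2 arrays for Copy and Scale and 3 arrays for Add and Triad. The sketch below assumes 8-byte (double) elements, which is STREAM's default element type; the variable names are just for illustration.

```python
# Recover bytes moved and the implied array length from the STREAM table.
# Assumes 8-byte elements (STREAM's default type is double).
DOUBLE_BYTES = 8

# (kernel, best rate in MB/s, min time in s, arrays counted per iteration)
rows = [
    ("Copy",  1099.0, 0.014559, 2),
    ("Scale", 1089.7, 0.014683, 2),
    ("Add",   1213.0, 0.019786, 3),
    ("Triad", 1222.1, 0.019639, 3),
]

implied_elements = {}
for name, rate_mb_s, min_time_s, arrays in rows:
    bytes_moved = rate_mb_s * 1e6 * min_time_s
    implied_elements[name] = bytes_moved / (arrays * DOUBLE_BYTES)
    print(f"{name:6s} {bytes_moved / 1e6:6.2f} MB moved, "
          f"~{implied_elements[name]:,.0f} elements per array")
```

All four rows come out at roughly 1,000,000 elements per array, i.e. about 8 MB per array and 24 MB in total, so the working set should not fit in the 8 MB L2 and the kernels should be streaming from DRAM.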

stats.txt (dram related):

system.mem_ctrls.dram.bytesRead      238807808    # Total bytes read (Byte)
system.mem_ctrls.dram.bytesWritten   121179776    # Total bytes written (Byte)
system.mem_ctrls.dram.avgRdBW        718.689026   # Average DRAM read bandwidth in MiBytes/s ((Byte/Second))
system.mem_ctrls.dram.avgWrBW        364.688977   # Average DRAM write bandwidth in MiBytes/s ((Byte/Second))
system.mem_ctrls.dram.peakBW         12800.00     # Theoretical peak bandwidth in MiByte/s ((Byte/Second))
system.mem_ctrls.dram.busUtil        8.46         # Data bus utilization in percentage (Ratio)
system.mem_ctrls.dram.busUtilRead    5.61         # Data bus utilization in percentage for reads (Ratio)
system.mem_ctrls.dram.busUtilWrite   2.85         # Data bus utilization in percentage for writes (Ratio)
system.mem_ctrls.dram.pageHitRate    40.57        # Row buffer hit rate, read and write combined (Ratio)
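
The numbers in this excerpt are internally consistent, which can be checked with a few lines of arithmetic. The sketch below assumes that DDR3_1600_8x8 means 1600 MT/s on a 64-bit (8-byte) channel, and that bus utilization is simply total DRAM bandwidth divided by peak bandwidth; both are my reading of the stats, not something confirmed from the gem5 source.

```python
# Values copied from the stats.txt excerpt for run 1 (no L2 prefetcher).
avg_rd_bw = 718.689026   # system.mem_ctrls.dram.avgRdBW
avg_wr_bw = 364.688977   # system.mem_ctrls.dram.avgWrBW
peak_bw = 12800.00       # system.mem_ctrls.dram.peakBW

# Assumed: DDR3-1600 on a 64-bit channel = 1600 MT/s * 8 bytes per transfer.
assumed_peak = 1600 * 8

total_bw = avg_rd_bw + avg_wr_bw
bus_util_pct = total_bw / peak_bw * 100.0

print(f"peak check: {assumed_peak} vs reported {peak_bw:.0f}")
print(f"total DRAM bandwidth: {total_bw:.1f}")
print(f"bus utilization: {bus_util_pct:.2f}%")
```

The computed utilization agrees with the reported busUtil of 8.46, and the peak of 12800 equals 1600 x 8, which suggests the bandwidth stats are in decimal MB/s even though the descriptions say MiBytes/s.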




2. X86 O3CPU, SE-mode, w/ l2 prefetcher:

./build/X86/gem5.opt --outdir=m5out-stream-l2hwp configs/example/se.py 
--cpu-type=O3CPU --caches --l1d_size=256kB --l1i_size=256kB --l2cache 
--l2_size=8MB --l2-hwp-type=StridePrefetcher --mem-type=DDR3_1600_8x8 -c 
../stream/stream 

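STREAM output:

-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            1703.9     0.009390     0.009390     0.009390
Scale:           1718.6     0.009310     0.009310     0.009310
Add:             2087.3     0.011498     0.011498     0.011498
Triad:           2227.2     0.010776     0.010776     0.010776
-------------------------------------------------------------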

stats.txt (dram related):

system.mem_ctrls.dram.bytesRead      238811712    # Total bytes read (Byte)
system.mem_ctrls.dram.bytesWritten   121179840    # Total bytes written (Byte)
system.mem_ctrls.dram.avgRdBW        1014.129912  # Average DRAM read bandwidth in MiBytes/s ((Byte/Second))
system.mem_ctrls.dram.avgWrBW        514.598298   # Average DRAM write bandwidth in MiBytes/s ((Byte/Second))
system.mem_ctrls.dram.peakBW         12800.00     # Theoretical peak bandwidth in MiByte/s ((Byte/Second))
system.mem_ctrls.dram.busUtil        11.94        # Data bus utilization in percentage (Ratio)
system.mem_ctrls.dram.busUtilRead    7.92         # Data bus utilization in percentage for reads (Ratio)
system.mem_ctrls.dram.busUtilWrite   4.02         # Data bus utilization in percentage for writes (Ratio)
system.mem_ctrls.dram.pageHitRate    75.37        # Row buffer hit rate, read and write combined (Ratio)
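
To put the two runs side by side, the sketch below computes the per-kernel speedup from the L2 stride prefetcher and the fraction of peakBW reached with it. The rates are copied from the two STREAM tables; comparing STREAM's reported rate directly against peakBW is a simplification, since STREAM counts only the bytes the kernel nominally touches.

```python
# Best rates (MB/s) copied from the two STREAM tables.
no_prefetch = {"Copy": 1099.0, "Scale": 1089.7, "Add": 1213.0, "Triad": 1222.1}
l2_stride = {"Copy": 1703.9, "Scale": 1718.6, "Add": 2087.3, "Triad": 2227.2}
PEAK_BW = 12800.0  # system.mem_ctrls.dram.peakBW from stats.txt

# Ratio of prefetcher run to baseline, and share of peak with the prefetcher.
speedup = {k: l2_stride[k] / no_prefetch[k] for k in no_prefetch}
peak_fraction = {k: l2_stride[k] / PEAK_BW for k in l2_stride}

for k in no_prefetch:
    print(f"{k:6s} speedup {speedup[k]:.2f}x, "
          f"{peak_fraction[k] * 100:.1f}% of peak with prefetcher")
```

The prefetcher helps by roughly 1.5x to 1.8x depending on the kernel, but even the best case (Triad) stays under 18% of the theoretical peak.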




STREAM compiling options:

gcc -O2 -static -DSTREAM_ARRAY_SIZE=100 -DNTIMES=2 stream.c -o stream

All experiments were run on the latest stable version 
(141cc37c2d4b93959d4c249b8f7e6a8b2ef75338, v21.2.1).

  Thank you very much!




Best Regards,

Zicong

