Hey all,
I really need some input on this one.
I was running BBench and noticed that the run times for architectures with an
L2 cache of any size were MUCH slower than for architectures with no L2 cache.
For instance, when loading Twitter, two of the warm-start times per
architecture were:
0.5 GHz, 16 kB L1 inst cache, 16 kB L1 data cache, no L2 cache: 2.547 seconds
0.5 GHz, 16 kB L1 inst cache, 16 kB L1 data cache, 1024 kB L2 cache: 5.498 seconds
That's roughly a factor-of-2 slowdown with an L2 cache versus without one.
More results for Twitter (I have even more than this, but just want to show
the pattern):
1.0 GHz, 16 kB L1 inst cache, 16 kB L1 data cache, no L2 cache: 1.666 seconds
1.0 GHz, 16 kB L1 inst cache, 16 kB L1 data cache, 1024 kB L2 cache: 2.002 seconds
1.5 GHz, 16 kB L1 inst cache, 16 kB L1 data cache, no L2 cache: 1.697 seconds
1.5 GHz, 16 kB L1 inst cache, 16 kB L1 data cache, 1024 kB L2 cache: 1.991 seconds
1.5 GHz, 16 kB L1 inst cache, 16 kB L1 data cache, 2048 kB L2 cache: 1.578 seconds
*My basic commands:*
./build/ARM/gem5.fast -v --dump-config=config_single_twitter.ini
--outdir=m5out_single_twitter_05GHz_64kB_0kB configs/example/fs.py -b
bbench-gb
--kernel=/home/gyessin/bbench1site/dist_twitter/m5/system/binaries/vmlinux.smp.mouse.arm
--frame-capture --checkpoint-dir=checkpoint_single_twitter
--disk-image=/home/gyessin/bbench1site/dist_twitter/m5/system/disks/ARMv7a-Gingerbread-Android.SMP.mouse.nolock.img
--caches -s 300000000 -r 1 --l1d_size=64kB --l1i_size=64kB --clock=0.5GHz
and
./build/ARM/gem5.fast -v --dump-config=config_single_msn.ini
--outdir=m5out_single_msn_05GHz_16kB_1024kB configs/example/fs.py -b
bbench-gb
--kernel=/home/gyessin/bbench1site/dist_msn/m5/system/binaries/vmlinux.smp.mouse.arm
--frame-capture --checkpoint-dir=checkpoint_single_msn
--disk-image=/home/gyessin/bbench1site/dist_msn/m5/system/disks/ARMv7a-Gingerbread-Android.SMP.mouse.nolock.img
--caches -s 300000000 -r 1 --l1d_size=16kB --l1i_size=16kB --l2cache
--l2_size=1024kB --clock=0.5GHz
(They're restoring from a checkpoint taken right after the sleep 10
in gem5/configs/boot/bbench-gb.rcS, and they're running on a version cloned
from the development repository only a few days ago.)
Looking at configs/common/O3_ARM_v7a.py (relevant bits copied below), unless
I'm misinterpreting something, it would appear that:
L1 instruction latency = 1 cycle (reasonable)
L1 data latency = 2 cycles (reasonable)
TLB walk cache latency = 4 cycles (a little low, I think, but fine)
L2 cache latency = 12 cycles (reasonable)
*Memory write latency = memory read latency = 2 cycles (AS LOW AS L1
DATA?! That seems absurd!)*
Am I understanding this right, or did I misinterpret the code? It really does
seem absurd; I would expect MemWrite and MemRead to be around 200 CPU cycles,
correct?
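To illustrate why the result surprises me, here is a back-of-the-envelope
average-memory-access-time (AMAT) calculation. The hit/miss rates below are
made-up illustrative numbers, and the 200-cycle DRAM figure is my assumption
from above; the 2- and 12-cycle figures are the L1-data and L2 latencies I
read out of O3_ARM_v7a.py. With any plausible numbers like these, adding an
L2 should make memory access faster, not slower:

```python
def amat(hit_lat, miss_rate, miss_penalty):
    """Average memory access time: hit latency plus the miss
    penalty weighted by the miss rate."""
    return hit_lat + miss_rate * miss_penalty

# Assumed numbers: 2-cycle L1 data hit, 10% L1 miss rate,
# 12-cycle L2 hit, 20% L2 miss rate, 200-cycle DRAM access.
mem_lat = 200.0
no_l2 = amat(2, 0.10, mem_lat)                    # L1 misses go straight to DRAM
with_l2 = amat(2, 0.10, amat(12, 0.20, mem_lat))  # L1 misses go to L2 first

print(no_l2)    # 22.0 cycles
print(with_l2)  # 7.2 cycles
```

So by this simple model the L2 configuration should be about 3x faster on
average memory accesses, which is the opposite of what I'm measuring.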
*By the way, I'm not trying to be inflammatory or to insult anyone who might
have edited the code; I'm just trying to get to the bottom of this ASAP so I
can meet my paper deadlines.*
Any input on this would be greatly appreciated.
*Relevant Parts of O3_ARM_v7a.py:*
....
# Load/Store Units
class O3_ARM_v7a_Load(FUDesc):
    opList = [ OpDesc(opClass='MemRead', opLat=2) ]
    count = 1

class O3_ARM_v7a_Store(FUDesc):
    opList = [ OpDesc(opClass='MemWrite', opLat=2) ]
    count = 1
....
# Instruction Cache
class O3_ARM_v7a_ICache(BaseCache):
    hit_latency = 1
    response_latency = 1
...
# Data Cache
class O3_ARM_v7a_DCache(BaseCache):
    hit_latency = 2
    response_latency = 2
...
# TLB Cache
# Use a cache as a L2 TLB
class O3_ARM_v7aWalkCache(BaseCache):
    hit_latency = 4
    response_latency = 4
...
# L2 Cache
class O3_ARM_v7aL2(BaseCache):
    hit_latency = 12
    response_latency = 12
...
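One experiment I'm considering to help debug this (just a sketch against the
classes quoted above; I haven't verified that it changes anything) is to
subclass the load/store functional-unit descriptors with a much larger opLat
and see whether the run time scales with it:

```python
# Sketch only: raise the load/store functional-unit latencies to test
# whether opLat is what dominates the run time. Assumes the FUDesc and
# OpDesc classes from configs/common/O3_ARM_v7a.py quoted above.
class O3_ARM_v7a_Load_Slow(FUDesc):
    opList = [ OpDesc(opClass='MemRead', opLat=200) ]  # was 2
    count = 1

class O3_ARM_v7a_Store_Slow(FUDesc):
    opList = [ OpDesc(opClass='MemWrite', opLat=200) ]  # was 2
    count = 1
```

If the run time barely moves, that would suggest opLat is something other
than the memory latency (and that the real memory latency comes from
elsewhere in the config), which is exactly what I'd like to confirm.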
_______________________________________________
gem5-users mailing list
[email protected]
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users