Dear all,
Thanks a lot for your explanations below. I am now sticking to the classic crossbar (Xbar) memory system rather than Ruby, and I accept that state transitions for cache coherence take zero time in this case.
However, today I studied the Exec debug trace again for an ALPHA full-system simulation and found the following interesting entries:
3334580479000: system.switch_cpus02 T0 : 0x12000867c : ldq r2,29968(r1) : MemRead : A=0x1200adda8
......
3334580495000: system.switch_cpus02 T0 : 0x12000867c : ldq r2,29968(r1) : MemRead : D=0x00000001200adda8 A=0x1200adda8
In the first entry, cpu02 tries to read from address A=0x1200adda8, but no data is shown. Some time later, in the second entry, the same core at the same instruction address accesses the same data address with the same registers, but this time valid data is returned: D=0x00000001200adda8.
Can I interpret the first entry as the memory access request and the second entry as the data response? Does this have something to do with a cache miss?
If you compare this with the Cache debug trace, you will find that the first entry does not appear in the cache trace at all; only the second entry is noted there.
So what happened at the first entry?
I should add that this kind of access makes up only a very small percentage of all memory accesses. Most memory accesses already have their data at the first entry.
There are also other kinds of entries in the Exec trace that have neither a data address A=0x... nor returned data D=0x...; an example is below:
3334580433000: system.switch_cpus00 T0 : @iowrite8+36 : mb : MemRead :
How can these be explained?
PS: I still don't know how to reply to an existing topic on the gem5 mailing list instead of opening a new one. How is that done?
Thanks in advance.
Best regards,
Mengyu
________________________________
From: mengyu liang <[email protected]>
Sent: Sunday, November 6, 2016 22:03
To: gem5 forum
Subject: Understanding of cache trace of ALPHA timing CPU
Hello everyone,
Recently I have been studying memory access time (i.e., the duration of memory loads and stores) in terms of CPU cycles in a multicore system. I settled on the ALPHA timing CPU and have run several full-system simulations with PARSEC workloads. To look into the details of the memory access procedure, I turned on the Cache debug trace.
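For reference, my invocation looks roughly like this (the build path and script arguments are from my own setup and may differ for you):

```shell
# Enable the Cache (and optionally Exec) debug flags and write the
# trace to a file instead of stdout. Paths are from my checkout.
build/ALPHA/gem5.opt \
    --debug-flags=Cache,Exec \
    --debug-file=cache.trace \
    configs/example/fs.py \
    --cpu-type=timing --num-cpus=4
```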
However, I was very disappointed to see that the entire memory access is treated "atomically". To illustrate my doubt, here is a Cache trace segment:
3587305218000: system.cpu3.dcache: ReadReq addr 0x6bcac8 size 8 (ns) miss
3587305218000: system.cpu3.dcache: createMissPacket created ReadSharedReq from ReadReq for addr 0x6bcac0 size 32
3587305218000: system.cpu3.dcache: Sending an atomic ReadSharedReq for 0x6bcac0 (ns)
3587305218000: system.cpu0.dcache: handleSnoop snoop hit for CleanEvict addr 0x8601c0 size 32, old state is state: 5 (S) valid: 1 writable: 0 readable: 1 dirty: 0 tag: 10c03
3587305218000: system.cpu0.dcache: Found addr 0x8601c0 in upper level cache for snoop CleanEvict from lower cache
3587305218000: system.cpu1.dcache: handleSnoop snoop hit for CleanEvict addr 0x8601c0 size 32, old state is state: 5 (S) valid: 1 writable: 0 readable: 1 dirty: 0 tag: 10c03
3587305218000: system.cpu1.dcache: Found addr 0x8601c0 in upper level cache for snoop CleanEvict from lower cache
3587305218000: system.cpu3.dcache: Receive response: ReadResp for addr 0x6bcac0 (ns) in state 0
3587305218000: system.cpu3.dcache: replacement: replacing 0x3f0d0040 (ns) with 0x6bcac0 (ns): writeback
3587305218000: system.cpu3.dcache: Create Writeback 0x3f0d0040 writable: 1, dirty: 1
3587305218000: system.cpu3.dcache: Block addr 0x6bcac0 (ns) moving from state 0 to state: 7 (E) valid: 1 writable: 1 readable: 1 dirty: 0 tag: d795
As you can see above, cpu3 initiates a read request at the very beginning but encounters a cache miss, which triggers a series of cache actions due to cache coherence. However, they ALL take place at the same tick, as if every memory access, whether a cache miss or a hit, takes ZERO time!
According to the gem5 documentation, the TimingSimpleCPU is the version of SimpleCPU that uses timing memory accesses: it stalls on cache accesses and waits for the memory system to respond before proceeding. Based on that, I did not expect atomic-like behavior from the timing CPU; each memory access should exhibit a non-zero duration.
Does anybody have the same experience and can explain the reason for this?
Alternatively, is there a CPU model that behaves non-atomically and can be used in a multicore system? As far as I know, only the O3 CPU does this, but it is out of order, and I need an in-order CPU.
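For completeness, the relevant part of my configuration script looks roughly like this (a sketch only; the cache hierarchy, memory, and port wiring are omitted):

```python
# Sketch of my CPU instantiation: TimingSimpleCPU cores, which should
# stall on cache accesses rather than completing them atomically.
# Note that mem_mode must be 'timing', not 'atomic', for that to happen.
import m5
from m5.objects import System, TimingSimpleCPU, SrcClockDomain, VoltageDomain

system = System()
system.clk_domain = SrcClockDomain(clock="2GHz",
                                   voltage_domain=VoltageDomain())
system.mem_mode = "timing"
system.cpu = [TimingSimpleCPU(cpu_id=i) for i in range(4)]
```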
Thanks and best regards,
Mengyu Liang
_______________________________________________
gem5-users mailing list
[email protected]
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users