Dear all,
Thanks a lot for all your explanations below. I am now sticking to the classic
crossbar (Xbar) memory system rather than the Ruby one, and I accept that state
transitions for cache coherency take zero time in this case.
However, today I studied the Exec debug trace of the ALPHA full-system
simulation again and found the following interesting entries:
3334580479000: system.switch_cpus02 T0 : 0x12000867c : ldq
r2,29968(r1) : MemRead : A=0x1200adda8
......
3334580495000: system.switch_cpus02 T0 : 0x12000867c : ldq
r2,29968(r1) : MemRead : D=0x00000001200adda8 A=0x1200adda8
You can see that in the first entry cpu02 tries to read from address
A=0x1200adda8, but no data is shown. Some time later, in the second entry, the
same core at the same instruction address accesses the same data address with
the same registers, but this time valid data is returned: D=0x00000001200adda8.
Can I interpret the first entry as the memory access request and the second as
the data response? Does it have something to do with a cache miss?
If you compare this with the Cache debug trace, you will find that the first
entry is not recorded there; only the second entry appears in the Cache trace.
So what happened at the first entry?
I should add that this kind of access makes up only a very small percentage of
all memory accesses; most accesses obtain their data already at the first entry.
There are also other kinds of memory accesses in the Exec trace that show
neither a data address (A=0x...) nor returned data (D=0x...). An example is
below:
3334580433000: system.switch_cpus00 T0 : @iowrite8+36 : mb
: MemRead :
How can this be explained?
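In case it helps anyone reproduce this, both traces can be produced together by combining debug flags along these lines (the binary path and config script are placeholders for whatever your build uses):

```shell
# Hypothetical invocation: enable both the Exec and Cache debug flags and
# write the combined, timestamp-interleaved output to a single file.
# Adjust the build path and config script for your own setup.
./build/ALPHA/gem5.opt \
    --debug-flags=Exec,Cache \
    --debug-file=combined_trace.out \
    configs/example/fs.py
```

Having both flags in one file makes it easy to see which Exec entries have a matching Cache entry at the same tick.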
PS: I still don't know how to reply to an existing topic on the gem5 mailing
list instead of opening a new one.
Thanks in advance.
Best regards,
Mengyu
Hi all,
The classic memory system avoids a lot of complexity in the cache state
machines by performing the state transitions in zero time. Note that it does
not complete the packet transfer in zero time though, and it pays for the
instant request propagation either in the downstream component, or on the
response path. There are fields in the packet that accumulate the “unpaid”
snoop latency. You can run a multi-core lmbench-like benchmark if you want to
convince yourself it is doing the right thing.
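As a toy illustration of the accounting idea described above (this is deliberately not gem5's actual Packet or crossbar code; all class and field names here are made up), the "unpaid" latency scheme can be sketched as:

```python
# Toy sketch: a request propagates through the interconnect in zero
# simulated time, but each hop records the latency it "owes" in the
# packet; the debt is settled on the response path.

class Packet:
    def __init__(self):
        # Latency owed but not yet charged to simulated time.
        self.unpaid_snoop_latency = 0

class Crossbar:
    """Forwards requests instantly, recording the latency as a debt."""
    SNOOP_LATENCY = 3  # hypothetical cycles per snoop hop

    def forward(self, pkt):
        pkt.unpaid_snoop_latency += self.SNOOP_LATENCY
        return pkt

class Memory:
    """Downstream component that settles the accumulated debt."""
    ACCESS_LATENCY = 100  # hypothetical cycles

    def respond(self, pkt):
        total = self.ACCESS_LATENCY + pkt.unpaid_snoop_latency
        pkt.unpaid_snoop_latency = 0  # debt paid
        return total

xbar = Crossbar()
mem = Memory()
pkt = Packet()
xbar.forward(pkt)           # first hop: zero time, debt recorded
xbar.forward(pkt)           # second hop in a hierarchical crossbar
latency = mem.respond(pkt)  # debt is paid with the response
print(latency)              # 106
```

The net effect is that total latency is preserved even though the request itself appears to propagate in zero time, which is why the trace shows everything at one tick while aggregate timing still comes out right.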
As a result of the aforementioned functionality, I would argue the classic
memory system is actually a good representation of a hierarchical
crossbar-based system with a MOESI protocol. It is also a lot faster than Ruby,
and far more flexible. In the end it depends on what you want to accomplish.
For most system-level performance exploration I would suggest classic. For
detailed interconnect topologies or coherency protocols, go with Ruby.
I hope that helps.
Andreas
From: gem5-users
<[email protected]<mailto:[email protected]>> on behalf of
Jason Lowe-Power <[email protected]<mailto:[email protected]>>
Reply-To: gem5 users mailing list
<[email protected]<mailto:[email protected]>>
Date: Monday, 7 November 2016 at 14:30
To: gem5 users mailing list <[email protected]<mailto:[email protected]>>
Subject: Re: [gem5-users] Understanding of cache trace of ALPHA timing CPU
In addition to what Rodrigo says, if you want to model a cache-coherent memory
system in detail, you should be using the Ruby memory system, not the classic
caches. Ruby performs all coherence actions in detailed timing mode.
Also, for an in order CPU, you may want to try out the MinorCPU. It works well
with ARM, and somewhat with x86. I'm not sure if it will work with Alpha.
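Switching CPU models is typically just a command-line option in the stock example scripts; something along these lines (option names assume the standard se.py script, and the paths are placeholders):

```shell
# Hypothetical example: run with the in-order MinorCPU and classic caches.
./build/ARM/gem5.opt configs/example/se.py \
    --cpu-type=MinorCPU --caches --l2cache \
    --cmd=/path/to/your/benchmark
```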
Cheers,
Jason
On Mon, Nov 7, 2016 at 5:52 AM Rodrigo Cataldo
<[email protected]<mailto:[email protected]>> wrote:
Hello Mengyu Liang,
I would recommend that you check out Uri Wiener's thesis, "Modeling and
Analysis of a Cache Coherent Interconnect", in which he describes the decisions
made in the implementation of the CCI model in gem5.
quoting page 25: "Snoop requests from the slave are handled and forwarded in
zero time. This major inaccuracy is intended for avoiding race conditions in
the memory system, and mostly the need to implement transition-states in the
cache-controller."
On Sun, Nov 6, 2016 at 7:03 PM, mengyu liang
<[email protected]<mailto:[email protected]>> wrote:
Hello everyone,
Recently I have been studying memory access time (i.e. the duration of memory
loads and stores) in terms of CPU cycles in a multicore system. I settled on
the ALPHA timing CPU and have run several full-system simulations with PARSEC
workloads. To look into the details of the memory access procedure, I turned on
the Cache debug trace.
However, I was very disappointed to see that the entire memory access is
treated "atomically". To illustrate my doubt, I paste the following Cache trace
segment:
3587305218000: system.cpu3.dcache: ReadReq addr 0x6bcac8 size 8 (ns) miss
3587305218000: system.cpu3.dcache: createMissPacket created ReadSharedReq from
ReadReq for addr 0x6bcac0 size 32
3587305218000: system.cpu3.dcache: Sending an atomic ReadSharedReq for 0x6bcac0
(ns)
3587305218000: system.cpu0.dcache: handleSnoop snoop hit for CleanEvict addr
0x8601c0 size 32, old state is state: 5 (S) valid: 1 writable: 0 readable: 1
dirty: 0 tag: 10c03
3587305218000: system.cpu0.dcache: Found addr 0x8601c0 in upper level cache for
snoop CleanEvict from lower cache
3587305218000: system.cpu1.dcache: handleSnoop snoop hit for CleanEvict addr
0x8601c0 size 32, old state is state: 5 (S) valid: 1 writable: 0 readable: 1
dirty: 0 tag: 10c03
3587305218000: system.cpu1.dcache: Found addr 0x8601c0 in upper level cache for
snoop CleanEvict from lower cache
3587305218000: system.cpu3.dcache: Receive response: ReadResp for addr 0x6bcac0
(ns) in state 0
3587305218000: system.cpu3.dcache: replacement: replacing 0x3f0d0040 (ns) with
0x6bcac0 (ns): writeback
3587305218000: system.cpu3.dcache: Create Writeback 0x3f0d0040 writable: 1,
dirty: 1
3587305218000: system.cpu3.dcache: Block addr 0x6bcac0 (ns) moving from state 0
to state: 7 (E) valid: 1 writable: 1 readable: 1 dirty: 0 tag: d795
As you can see above, cpu3 initiates a read request at the very beginning but
encounters a cache miss, which triggers a series of cache actions due to cache
coherency. However, they ALL take place at the same time tick, as if every
memory access, whether a cache miss or a hit, took ZERO time!
According to the gem5 documentation, the TimingSimpleCPU is the version of
SimpleCPU that uses timing memory accesses: it stalls on cache accesses and
waits for the memory system to respond before proceeding. Based on that, I did
not expect atomic-like behavior from a timing CPU; each memory access should
exhibit a non-zero duration.
Does anybody have the same experience and can explain the reason for this?
Or is there a CPU model that behaves non-atomically and can be used in a
multicore system? As far as I know, only the O3 CPU does this, but it is out of
order, and I need an in-order CPU.
Thanks and best regards,
Mengyu Liang
_______________________________________________
gem5-users mailing list
[email protected]<mailto:[email protected]>
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users