Hi all,

The classic memory system avoids a lot of complexity in the cache state 
machines by performing the state transitions in zero time. Note that it does 
not complete the packet transfer in zero time though, and it pays for the 
instant request propagation either in the downstream component, or on the 
response path. There are fields in the packet that accumulate the “unpaid” 
snoop latency. You can run a multi-core lmbench-like benchmark if you want to 
convince yourself it is doing the right thing.
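
The mechanism can be sketched in a few lines of Python. This is purely illustrative, not gem5's actual code: the field and method names here are invented for the sketch. The point is only that gem5 keeps the accumulated delay in fields on the packet, and the component that finally handles the response pays it:

```python
# Toy sketch (not gem5 code): a snoop is handled in zero time, but the
# latency it "should" have cost is recorded on the packet and charged
# later, on the response path.

class Packet:
    def __init__(self):
        self.unpaid_delay = 0  # hypothetical field standing in for gem5's delay fields

class Cache:
    def __init__(self, snoop_latency):
        self.snoop_latency = snoop_latency

    def handle_snoop(self, pkt, now):
        # The state transition happens instantly at 'now'...
        # ...but the latency is accumulated on the packet instead of advancing time.
        pkt.unpaid_delay += self.snoop_latency
        return now  # zero time passes here

    def handle_response(self, pkt, now):
        # The accumulated "unpaid" delay is charged when the response is handled.
        done = now + pkt.unpaid_delay
        pkt.unpaid_delay = 0
        return done

pkt = Packet()
l1 = Cache(snoop_latency=4)
l2 = Cache(snoop_latency=8)
t = 100
t = l1.handle_snoop(pkt, t)    # still 100: snoop is instantaneous
t = l2.handle_snoop(pkt, t)    # still 100
t = l1.handle_response(pkt, t) # 112: the 4 + 8 cycles are paid here
```

So the total latency comes out right in the end; it is only the point at which it is charged that is shifted.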

Given this behavior, I would argue the classic memory system is actually a good 
representation of a hierarchical crossbar-based system with a MOESI protocol. 
It is also a lot faster than Ruby, and far more flexible. In the end it depends 
on what you want to accomplish: for most system-level performance exploration I 
would suggest classic; for detailed interconnect topologies or coherence 
protocols, go with Ruby.
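
For what it is worth, with the stock example scripts the choice is mostly a matter of configuration. Below is a minimal sketch of a classic-cache config fragment; it assumes a gem5 checkout, and the parameter names vary somewhat between gem5 versions, so treat it as a sketch rather than copy-paste material. Ruby, by contrast, is typically selected by building in a Ruby protocol and passing the --ruby option to the example scripts.

```python
# Sketch of a classic-memory-system L1 data cache in a gem5 config script.
# Parameter names follow the classic Cache SimObject but may differ across
# gem5 versions -- check src/mem/cache/Cache.py in your tree.
from m5.objects import Cache

class L1DCache(Cache):
    size = '32kB'
    assoc = 2
    hit_latency = 2       # cycles; newer gem5 splits this into tag/data latencies
    response_latency = 2
    mshrs = 4             # number of outstanding misses supported
    tgts_per_mshr = 20

# Hook it up in your config, roughly:
# system.cpu.dcache = L1DCache()
# system.cpu.dcache_port = system.cpu.dcache.cpu_side
```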

I hope that helps.

Andreas

From: gem5-users <[email protected]> on behalf of 
Jason Lowe-Power <[email protected]>
Reply-To: gem5 users mailing list <[email protected]>
Date: Monday, 7 November 2016 at 14:30
To: gem5 users mailing list <[email protected]>
Subject: Re: [gem5-users] Understanding of cache trace of ALPHA timing CPU

In addition to what Rodrigo says, if you want to model a cache-coherent memory 
system in detail, you should be using the Ruby memory system, not the classic 
caches. Ruby performs all coherence actions in detailed timing mode.

Also, for an in-order CPU, you may want to try out the MinorCPU. It works well 
with ARM, and somewhat with x86. I'm not sure if it will work with Alpha.
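
If it helps, switching CPU model with the stock scripts is usually just --cpu-type=MinorCPU on the command line; in a hand-written config it looks roughly like the sketch below (it assumes a built gem5 with MinorCPU compiled in, and that 'system' is your already-constructed System object).

```python
# Sketch: instantiating the in-order MinorCPU in a gem5 config script.
from m5.objects import MinorCPU

system.cpu = MinorCPU()
system.cpu.createInterruptController()  # needed on some ISAs (notably x86)
```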

Cheers,
Jason

On Mon, Nov 7, 2016 at 5:52 AM Rodrigo Cataldo <[email protected]> wrote:
Hello Mengyu Liang,
I would recommend that you check out Uri Wiener's thesis ("Modeling and 
Analysis of a Cache Coherent Interconnect").

He describes the decisions made in the implementation of the CCI model in gem5.

quoting page 25: "Snoop requests from the slave are handled and forwarded in 
zero time. This major inaccuracy is intended for avoiding race conditions in 
the memory system, and mostly the need to implement transition-states in the
cache-controller."

On Sun, Nov 6, 2016 at 7:03 PM, mengyu liang <[email protected]> wrote:

Hello everyone,

Recently I have been studying memory access time (i.e., the duration of memory 
loads and stores) in terms of CPU cycles in a multicore system. I came across 
the Alpha timing CPU and have run several full-system simulations with PARSEC 
workloads. To look into the details of the memory access procedure, I turned 
on the Cache debug trace.

However, I was very disappointed to see that the entire memory access is 
treated "atomically". To illustrate my doubt, I paste the following Cache 
trace segment:


3587305218000: system.cpu3.dcache: ReadReq addr 0x6bcac8 size 8 (ns) miss
3587305218000: system.cpu3.dcache: createMissPacket created ReadSharedReq from 
ReadReq for  addr 0x6bcac0 size 32
3587305218000: system.cpu3.dcache: Sending an atomic ReadSharedReq for 0x6bcac0 
(ns)
3587305218000: system.cpu0.dcache: handleSnoop snoop hit for CleanEvict addr 
0x8601c0 size 32, old state is state: 5 (S) valid: 1 writable: 0 readable: 1 
dirty: 0 tag: 10c03
3587305218000: system.cpu0.dcache: Found addr 0x8601c0 in upper level cache for 
snoop CleanEvict from lower cache
3587305218000: system.cpu1.dcache: handleSnoop snoop hit for CleanEvict addr 
0x8601c0 size 32, old state is state: 5 (S) valid: 1 writable: 0 readable: 1 
dirty: 0 tag: 10c03
3587305218000: system.cpu1.dcache: Found addr 0x8601c0 in upper level cache for 
snoop CleanEvict from lower cache
3587305218000: system.cpu3.dcache: Receive response: ReadResp for addr 0x6bcac0 
(ns) in state 0
3587305218000: system.cpu3.dcache: replacement: replacing 0x3f0d0040 (ns) with 
0x6bcac0 (ns): writeback
3587305218000: system.cpu3.dcache: Create Writeback 0x3f0d0040 writable: 1, 
dirty: 1
3587305218000: system.cpu3.dcache: Block addr 0x6bcac0 (ns) moving from state 0 
to state: 7 (E) valid: 1 writable: 1 readable: 1 dirty: 0 tag: d795


As you can see above, cpu3 initiates a read request at the very beginning but 
encounters a cache miss, which triggers a series of cache actions due to cache 
coherence. However, they ALL take place at the same time tick, as if every 
memory access, whether it is a cache miss or a hit, takes ZERO time!


As per the gem5 documentation, the TimingSimpleCPU is the version of SimpleCPU 
that uses timing memory accesses. It stalls on cache accesses and waits for 
the memory system to respond before proceeding. Based on that, I did not 
expect atomic-like behavior from the timing CPU; it should have exhibited a 
non-zero duration for each memory access.


Has anybody had the same experience, and can anyone explain the reason for this?


Or is there a CPU model that behaves non-atomically and can be used in a 
multicore system? As far as I know, only the O3 CPU does this, but it is 
out-of-order, and I need an in-order CPU.


Thanks and best regards,

Mengyu Liang


_______________________________________________
gem5-users mailing list
[email protected]<mailto:[email protected]>
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
