Hi all,

The classic memory system avoids a lot of complexity in the cache state machines by performing the state transitions in zero time. Note that it does not complete the packet transfer in zero time, though, and it pays for the instant request propagation either in the downstream component or on the response path. There are fields in the packet that accumulate the "unpaid" snoop latency. You can run a multi-core lmbench-like benchmark if you want to convince yourself it is doing the right thing.
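To make the "unpaid latency" idea concrete, here is a toy Python sketch (illustrative only, not gem5's actual Packet class or field names): delays incurred while requests and snoops propagate "instantly" are recorded in the packet, and the debt is settled when the response completes.

```python
# Toy model of deferred latency accounting, as described above.
# NOT gem5 code: class and method names here are invented for illustration.
class Packet:
    def __init__(self):
        self.header_delay = 0  # latency accumulated but not yet charged

    def accumulate_snoop_delay(self, ticks):
        # Called while the request/snoop propagates in zero simulated time;
        # the cost is recorded in the packet instead of advancing the clock.
        self.header_delay += ticks


def pay_on_response(current_tick, pkt):
    # The downstream component (or the response path) settles the debt:
    # the response completion time includes the accumulated delay.
    completion = current_tick + pkt.header_delay
    pkt.header_delay = 0
    return completion


pkt = Packet()
pkt.accumulate_snoop_delay(500)   # snoop forwarded in zero time, cost recorded
pkt.accumulate_snoop_delay(250)   # another hop, more recorded cost
print(pay_on_response(10_000, pkt))  # -> 10750
```

So although the debug trace shows every coherence event at the same tick, the latency is still accounted for by the time the response is delivered.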
As a result, I would argue the classic memory system is actually a good representation of a hierarchical crossbar-based system with a MOESI protocol. It is also a lot faster than Ruby, and far more flexible. In the end it depends on what you want to accomplish. For most system-level performance exploration I would suggest the classic memory system; for detailed interconnect topologies or coherence protocols, go with Ruby.

I hope that helps.

Andreas

From: gem5-users <[email protected]> on behalf of Jason Lowe-Power <[email protected]>
Reply-To: gem5 users mailing list <[email protected]>
Date: Monday, 7 November 2016 at 14:30
To: gem5 users mailing list <[email protected]>
Subject: Re: [gem5-users] Understanding of cache trace of ALPHA timing CPU

In addition to what Rodrigo says, if you want to model a cache-coherent memory system in detail, you should be using the Ruby memory system, not the classic caches. Ruby performs all coherence actions in detailed timing mode. Also, for an in-order CPU, you may want to try out the MinorCPU. It works well with ARM, and somewhat with x86. I'm not sure if it will work with Alpha.

Cheers,
Jason

On Mon, Nov 7, 2016 at 5:52 AM Rodrigo Cataldo <[email protected]> wrote:

Hello Mengyu Liang,

I would recommend that you check out Uri Wiener's thesis (Modeling and Analysis of a Cache Coherent Interconnect), in which he describes the decisions made in the implementation of the CCI model in gem5. Quoting page 25: "Snoop requests from the slave are handled and forwarded in zero time. This major inaccuracy is intended for avoiding race conditions in the memory system, and mostly the need to implement transition-states in the cache-controller."
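If you do go down the Ruby route, a typical build-and-run sequence looks something like the sketch below. Take it as a rough guide only: the ISA target, the protocol name, and the exact script flags are assumptions and depend on your gem5 version (the `--ruby` option comes from the stock example scripts when gem5 is compiled with a Ruby protocol).

```shell
# Build gem5 with a Ruby coherence protocol compiled in
# (PROTOCOL name is illustrative; check build_opts/ for your version)
scons build/ALPHA/gem5.opt PROTOCOL=MESI_Two_Level -j8

# Run the stock full-system example script with Ruby instead of classic caches
./build/ALPHA/gem5.opt configs/example/fs.py --ruby --num-cpus=4
```

With `--ruby`, all coherence transitions go through the Ruby cache controllers and are modeled with non-zero timing, which is what the original question was after.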
On Sun, Nov 6, 2016 at 7:03 PM, mengyu liang <[email protected]> wrote:

Hello everyone,

Recently I have been studying memory access time (i.e. the duration of memory loads and stores) in terms of CPU cycles in a multicore system. I came across the Alpha timing CPU and have run several full-system simulations with PARSEC workloads. In order to look into the details of the memory access procedure, I turned on the Cache debug trace. However, I am very disappointed to see that the entire memory access is treated "atomically". To illustrate my doubt, I paste the following Cache trace segment:

3587305218000: system.cpu3.dcache: ReadReq addr 0x6bcac8 size 8 (ns) miss
3587305218000: system.cpu3.dcache: createMissPacket created ReadSharedReq from ReadReq for addr 0x6bcac0 size 32
3587305218000: system.cpu3.dcache: Sending an atomic ReadSharedReq for 0x6bcac0 (ns)
3587305218000: system.cpu0.dcache: handleSnoop snoop hit for CleanEvict addr 0x8601c0 size 32, old state is state: 5 (S) valid: 1 writable: 0 readable: 1 dirty: 0 tag: 10c03
3587305218000: system.cpu0.dcache: Found addr 0x8601c0 in upper level cache for snoop CleanEvict from lower cache
3587305218000: system.cpu1.dcache: handleSnoop snoop hit for CleanEvict addr 0x8601c0 size 32, old state is state: 5 (S) valid: 1 writable: 0 readable: 1 dirty: 0 tag: 10c03
3587305218000: system.cpu1.dcache: Found addr 0x8601c0 in upper level cache for snoop CleanEvict from lower cache
3587305218000: system.cpu3.dcache: Receive response: ReadResp for addr 0x6bcac0 (ns) in state 0
3587305218000: system.cpu3.dcache: replacement: replacing 0x3f0d0040 (ns) with 0x6bcac0 (ns): writeback
3587305218000: system.cpu3.dcache: Create Writeback 0x3f0d0040 writable: 1, dirty: 1
3587305218000: system.cpu3.dcache: Block addr 0x6bcac0 (ns) moving from state 0 to state: 7 (E) valid: 1 writable: 1 readable: 1 dirty: 0 tag: d795

As you can see above, cpu3 initiates a read request at the very beginning but encounters a cache miss.
This triggers a series of cache actions due to cache coherency. However, they ALL take place at the same time tick, as if every memory access, no matter whether it is a cache miss or a hit, takes ZERO time! As per the gem5 documentation: "The TimingSimpleCPU is the version of SimpleCPU that uses timing memory accesses. It stalls on cache accesses and waits for the memory system to respond prior to proceeding." Based on that, I did not expect atomic-like behavior from the timing CPU; it should have exhibited a non-zero duration for each memory access. Does anybody have the same experience and can explain the reason for that? Or is there any CPU model which behaves non-atomically and can be used in a multicore system? As far as I know, only the O3 CPU does this job, however it is out of order, and I need an in-order CPU.

Thanks and best regards,
Mengyu Liang

_______________________________________________
gem5-users mailing list
[email protected]
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
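The observation above ("everything happens at the same tick") can be verified mechanically from the pasted trace. A small Python check (trace lines copied from the message; the parsing is ad hoc, keyed to the `tick: component: message` layout of gem5 debug output):

```python
# Parse the timestamps from a few lines of the pasted Cache debug trace
# and confirm every coherence event shares one tick, illustrating the
# zero-time state transitions discussed in the replies above.
trace = """\
3587305218000: system.cpu3.dcache: ReadReq addr 0x6bcac8 size 8 (ns) miss
3587305218000: system.cpu0.dcache: handleSnoop snoop hit for CleanEvict addr 0x8601c0 size 32
3587305218000: system.cpu1.dcache: handleSnoop snoop hit for CleanEvict addr 0x8601c0 size 32
3587305218000: system.cpu3.dcache: Receive response: ReadResp for addr 0x6bcac0 (ns) in state 0
"""

# The tick is everything before the first colon on each line.
ticks = [int(line.split(":", 1)[0]) for line in trace.splitlines()]
print(set(ticks))  # -> {3587305218000}: one tick for the whole miss sequence
```

The point of the thread is that this is an accounting choice, not a timing bug: the transitions are instantaneous, but the accumulated latency is charged on the response path.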
