Quoting Vince Weaver <[email protected]>:

> On Wed, 16 Dec 2009, Steve Reinhardt wrote:
>
>> On Sun, Dec 13, 2009 at 8:57 PM, Vince Weaver <[email protected]> wrote:
>> > I did finish running and verifying spec2k on x86_64 (it took longer than
>> > it should have due to an unfortunate power-outage on our cluster). The
>> > benchmarks all finished, and the retired instruction count matches actual
>> > hardware perf counters very closely.
>> >
>> > http://www.csl.cornell.edu/~vince/projects/m5/m5_x86_64_se_status.html
>>
>> Wow, this is awesome! I missed this the first time through (didn't
>> scroll down to the end of the message). Thanks for all the effort,
>> Vince.
>>
>> Are you tracking uops as well as instructions? I'm curious how close
>> we are on that.
>
> uops for m5 are currently about 1.5x too many, when compared to AMD Phenom
> and Intel Core2 (slightly better, but not much, when compared against a
> Pentium D).
>
> It's slightly worse than 1.5 on integer spec2k and slightly better on fp.
>
> uops are tricky to get right, I imagine the values will be off unless you
> carefully use perf-counters and other tricks (or else have inside
> knowledge) to match real hardware. And even then, you'd only match a
> particular x86 implementation, there's wide variation between the various
> generations. I think PTLSim goes through a lot of trouble to make their
> uop counts match an AMD system, but I don't know how close they manage to
> get.
>
> besides retired instructions, m5 also does a good job (compared to real
> hardware) with L1 dcache accesses. I was hoping to validate some of the
> other stats, but it's hard to do that with OoO and detailed simulation not
> supported on x86.
>
> Vince

I've been thinking about this since reading your email, and it occurs to me
that the microops could be loads, ops, stores, or opstores and still fit
roughly into a RISC-style architecture. Stores have to wait around in the
store queue anyway, so they could wait there for the ALU to generate their
data without a significant penalty.

The most common sort of macroop is a load/op/store, where one operand is in
memory. In those cases, if you merge the op and the store, you'd go from 3
microops to 2, which would explain (in this simplified version of the world)
the 1.5x difference.

If you look at the SSE instructions, a lot of them are organized this way,
merging a single memory operation with a computation, although perhaps as
loadops instead of opstores (I forget the details).

Gabe
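
As a rough sketch of the arithmetic above: the function names, the
"reg-op" macroop kind, and the instruction mixes below are made up for
illustration and are not taken from m5 or from any real decoder. The model
only counts microops per macroop with and without op/store merging.

```python
# Toy model of the op/store merging idea. Everything here is hypothetical:
# it is not m5 code and does not describe any real x86 decoder.

def decode(macroop_kind, fuse_op_store):
    """Return the list of microops for one macroop in this simplified world."""
    if macroop_kind == "load-op-store":
        # Memory-destination macroop: read, compute, write back.
        if fuse_op_store:
            return ["load", "opstore"]      # op merged into the store
        return ["load", "op", "store"]      # fully split, RISC style
    if macroop_kind == "reg-op":
        # Register-only macroop (assumed to be one microop either way).
        return ["op"]
    raise ValueError("unknown macroop kind: %s" % macroop_kind)


def uop_ratio(mix):
    """Ratio of unfused to fused microop counts for a {kind: count} mix."""
    unfused = sum(len(decode(kind, False)) * n for kind, n in mix.items())
    fused = sum(len(decode(kind, True)) * n for kind, n in mix.items())
    return unfused / fused


# Only load/op/store macroops: 3 microops each versus 2.
print(uop_ratio({"load-op-store": 100}))

# Mixing in register-only macroops pulls the ratio below 1.5.
print(uop_ratio({"load-op-store": 50, "reg-op": 50}))
```

With only load/op/store macroops the ratio is exactly 3/2. Any macroop
that decodes the same way in both schemes dilutes it, so this model alone
would not account for a ratio above 1.5.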
