Hi,
I'm currently working on an analysis of L2 NUCA caches, and as part of
this I'm looking to produce a CPI stack detailing the relative
importance of each miss latency component at and beyond the L2 level. To
do this, I'm first looking to obtain the base CPI from L2's perspective,
i.e. the CPI obtained if all latencies at and beyond the L2 level were
to be zero. This will yield the number of cycles per instruction the CPU
spends on work unrelated to L2+ activity, and hence the cycles per
instruction that will remain no matter what happens at L2 and below.
I can measure this by setting all L2+ latencies to zero (or as close to
zero as possible); however, I would also like to be able to model this
base CPI so that it can be computed from an execution profile with the
L2+ latencies intact. To do this, I'm using the interval analysis model
as outlined by Eyerman, S.; Eeckhout, L.; Karkhanis, T. and Smith, J.E.
in "A Top-Down Approach to Architecting CPI Component Performance
Counters". For non-overlapping long latency loads, this model predicts
that the CPU will experience a cycle penalty given the amount of time
between the moment where the ROB fills up and IPC starts to drop (due to
dependencies on the missed load) and the moment where the load returns.
This model also recognizes that multiple outstanding long-latency loads
may overlap, so that the cycle penalty to the CPU is reduced by
approximately the level of MLP achieved during the overlapping misses.
So to apply this model, I will need to measure the amount of MLP at the
L2 level, the number of L2 misses (i.e. long-latency loads) and the full
latency of an L2 miss (i.e. all the way from the CPU to memory and
back). I have done this, but I have a couple of questions about whether
I'm measuring these quantities correctly.
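
For concreteness, here is roughly how I intend to combine these three
measurements to back out the base CPI. This is only a sketch of my
reading of the interval model; the numbers are placeholders rather than
values from my stats files, and I approximate the per-miss penalty by
the full miss latency (the paper measures from the point where the ROB
fills, which is somewhat shorter):

#include <iostream>

int main()
{
    // Placeholder inputs standing in for values read from the m5 stats
    // files -- these are NOT measurements, just round numbers.
    const double totalCycles   = 2.0e9;   // total CPU cycles for the run
    const double instructions  = 1.0e9;   // committed instructions
    const double l2Misses      = 5.0e6;   // system.l2.demand_misses
    const double l2MissLatency = 300.0;   // demand_avg_miss_latency, in cycles
    const double avgMLP        = 1.8;     // average outstanding L2 misses

    // My reading of the interval model: each group of overlapping misses
    // costs roughly one full miss latency, so the total L2+ penalty is
    // (misses / MLP) * latency.
    const double l2PenaltyCycles = (l2Misses / avgMLP) * l2MissLatency;

    const double totalCPI = totalCycles / instructions;
    const double l2CPI    = l2PenaltyCycles / instructions;
    const double baseCPI  = totalCPI - l2CPI;  // CPI with L2+ latency "removed"

    std::cout << "total CPI: " << totalCPI
              << "  L2+ CPI: " << l2CPI
              << "  base CPI: " << baseCPI << std::endl;
    return 0;
}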
I should note that I'm using an older version of M5: revision 6819. I
realize that a newer version is available and that it fixes many bugs,
but the effort of porting my modifications to it is significant and my
time is limited. One bug (or at least, I think it is a bug) seems
relevant here: it apparently made all writebacks from L1 to L2 end up in
the shared state in L2, even with only one CPU. This caused many
unnecessary L2 misses for requests that needed exclusive blocks, since
they all failed the exclusive-permission check, and I believe it might
also affect the first measurement below:
1) To measure MLP, I've created an event that executes every cycle and
counts the number of L2 MSHR targets (whenever there is at least one
target), using the "ntargets" field of the MSHR class. Intuitively, this
should correspond to the number of simultaneously outstanding L2 misses
and hence the level of MLP. However, I've noticed in the source code
that there are actually two kinds of targets: regular targets and
deferred targets.
From what I've gathered, when an MSHR's request completes, the regular
targets are satisfied and then swapped with the deferred targets, after
which the MSHR issues another request for the now-regular targets. This
implies that including the deferred targets in the target count for
measuring MLP is wrong, because they are serviced serially after the
regular ones rather than simultaneously with them.
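
For reference, this is the essence of what my per-cycle event computes,
written as a standalone sketch rather than the actual M5 code; the MSHR
is reduced to just its two target counts, and deferred targets are
excluded based on the reasoning above (which is exactly the part I'd
like confirmed):

#include <cstdint>
#include <iostream>
#include <vector>

// Reduced stand-in for an L2 MSHR: only the two target counts matter
// here. In the real cache they would come from the regular and deferred
// target lists of each allocated MSHR.
struct MshrSample {
    int regularTargets;   // targets serviced by the request currently outstanding
    int deferredTargets;  // targets replayed only after that request completes
};

struct MlpCounter {
    uint64_t cyclesWithMisses = 0;  // cycles with >= 1 outstanding target
    uint64_t outstandingSum   = 0;  // sum of outstanding targets over those cycles

    // Called once per simulated cycle (this is what my per-cycle event does).
    // Deferred targets are excluded, because they are serviced serially after
    // the current request returns and so (I believe) add no parallelism.
    void sample(const std::vector<MshrSample> &mshrs)
    {
        int outstanding = 0;
        for (const auto &m : mshrs)
            outstanding += m.regularTargets;  // not m.deferredTargets

        if (outstanding > 0) {
            ++cyclesWithMisses;
            outstandingSum += outstanding;
        }
    }

    double avgMLP() const
    {
        return cyclesWithMisses
            ? double(outstandingSum) / double(cyclesWithMisses) : 0.0;
    }
};

int main()
{
    MlpCounter mlp;
    // Two hypothetical cycles' worth of allocated MSHRs.
    mlp.sample({{2, 1}, {1, 0}});  // 3 outstanding regular targets
    mlp.sample({{3, 2}});          // 3 outstanding regular targets
    std::cout << "avg MLP: " << mlp.avgMLP() << std::endl;  // prints 3
    return 0;
}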
The logic for dealing with deferred targets is quite complex and I
haven't yet been able to grasp it fully, but I do understand that it is
closely related to the handling of exclusive requests, so the bug I
mentioned earlier might be involved here. Can anyone advise whether
deferred targets should be included or excluded when measuring MLP?
2) To measure the latency of an L2 miss, I figured I could simply use
the system.l2.demand_avg_miss_latency stat, but I wanted to verify what
exactly is being measured here and whether the degree of MLP is already
factored in. From what I can see in the source code, this stat records,
for each request separately, the latency between the moment the request
was entered into its MSHR target and the moment the data becomes
available from the bus (either the first word or the entire packet). In
other words, it measures the latency for an individual request to miss
in L2, go to memory and come back to L2, and therefore does not account
for MLP, right?
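
To illustrate why I believe MLP is not folded into this stat (assuming
my reading of the source is right): if two misses overlap almost
completely, each still records its full individual latency, so the
average does not shrink. With made-up numbers:

#include <iostream>

int main()
{
    // Hypothetical timeline in cycles, assuming my reading of the stat:
    // each miss records (finish - start) for itself, regardless of overlap.
    const double startA = 100, finishA = 400;  // miss A: 300 cycles
    const double startB = 110, finishB = 410;  // miss B: 300 cycles, overlaps A

    const double avgMissLatency = ((finishA - startA) + (finishB - startB)) / 2.0;

    // avgMissLatency is still 300 even though the two misses overlap
    // (MLP ~ 2), so the division by MLP has to be applied separately
    // when plugging this stat into the interval model.
    std::cout << "avg miss latency: " << avgMissLatency << std::endl;
    return 0;
}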
Would it then be correct to say that the L1 demand_avg_miss_latency stat
averages the latencies of requests that go on to miss in L2 with those
of requests that hit in L2? I ask because I've noticed that the L1
demand_avg_miss_latency is sometimes lower than L2's, and I just wanted
to verify that this is indeed caused by L1-miss/L2-hit requests. How
would I then go about measuring the latency from L1 onwards, but only
for those requests that go on to miss in L2? I could add twice the L1
hit latency to L2's demand_avg_miss_latency, but that ignores the time
spent on the L1/L2 bus.
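
Just to check my reasoning with some made-up numbers (placeholders, not
measurements): because the L1 stat averages over both populations, it
can easily end up below the L2 stat, and my proposed reconstruction
would look like this:

#include <iostream>

int main()
{
    // Placeholder numbers purely to illustrate the averaging effect.
    const double l2HitLatency   = 20;   // latency of an L1 miss that hits in L2
    const double l2MissLatency  = 300;  // system.l2.demand_avg_miss_latency
    const double fracL2Miss     = 0.2;  // fraction of L1 misses that also miss in L2
    const double l1HitLatency   = 2;
    const double l1L2BusLatency = 0;    // unknown to me; hence my question

    // L1 demand_avg_miss_latency mixes both populations, so it can sit
    // well below the L2 miss latency.
    const double l1AvgMissLatency =
        (1.0 - fracL2Miss) * l2HitLatency + fracL2Miss * l2MissLatency;  // = 76

    // My proposed reconstruction of the full L1-to-memory latency for
    // requests that miss in L2 (ignores the L1/L2 bus time).
    const double fullMissLatency = l2MissLatency + 2 * l1HitLatency + l1L2BusLatency;

    std::cout << "L1 avg miss latency: " << l1AvgMissLatency
              << "  full L2-miss path: " << fullMissLatency << std::endl;
    return 0;
}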
3) Finally, I am somewhat concerned about whether I am correctly
counting the number of long-latency loads (i.e. those that will stall
the CPU if they take too long). Right now I am counting only ReadExReq,
ReadReq and WriteReq L2 misses as long-latency loads; exactly the set
counted by system.l2.demand_misses (and whose average latency is hence
given by demand_avg_miss_latency). I'm only interested in requests that
will cause the CPU's instruction window to block while it awaits the
response, and I think this should about cover it, but I'm not sure. Can
anyone confirm that these are indeed all the memory packet types that
will cause a CPU penalty upon missing in L2?
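
For completeness, my current classification amounts to the following;
the commands here are stand-ins for the MemCmd values I filter on in my
tree, and whether this set is complete is exactly what I'd like to
confirm:

#include <iostream>

// Stand-in for the packet commands I'm filtering on; in the simulator
// these would be MemCmd values, reduced here to a plain enum.
enum class Cmd { ReadReq, ReadExReq, WriteReq, Writeback, UpgradeReq, Prefetch };

// An L2 miss is counted as a potential long-latency load only for the
// demand commands also counted by system.l2.demand_misses (my current
// assumption).
bool countsAsLongLatencyLoad(Cmd cmd)
{
    return cmd == Cmd::ReadReq || cmd == Cmd::ReadExReq || cmd == Cmd::WriteReq;
}

int main()
{
    std::cout << countsAsLongLatencyLoad(Cmd::ReadReq)    // 1
              << countsAsLongLatencyLoad(Cmd::Writeback)  // 0
              << std::endl;
    return 0;
}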
Many thanks in advance!
Best regards,
-- Jeroen DR