Hi,

I'm currently working on an analysis of L2 NUCA caches, and as part of this I'm looking to produce a CPI stack detailing the relative importance of each miss latency component at and beyond the L2 level. To do this, I first want to obtain the base CPI from L2's perspective, i.e. the CPI that would be obtained if all latencies at and beyond the L2 level were zero. This yields the number of cycles per instruction the CPU spends on work unrelated to L2+ activity, and hence the cycles that will remain no matter what happens at the L2 level and below.
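Concretely, the decomposition I have in mind is (in my own notation):

    CPI_total = CPI_base + CPI_L2+

where CPI_L2+ is the contribution of all miss latencies at and beyond the L2 level, and CPI_base is what would remain if those latencies were zero.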

I can measure this by setting all L2+ latencies to zero (or as close to zero as possible); however, I would also like to be able to model this base CPI so that it can be computed from an execution profile with the L2+ latencies intact. To do this, I'm using the interval analysis model as outlined by Eyerman, S.; Eeckhout, L.; Karkhanis, T. and Smith, J.E. in "A Top-Down Approach to Architecting CPI Component Performance Counters". For non-overlapping long-latency loads, this model predicts that the CPU will incur a cycle penalty given by the time between the moment when the ROB fills up and IPC starts to drop (due to dependencies on the missed load) and the moment when the load data returns. The model also recognizes that multiple outstanding long-latency loads may overlap, in which case the cycle penalty to the CPU is reduced by a factor of approximately the level of MLP achieved across the overlapping misses.
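Written out, my understanding of the model is roughly (again my own notation, so please correct me if I'm misreading it):

    CPI_total ~= CPI_base + (N_L2miss / N_instr) * (penalty_per_miss / MLP)

    penalty_per_miss ~= (cycle the load data returns) - (cycle the ROB fills up)

where MLP is the average number of overlapping outstanding misses, so that the penalty of a burst of overlapping misses is amortized over those misses.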

So to apply this model, I need to measure the level of MLP at the L2 level, the number of L2 misses (i.e. long-latency loads), and the full latency of an L2 miss (i.e. all the way from the CPU to memory and back). I have done this, but I have a few questions about whether I'm measuring these correctly.

I should note that I'm using an older version of M5: revision 6819. I realize that there is a newer version available and that it fixes many bugs, but the effort of porting my modifications to the newer version is significant and my time is limited. There is one bug (or at least, I think it is a bug) that I believe may be relevant here: it apparently makes all writebacks from L1 to L2 end up in the shared state in L2, even when there is only one CPU. This causes many unnecessary L2 misses for requests that need exclusive blocks, since they all fail the exclusive permission check, and I believe it may also affect the first measurement below:

1) To measure MLP, I've created an event that executes every cycle and counts the number of L2 MSHR targets (whenever there is at least one target) using the "ntargets" field of the MSHR class; a simplified sketch of this bookkeeping is included at the end of this question. Intuitively, this count should correspond to the number of simultaneously outstanding L2 misses and hence the level of MLP. However, I've noticed in the source code that there are actually two kinds of targets: regular targets and deferred targets.

From what I've gathered, when an MSHR request completes, the regular targets are satisfied and then replaced by the deferred targets, after which the MSHR issues another request on behalf of the now-regular targets. This suggests that including the deferred targets in the target count when measuring MLP is wrong, because they are processed serially after the regular ones rather than simultaneously.

The logic for dealing with deferred targets is quite complex and I haven't been able to grasp it fully yet, but I do understand that it is closely tied to the handling of exclusive requests, so I think the bug I mentioned earlier might be involved here. Can anyone advise whether deferred targets should be included or excluded when measuring MLP?
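For reference, here is a stripped-down, stand-alone sketch of the bookkeeping I'm doing. This is not the actual M5 code: MiniMSHR and the per-cycle tick() hook are stand-ins for the real MSHR class and my cycle event; only the ntargets/deferred-target distinction corresponds to what I actually read from the simulator.

    // Stand-alone sketch of my MLP bookkeeping (MiniMSHR is a mock,
    // not the real M5 MSHR class).
    #include <cstdio>
    #include <vector>

    struct MiniMSHR {
        int ntargets;   // regular targets (the field I read in the real MSHR)
        int ndeferred;  // deferred targets -- currently NOT counted, see above
    };

    struct MLPCounter {
        long long targetCycleSum = 0;  // outstanding targets summed over miss cycles
        long long missCycles     = 0;  // cycles with at least one outstanding target

        // called once per simulated cycle with the currently allocated L2 MSHRs
        void tick(const std::vector<MiniMSHR> &allocated) {
            int outstanding = 0;
            for (const MiniMSHR &m : allocated)
                outstanding += m.ntargets;  // should m.ndeferred be added too?
            if (outstanding > 0) {
                ++missCycles;
                targetCycleSum += outstanding;
            }
        }

        // average number of outstanding misses over cycles with >= 1 miss
        double mlp() const {
            return missCycles ? (double)targetCycleSum / missCycles : 0.0;
        }
    };

    int main() {
        MLPCounter c;
        c.tick({{2, 0}, {1, 1}});  // cycle 1: 3 regular targets outstanding
        c.tick({{1, 0}});          // cycle 2: 1 regular target outstanding
        std::printf("MLP = %.2f\n", c.mlp());  // prints 2.00
        return 0;
    }

The question above boils down to whether m.ndeferred should be added to "outstanding" as well.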

2) To measure the latency of an L2 miss, I figured I could just use the system.l2.demand_avg_miss_latency stat, but I wanted to verify what exactly is being measured there and whether the degree of MLP is already factored in. From what I can see in the source code, this stat records, for each request separately, the latency between the moment the request was entered into its MSHR target and the moment the data becomes available from the bus (either the first word or the entire packet). In other words, it measures the latency for an individual request to miss in L2, go to memory and come back to L2, and hence does not account for MLP, right?

So then would it be correct to say that the L1 demand_avg_miss_latency stat averages over both requests that go on to miss in L2 and requests that hit in L2? I ask because I've noticed that the L1 demand_avg_miss_latency is sometimes lower than L2's, and I just wanted to verify that this is indeed caused by L1-miss/L2-hit requests. How would I then go about measuring the latency from L1 onwards for only those requests that go on to miss in L2? I could add twice the L1 hit latency to L2's demand_avg_miss_latency, but that ignores the time spent on the L1/L2 bus.
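To make the question concrete, my understanding is that each cache's demand_avg_miss_latency is essentially

    demand_avg_miss_latency =
        sum over demand misses of
            (time the data becomes available - time the request entered its MSHR target)
        / demand_misses

and what I would like to obtain for requests that also miss in L2 is something like

    latency_L1_to_memory ~= l2.demand_avg_miss_latency
                            + 2 * (L1 hit latency)
                            + 2 * (L1<->L2 bus latency)

where the bus term is the part I don't currently know how to get from the existing stats. Please correct me if either of these is off.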

3) Finally, I am somewhat concerned about whether I am correctly counting the number of long-latency loads (i.e. those that will stall the CPU if they take too long). Right now I count only ReadExReq, ReadReq and WriteReq L2 misses as long-latency loads; exactly the count measured by system.l2.demand_misses (whose average latency is hence given by demand_avg_miss_latency). I'm only interested in requests that will cause the CPU's instruction window to block while awaiting the response, and I think the above should cover it, but I'm not sure. Can anyone confirm whether these are indeed all the memory packet types that incur a CPU penalty upon missing in L2?
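For concreteness, the classification I'm applying boils down to the following (Cmd here is just a mock of the relevant command values, not the real M5 MemCmd type):

    // Mock of the command check I use when classifying L2 misses as
    // long-latency loads; Cmd is a stand-in, not the real MemCmd class.
    enum class Cmd { ReadReq, ReadExReq, WriteReq, OtherCmd /* everything else */ };

    // true for the request types I currently count -- as far as I can tell,
    // the same set that system.l2.demand_misses counts
    bool countsAsLongLatencyLoad(Cmd c)
    {
        return c == Cmd::ReadReq || c == Cmd::ReadExReq || c == Cmd::WriteReq;
    }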

Many thanks in advance!

Best regards,
-- Jeroen DR