Hi,
I'm currently working on an analysis of L2 NUCA caches, and as part of
this I'm looking to produce a CPI stack detailing the relative
importance of each miss latency component at and beyond the L2 level. To
do this, I'm first looking to obtain the base CPI from L2's perspective,
i.e. the CPI obtained if all latencies at and beyond the L2 level were
to be zero. This will yield the number of cycles per instruction the CPU
spends on work unrelated to L2+ activity, and hence the cycles per
instruction that will remain no matter what happens at L2 and below.
I can measure this by setting all L2+ latencies to zero (or as close to
zero as possible); however, I would also like to be able to model this
base CPI so that it can be computed from an execution profile with the
L2+ latencies intact. To do this, I'm using the interval analysis model
as outlined by Eyerman, S.; Eeckhout, L.; Karkhanis, T. and Smith, J.E.
in "A Top-Down Approach to Architecting CPI Component Performance
Counters". For non-overlapping long latency loads, this model predicts
that the CPU will experience a cycle penalty given the amount of time
between the moment where the ROB fills up and IPC starts to drop (due to
dependencies on the missed load) and the moment where the load returns.
This model also recognizes that multiple outstanding long-latency loads
may overlap, so that the cycle penalty to the CPU is reduced by
approximately the level of MLP achieved during the overlapping misses.
So to apply this model, I will need to measure the amount of MLP at the
L2 level, the number of L2 misses (i.e. long-latency loads) and the full
latency of an L2 miss (i.e. all the way from the CPU to memory and
back). I have done this, but I have a couple of questions about whether
I'm measuring these quantities correctly.
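
For concreteness, here is roughly how I intend to combine these three
measurements to back out the base CPI. This is only a sketch of my
reading of the interval model; the numbers are placeholders rather than
values from my stats files, and I approximate the per-miss penalty by
the full miss latency (the paper measures from the point where the ROB
fills, which is somewhat shorter):

#include <iostream>

int main()
{
    // Placeholder inputs standing in for values read from the m5 stats
    // files -- these are NOT measurements, just round numbers.
    const double totalCycles   = 2.0e9;   // total CPU cycles for the run
    const double instructions  = 1.0e9;   // committed instructions
    const double l2Misses      = 5.0e6;   // system.l2.demand_misses
    const double l2MissLatency = 300.0;   // demand_avg_miss_latency, in cycles
    const double avgMLP        = 1.8;     // average outstanding L2 misses

    // My reading of the interval model: each group of overlapping misses
    // costs roughly one full miss latency, so the total L2+ penalty is
    // (misses / MLP) * latency.
    const double l2PenaltyCycles = (l2Misses / avgMLP) * l2MissLatency;

    const double totalCPI = totalCycles / instructions;
    const double l2CPI    = l2PenaltyCycles / instructions;
    const double baseCPI  = totalCPI - l2CPI;  // CPI with L2+ latency "removed"

    std::cout << "total CPI: " << totalCPI
              << "  L2+ CPI: " << l2CPI
              << "  base CPI: " << baseCPI << std::endl;
    return 0;
}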
I should note that I'm using an older version of M5: revision 6819. I
realize that a newer version is available and that it fixes many bugs,
but the effort of porting my modifications to it is significant and my
time is limited. One bug (or at least, I think it is a bug) seems
relevant here: it apparently made all writebacks from L1 to L2 end up in
the shared state in L2, even with only one CPU. This caused many
unnecessary L2 misses for requests that needed exclusive blocks, since
they all failed the exclusive-permission check, and I believe it might
also affect the first measurement below:
1) To measure MLP, I've created an event that executes every cycle and
counts the number of L2 MSHR targets (whenever there is at least one
target), using the "ntargets" field of the MSHR class. Intuitively, this
should correspond to the number of simultaneously outstanding L2 misses
and hence the level of MLP. However, I've noticed in the source code
that there are actually two kinds of targets: regular targets and
deferred targets.
From what I've gathered, when an MSHR's request completes, the regular
targets are satisfied and then swapped with the deferred targets, after
which the MSHR issues another request for the now-regular targets. This
implies that including the deferred targets in the target count for
measuring MLP is wrong, because they are serviced serially after the
regular ones rather than simultaneously with them.
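
For reference, this is the essence of what my per-cycle event computes,
written as a standalone sketch rather than the actual M5 code; the MSHR
is reduced to just its two target counts, and deferred targets are
excluded based on the reasoning above (which is exactly the part I'd
like confirmed):

#include <cstdint>
#include <iostream>
#include <vector>

// Reduced stand-in for an L2 MSHR: only the two target counts matter
// here. In the real cache they would come from the regular and deferred
// target lists of each allocated MSHR.
struct MshrSample {
    int regularTargets;   // targets serviced by the request currently outstanding
    int deferredTargets;  // targets replayed only after that request completes
};

struct MlpCounter {
    uint64_t cyclesWithMisses = 0;  // cycles with >= 1 outstanding target
    uint64_t outstandingSum   = 0;  // sum of outstanding targets over those cycles

    // Called once per simulated cycle (this is what my per-cycle event does).
    // Deferred targets are excluded, because they are serviced serially after
    // the current request returns and so (I believe) add no parallelism.
    void sample(const std::vector<MshrSample> &mshrs)
    {
        int outstanding = 0;
        for (const auto &m : mshrs)
            outstanding += m.regularTargets;  // not m.deferredTargets

        if (outstanding > 0) {
            ++cyclesWithMisses;
            outstandingSum += outstanding;
        }
    }

    double avgMLP() const
    {
        return cyclesWithMisses
            ? double(outstandingSum) / double(cyclesWithMisses) : 0.0;
    }
};

int main()
{
    MlpCounter mlp;
    // Two hypothetical cycles' worth of allocated MSHRs.
    mlp.sample({{2, 1}, {1, 0}});  // 3 outstanding regular targets
    mlp.sample({{3, 2}});          // 3 outstanding regular targets
    std::cout << "avg MLP: " << mlp.avgMLP() << std::endl;  // prints 3
    return 0;
}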
The logic for dealing with deferred targets is quite complex and I
haven't yet been able to grasp it fully, but I do understand that it is
closely related to the handling of exclusive requests, so the bug I
mentioned earlier might be involved here. Can anyone advise whether
deferred targets should be included or excluded when measuring MLP?
2) To measure the latency of an L2 miss, I figured I could simply use
the system.l2.demand_avg_miss_latency stat, but I wanted to verify what
exactly is being measured here and whether the degree of MLP is already
factored in. From what I can see in the source code, this stat records,
for each request separately, the latency between the moment the request
was entered into its MSHR target and the moment the data becomes
available from the bus (either the first word or the entire packet). In
other words, it measures the latency for an individual request to miss
in L2, go to memory and come back to L2, and therefore does not account
for MLP, right?
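
To illustrate why I believe MLP is not folded into this stat (assuming
my reading of the source is right): if two misses overlap almost
completely, each still records its full individual latency, so the
average does not shrink. With made-up numbers:

#include <iostream>

int main()
{
    // Hypothetical timeline in cycles, assuming my reading of the stat:
    // each miss records (finish - start) for itself, regardless of overlap.
    const double startA = 100, finishA = 400;  // miss A: 300 cycles
    const double startB = 110, finishB = 410;  // miss B: 300 cycles, overlaps A

    const double avgMissLatency = ((finishA - startA) + (finishB - startB)) / 2.0;

    // avgMissLatency is still 300 even though the two misses overlap
    // (MLP ~ 2), so the division by MLP has to be applied separately
    // when plugging this stat into the interval model.
    std::cout << "avg miss latency: " << avgMissLatency << std::endl;
    return 0;
}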
Would it then be correct to say that the L1 demand_avg_miss_latency stat
averages the latencies of requests that go on to miss in L2 with those
of requests that hit in L2? I ask because I've noticed that the L1
demand_avg_miss_latency is sometimes lower than L2's, and I just wanted
to verify that this is indeed caused by L1-miss/L2-hit requests. How
would I then go about measuring the latency from L1 onwards, but only
for those requests that go on to miss in L2? I could add twice the L1
hit latency to L2's demand_avg_miss_latency, but that ignores the time
spent on the L1/L2 bus.
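
Just to check my reasoning with some made-up numbers (placeholders, not
measurements): because the L1 stat averages over both populations, it
can easily end up below the L2 stat, and my proposed reconstruction
would look like this:

#include <iostream>

int main()
{
    // Placeholder numbers purely to illustrate the averaging effect.
    const double l2HitLatency   = 20;   // latency of an L1 miss that hits in L2
    const double l2MissLatency  = 300;  // system.l2.demand_avg_miss_latency
    const double fracL2Miss     = 0.2;  // fraction of L1 misses that also miss in L2
    const double l1HitLatency   = 2;
    const double l1L2BusLatency = 0;    // unknown to me; hence my question

    // L1 demand_avg_miss_latency mixes both populations, so it can sit
    // well below the L2 miss latency.
    const double l1AvgMissLatency =
        (1.0 - fracL2Miss) * l2HitLatency + fracL2Miss * l2MissLatency;  // = 76

    // My proposed reconstruction of the full L1-to-memory latency for
    // requests that miss in L2 (ignores the L1/L2 bus time).
    const double fullMissLatency = l2MissLatency + 2 * l1HitLatency + l1L2BusLatency;

    std::cout << "L1 avg miss latency: " << l1AvgMissLatency
              << "  full L2-miss path: " << fullMissLatency << std::endl;
    return 0;
}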
3) Finally, I am somewhat concerned about whether I am correctly
counting the number of long-latency loads (i.e. those that will stall
the CPU if they take too long). Right now I am counting only ReadExReq,
ReadReq and WriteReq L2 misses as long-latency loads; exactly the set
counted by system.l2.demand_misses (and whose average latency is hence
given by demand_avg_miss_latency). I'm only interested in requests that
will cause the CPU's instruction window to block while it awaits the
response, and I think this should about cover it, but I'm not sure. Can
anyone confirm that these are indeed all the memory packet types that
will cause a CPU penalty upon missing in L2?
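
For completeness, my current classification amounts to the following;
the commands here are stand-ins for the MemCmd values I filter on in my
tree, and whether this set is complete is exactly what I'd like to
confirm:

#include <iostream>

// Stand-in for the packet commands I'm filtering on; in the simulator
// these would be MemCmd values, reduced here to a plain enum.
enum class Cmd { ReadReq, ReadExReq, WriteReq, Writeback, UpgradeReq, Prefetch };

// An L2 miss is counted as a potential long-latency load only for the
// demand commands also counted by system.l2.demand_misses (my current
// assumption).
bool countsAsLongLatencyLoad(Cmd cmd)
{
    return cmd == Cmd::ReadReq || cmd == Cmd::ReadExReq || cmd == Cmd::WriteReq;
}

int main()
{
    std::cout << countsAsLongLatencyLoad(Cmd::ReadReq)    // 1
              << countsAsLongLatencyLoad(Cmd::Writeback)  // 0
              << std::endl;
    return 0;
}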
Many thanks in advance!
Best regards,
-- Jeroen DR