I am sorry that we hold different opinions. First, please refer to this
paper to understand the simulation methodology I mentioned:
Zhan, D. Locality & Utility Co-optimization for Practical Capacity
Management of Shared Last Level Caches. ICS'12
For your convenience, I have extracted the simulation methodology here:
"In the experiments, all threads under a given workload are executed
starting from a checkpoint that has already had the first 10 billion
instructions bypassed. They are cache-warmed with 1 billion instructions
and then simulated in detail until all threads finish another 1 billion
instructions. Performance statistics are reported for a thread when it
reaches 1 billion instructions. If one thread completes the 1 billion
instructions before others, it continues to run so as to still compete
for the SLLC capacity, but its extra instructions are not taken into
account in the final performance report. This is in conformation with
the standard practice in CMP cache research"
For the 1st question, I do not insist on exactly N3 instructions at all.
In fact, it is not possible to count exact instruction numbers. But to
stay consistent with the above simulation methodology, I have to make
each core execute at least N3 instructions. I reviewed the current
implementation of the option '-I' in configs/common/Simulation.py and
src/cpu/base.cc. It just passes the '-I' value to
cpu[i].MAX_INSTS_ANY_THREAD, which only guarantees that the simulation
exits when one core commits N3 instructions, no matter how many
instructions the other cores have retired.
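For reference, here is a minimal sketch of the per-core alternative in
Simulation.py. The names 'testsys', 'options.maxinsts', and 'np' follow
the conventions of that script, and max_insts_any_thread /
max_insts_all_threads are the Python-side spellings of the parameters
above; treat this as an illustration, not a patch:

```python
# Sketch only: how the '-I' value could be applied per CPU in
# configs/common/Simulation.py. Adjust names to your gem5 version.
if options.maxinsts:
    for i in range(np):
        # Default behaviour: exit as soon as ANY core commits N3 insts.
        # testsys.cpu[i].max_insts_any_thread = options.maxinsts

        # Per-core limit instead: each CPU raises its own exit event
        # once its threads together commit N3 instructions, so every
        # core is guaranteed to reach at least N3.
        testsys.cpu[i].max_insts_all_threads = options.maxinsts
```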
For the 2nd question, considering that some programs may finish N3
instructions before the others, if we run only N3 instructions per
program and report stats after those N3 instructions, I don't think the
stats can mirror the real impact of shared-resource contention, since
during some phases no contention exists at all. On the other hand, by
forcing every program to run N3 * 2 instructions, we have the chance to
report stats after the first N3 instructions, so the stats can reflect
the impact of shared-resource contention.
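A rough sketch of how this can be driven from the run script, assuming
one program per core. m5.simulate() and m5.stats.dump() are the standard
gem5 Python calls, but the handling of the exit event and the raising of
each core's limit to N3 * 2 are simplified here:

```python
# Sketch: each CPU is configured with max_insts_all_threads = N3 (see
# Simulation.py), so every exit from m5.simulate() means some core has
# just committed its N3-th instruction. Dump stats at that point, so
# that dump k captures the system state when the k-th core reached N3.
finished = 0
while finished < num_cpus:
    event = m5.simulate()
    m5.stats.dump()
    finished += 1
    # The finished core keeps running (up to N3 * 2 via a raised
    # instruction limit) so it still competes for the SLLC, but its
    # extra instructions are excluded from the reported numbers.
```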
Of course, I think the above simulation methodology still has pitfalls.
For a program with a short lifetime, even after executing N3 * 2
instructions, we still cannot guarantee that it will contend for shared
resources with the other programs.
I have implemented this methodology in gem5. For an M-program
multiprogrammed workload, it dumps (M + 1) sets of stats as expected.
But I do not yet know the dump order, so I cannot map each dump to its
core. E.g., if the dump order is c1->c2->c0->c3, then the 1st dump gives
the stats of interest for c1, the 2nd those for c2, and so on. Note that
the stats dump order mirrors the order in which the programs finish
their N3 instructions.
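One way to recover that order afterwards is post-processing: in each
successive dump, the core that newly crossed N3 committed instructions
is the one that triggered it. A small self-contained sketch (the dump
data below is synthetic; in practice you would parse the committedInsts
lines of each stats section in m5out/stats.txt):

```python
def order_of_completion(dumps, n3):
    """Given per-dump committed-instruction counts {core: insts},
    return the cores in the order they first reached n3 instructions.
    Assumes one new core crosses n3 per dump (ties broken by name)."""
    order = []
    seen = set()
    for dump in dumps:
        for core, insts in sorted(dump.items()):
            if insts >= n3 and core not in seen:
                seen.add(core)
                order.append(core)
    return order

# Synthetic example with N3 = 100: c1 finishes first, then c2, c0, c3.
dumps = [
    {"c0": 60,  "c1": 100, "c2": 90,  "c3": 40},
    {"c0": 80,  "c1": 130, "c2": 101, "c3": 55},
    {"c0": 100, "c1": 160, "c2": 120, "c3": 70},
    {"c0": 120, "c1": 190, "c2": 140, "c3": 100},
]
print(order_of_completion(dumps, 100))  # -> ['c1', 'c2', 'c0', 'c3']
```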
Thanks,
Hanfeng
On 12/14/2012 06:52 PM, Nilay Vaish wrote:
On Fri, 14 Dec 2012, hanfeng QIN wrote:
I know the options '-F' and '-W'. Actually, I use them together with
the '-I' option to specify the number of detailed instructions (denoted
N3 in my previous mail). The default implementation in
configs/common/Simulation.py passes N3 to cpu[i].MAX_INSTS_ANY_THREAD,
so when any program finishes N3 instructions, the whole simulation
exits. In this case I modified the default implementation by passing N3
to cpu[i].MAX_INSTS_ALL_THREADS, which forces each program to commit at
least N3 instructions. The total number of instructions simulated is
then N3 * Nr_cores. But this approach has a pitfall compared with the
methodology I referred to: for a multiprogrammed workload, once some
program finishes its N3 instructions, the corresponding core has no
task to schedule (I assume the number of programs is no more than the
number of simulated cores). Thus, it may not be reasonable to evaluate
its impact on shared-resource contention from the final statistics
report.
I don't understand why you are so insistent that every core has to
execute exactly N3 number of instructions. A much more realistic
experiment would be one where each core has executed at least N3
instructions. If you understand how the option -I has been
implemented, it should be straightforward for you to modify gem5 and
dump stats when all cores have executed at least N3 instructions.
Based on this, I have an idea for reporting statistics more reasonably.
Can we run a detailed simulation of N3 * 2 instructions for each
program (so the total number of instructions simulated will be
(N3 * 2) * Nr_cores) but dump the stats after only the first N3
instructions? I am not clear on the stats-dump internals, though.
I don't see why executing twice the number of instructions would make
any difference. Depending on the latencies in the system, the ratio of
the IPCs of two cores can be very low/high.
I would rather suggest that you think more about the experiment you
are proposing. Why is it essential that each core has executed exactly
N3 instructions? Is this experiment realistic?
--
Nilay
_______________________________________________
gem5-users mailing list
[email protected]
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users