I am sorry that we have differing opinions. I think you should first refer to this paper to understand the simulation methodology I mentioned.

Zhan, D. Locality & Utility Co-optimization for Practical Capacity Management of Shared Last Level Caches. ICS'12

For your convenience, I have extracted the simulation methodology here.

"In the experiments, all threads under a given workload are executed starting from a checkpoint that has already had the first 10 billion instructions bypassed. They are cache-warmed with 1 billion instructions and then simulated in detail until all threads finish another 1 billion instructions. Performance statistics are reported for a thread when it reaches 1 billion instructions. If one thread completes the 1 billion instructions before others, it continues to run so as to still compete for the SLLC capacity, but its extra instructions are not taken into account in the final performance report. This is in conformation with the standard practice in CMP cache research"

For the 1st question, I do not insist on exactly N3 instructions at all. In fact, it is not possible to count exact instruction numbers. But to stay consistent with the above simulation methodology, I have to enforce that each core executes at least N3 instructions. I reviewed the current implementation of the option '-I' in configs/common/Simulation.py and src/cpu/base.cc. It just passes the '-I' value to cpu[i].MAX_INSTS_ANY_THREAD. In this case it only guarantees that the simulation exits once one core commits N3 instructions, no matter how many instructions have retired from the other cores.
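To make the difference concrete, here is a toy standalone model (not gem5 code; the rates and functions are purely illustrative) contrasting "exit when any core reaches N3" with "run until every core has reached N3":

```python
# Toy model: each core retires instructions at a fixed rate (insts per
# tick). Compare when the simulation stops under the two policies.

def exit_tick_any_thread(rates, n3):
    """Exit as soon as ANY core commits n3 instructions (the '-I' behavior)."""
    return min(n3 / r for r in rates)

def exit_tick_all_cores(rates, n3):
    """Exit only once EVERY core has committed at least n3 instructions."""
    return max(n3 / r for r in rates)

rates = [2.0, 1.0, 0.5]          # hypothetical per-core retire rates
n3 = 1_000_000_000

t_any = exit_tick_any_thread(rates, n3)
t_all = exit_tick_all_cores(rates, n3)
# How far the slowest core has gotten when the '-I' exit fires:
slowest_at_exit = rates[-1] * t_any
print(t_any, t_all, slowest_at_exit)
```

With these made-up rates, the slowest core has retired only a quarter of N3 when the '-I' exit fires, which is exactly the problem described above.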

For the 2nd question, consider that some programs may finish their N3 instructions before the others. If we run only N3 instructions per program in total and report stats after those N3 instructions, I don't think the stats can mirror the real impact of shared-resource contention, since during some phases no contention exists at all. On the other hand, by enforcing that every program runs N3 * 2 instructions, we have the chance to report stats after the first N3 instructions, so the stats can reflect the impact of shared-resource contention.
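The proposed methodology can be sketched as a toy loop (assumptions: fixed per-core retire rates, and "stats" reduced to a bare instruction counter; none of this is gem5 code): each core runs 2*N3 instructions, but its reported stats are snapshotted the moment it crosses N3, while the other cores are still running and competing.

```python
# Toy sketch of "run 2*n3 per core, snapshot each core's stats at n3".

def simulate(rates, n3):
    counts = [0.0] * len(rates)
    snapshots = {}                       # core id -> insts at snapshot time
    while any(c < 2 * n3 for c in counts):
        for i, r in enumerate(rates):
            if counts[i] < 2 * n3:
                counts[i] += r           # core keeps retiring until 2*n3
            if i not in snapshots and counts[i] >= n3:
                snapshots[i] = counts[i] # "dump" this core's stats at n3
    return snapshots, counts

snaps, final = simulate([2.0, 1.0, 0.5], 1000)
print(snaps, final)
```

Note the toy model also exposes the pitfall mentioned below: with these rates, the fastest core has already finished its 2*n3 budget before the slowest core's snapshot is taken, so even the doubled budget does not guarantee contention for the whole measured window.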

Of course, I think the above simulation methodology still has pitfalls. For a program with a short lifetime, even if it executes N3 * 2 instructions, we still cannot guarantee that it will contend for shared resources with the other programs.

I implemented this methodology in gem5. For an M-program multiprogrammed workload, it dumps (M + 1) stats as expected. But I have not yet worked out the dump order needed to extract per-core information. E.g., if the dump order is c1->c2->c0->c3, then we get the stats related to c1 in the 1st dump, those related to c2 in the 2nd dump, and so on. Note that the stats dump order mirrors the order in which the programs finish their N3 instructions.
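One way to recover that mapping in post-processing, rather than assuming it, is to read the per-core committed-instruction counters out of each dump: the core a dump "belongs to" is the one that newly crossed N3 since the previous dump. A sketch (the dump structure here is hypothetical; a real script would first parse stats.txt into such per-core counters):

```python
# Each dump is modeled as {core id: committed insts at dump time}.
# The k-th dump belongs to the core that newly crossed n3 at that dump.

def dump_order(dumps, n3):
    order, crossed = [], set()
    for dump in dumps:
        now = {c for c, insts in dump.items() if insts >= n3}
        new = now - crossed              # core(s) that crossed since last dump
        if len(new) == 1:
            order.append(new.pop())
        crossed = now
    return order

# Example matching the c1->c2->c0->c3 scenario above (made-up numbers):
dumps = [
    {"c0": 600,  "c1": 1000, "c2": 900,  "c3": 300},
    {"c0": 750,  "c1": 1200, "c2": 1000, "c3": 400},
    {"c0": 1000, "c1": 1500, "c2": 1300, "c3": 550},
    {"c0": 1300, "c1": 1900, "c2": 1700, "c3": 1000},
]
print(dump_order(dumps, 1000))
```

This recovers c1, c2, c0, c3 for the example data, so the dump-to-core association no longer depends on knowing the finish order in advance.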


Thanks,
Hanfeng


On 12/14/2012 06:52 PM, Nilay Vaish wrote:
On Fri, 14 Dec 2012, hanfeng QIN wrote:

I know the options '-F' and '-W'. Actually, I use them together with the '-I' option to specify the detailed instruction count (denoted N3 in my previous mail). The default implementation in configs/common/Simulation.py passes N3 to cpu[i].MAX_INSTS_ANY_THREAD, so the whole simulation exits as soon as any program finishes N3 instructions. Therefore I modified this default implementation to pass N3 to cpu[i].MAX_INSTS_ALL_THREADS, which forces each program to commit at least N3 instructions; the total number of instructions simulated is then N3 * Nr_cores. But this approach has a pitfall compared with the methodology I referred to. For a multiprogrammed workload, once some program finishes its N3 instructions, the corresponding core has no task to schedule (I assume the number of programs is no more than the number of simulated cores). Thus it may be unreasonable to evaluate its impact on shared-resource contention from the final statistics report.
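The modification described above amounts to something like the following config fragment (a sketch only: `testsys`, `np`, and `options.maxinsts` follow the names conventionally used in Simulation.py of that era, but the exact surrounding code varies by gem5 revision):

```python
# Sketch of the change described above, inside configs/common/Simulation.py.
# Parameter names are the lowercase BaseCPU params behind the uppercase
# names used in this thread.
for i in range(np):
    # Default: exit once ANY core commits maxinsts instructions.
    # testsys.cpu[i].max_insts_any_thread = options.maxinsts
    # Modified: each core must commit at least maxinsts instructions.
    testsys.cpu[i].max_insts_all_threads = options.maxinsts
```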

I don't understand why you are so insistent that every core has to execute exactly N3 number of instructions. A much more realistic experiment would be one where each core has executed at least N3 instructions. If you understand how the option -I has been implemented, it should be straight forward for you to modify gem5 and dump stats when all cores have executed at least N3 instructions.


Based on this, I have an idea to report statistics more reasonably. Can we carry out detailed simulation of N3 * 2 instructions for each program (so the total number of instructions simulated is (N3 * 2) * Nr_cores) but only dump the stats after the first N3 instructions? However, I am not clear on the stats-dump internals.


I don't see why executing twice the number of instructions would make any difference. Depending on the latencies in the system, the ratio of the IPCs of two cores can be very low/high.

I would rather suggest that you think more about the experiment you are proposing. Why is it essential that each core has executed exactly N3 instructions? Is this experiment realistic?

--
Nilay

_______________________________________________
gem5-users mailing list
[email protected]
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users