Korey,

Thank you for answering my e-mail.
Today I looked a little further into my problem by generating a trace for the "hello world" benchmark that is included in M5.
First I'll try to explain what my problem is.

After modifying the code to implement more stages, I noticed that some benchmarks needed 20x more host time: my computer took 20x longer to simulate a benchmark with 8 stages than with 5 stages. I understand that the CPI will rise when more stages are added, due to the increased miss penalty. However, as you have confirmed, it is impossible for the CPI to be 20x as high.

I wanted to know where the time was spent, so I compiled m5.prof to get gprof statistics. The most time-consuming method was CacheUnit::removeAddrDependency: of a run that took 236 seconds, removeAddrDependency accounted for 205 seconds! Using some C++ timers I concluded that the find() call takes almost all of removeAddrDependency's time, and I suspected this was due to addrList growing large. Further instrumentation confirmed it: with 5 stages, addrList contains only 1 instruction at a time, but with 8 stages it contains many more.
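To show why that matters, here is a minimal sketch of the pattern, using my own stand-in types (this is not the actual CacheUnit code; I'm assuming addrList[tid] behaves like a plain list that find() scans linearly):

    #include <algorithm>
    #include <list>
    #include <stdint.h>

    typedef uint64_t Addr;

    std::list<Addr> addrList;   // stand-in for addrList[tid]

    // Each call scans the whole list: O(n) per removal. If squashed
    // entries are never taken out again, n keeps growing and the total
    // cost of all removals approaches O(n^2) -- which matches what
    // gprof is showing.
    void removeAddrDependency(Addr addr)
    {
        std::list<Addr>::iterator it =
            std::find(addrList.begin(), addrList.end(), addr);
        if (it != addrList.end())
            addrList.erase(it);
    }

With 5 stages the list never holds more than one entry, so the scan is trivial; with 8 stages every later call has to walk past the stale addresses that were left behind.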

When simulating the hello benchmark we don't see much of a time difference because of its small size, but we can still see that addrList holds more instructions with 8 stages (it reaches a maximum of 580 instructions!). If you generate a trace with --trace-flags="InOrderCPUAll,AddrDep" you will see that CacheUnit::setAddrDependency is called more often than CacheUnit::removeAddrDependency. Take, for example, instruction 0x12000067c: it is added to addrList, a squash occurs a little later, and the instruction never gets removed! This never happens with 5 pipeline stages.
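In pseudo-code, the sequence I see for that instruction with 8 stages is roughly the following (the squash line is my paraphrase of the trace, not an actual call):

    setAddrDependency(inst);   // 0x12000067c is added to addrList
    // ... the pipeline squashes before the fetch request completes ...
    // removeAddrDependency(inst) is never reached, so the entry stays
    // in addrList until the end of the simulation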
If you want, I can send you the full list of instructions that are never removed.


Max


ps:
Thanks for mentioning that I should modify BackEndStartStage.
I also tried to insert the extra stages the way you did in the 9-stage model; it's a lot easier and cleaner. In fact, you only need to change
    InstStage *X = inst->addStage();
to
    InstStage *X = inst->addStage(5);
in pipeline_traits.cc to get 8 pipeline stages =)
(plus the corresponding adjustments in pipeline_traits.hh; see the sketch below)
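For completeness, the whole change boils down to something like this (the NumStages name is how it appears in my copy of pipeline_traits.hh, so please double-check it against your tree):

    // pipeline_traits.cc: give the execute stage an explicit stage number,
    // the way the 9-stage model does
    InstStage *X = inst->addStage(5);   // was: inst->addStage();

    // pipeline_traits.hh: bump the stage count to match
    const unsigned NumStages = 8;       // was: 5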

On 03/11/2010 08:58 PM, Korey Sewell wrote:
Hi Max,
I'd be happy to help you look into things, but some more specifics and perhaps a faulty instruction trace would help expedite the process.
(Further comments below)


    The extra stages that I want to add shouldn't do anything since
    I'm only
    interested in the change in CPI they cause.
    Did anybody do this before?

There is an example of a 9-stage pipeline model in the inorder tree. It's a bit outdated, as some of the underlying scheduling structures have changed, but adding more pipeline stages is well within the model's capabilities.


    I tried it by modifying cpu/inorder/pipeline_traits.cc and
    cpu/inorder/pipeline_traits.hh (see below).

The code below looks like it should work, save for also updating the "BackEndStartStage" variable appropriately. (Also, we hope to get all the variables in the pipeline_traits file into a parameterizable format so users can do this from the command line.)
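For instance, something along these lines in pipeline_traits.hh -- treat the number as a placeholder, since the right value depends on where your execute and memory stages end up:

    // if 3 stages are inserted in front of execute, the back end of the
    // pipeline starts 3 stages later as well
    const unsigned BackEndStartStage = 5;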

    The result of my execution is correct, but it takes 20 times as
    much time!!!

20x in real time or 20x in simulated ticks?

How much degradation do you expect? Consider that you've added 3 extra stages before the branch is resolved in the execution unit, so your mispredict penalty is now longer. Depending on your benchmark, this could make a difference, but I agree a 20x degradation seems extreme.
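As a quick back-of-the-envelope check (purely made-up numbers): say the 5-stage model runs at a CPI of 1.5 and one instruction in ten is a mispredicted branch. Three extra stages before resolution add roughly 3 cycles per mispredict, i.e. about 0.1 * 3 = 0.3 CPI, which takes you to ~1.8 -- on the order of a 20% slowdown in simulated time, nowhere near 20x. So the interesting question is whether the 20x shows up in ticks at all, or only in host time.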

I also know there is a code optimization yet to be made that should put the CPU to sleep earlier when it's waiting on a long-latency event like a cache miss or an FP op. That wouldn't affect your simulated ticks, only the real time you spend waiting for the simulation to finish.

    I debugged the code a little bit and found that there is some
    problem in
    the instruction fetch:
    In cpu/inorder/resources/cache_unit.cc addrList accumulates a lot of
    instructions because CacheUnit::setAddrDependency is called more than
    CacheUnit::removeAddrDependency.
    This happens only for the calls to the instruction cache (and thus
    originating from the 0th stage (instruction fetch))!

Is this a direct source of pipeline stalling? I would imagine it shouldn't be, since the instruction cache is only performing reads, so there is no need to block even for the same addresses.

Note in the cache_unit.cc::getSlot() code used to allocate an instruction access to a resource:
"   if (resName == "icache_port" ||
        find(addrList[tid].begin(), addrList[tid].end(), req_addr) ==
        addrList[tid].end()) {
...
"
The code should be saying that for the icache_port we don't care about address dependencies, but for anything else we do.

I'm not sure what the fix would be up front, since I'm also not sure the problem is completely diagnosed. Can you point to a particular trace of code that is executing poorly and causing unnecessary stalls? Maybe you can try this on a simple benchmark to identify problem spots more easily?

--
- Korey


_______________________________________________
m5-users mailing list
[email protected]
http://m5sim.org/cgi-bin/mailman/listinfo/m5-users
