Korey,
Thank you for answering my e-mail.
Today I looked a little further into my problem by generating a trace
for the "hello world" benchmark that is included in M5.
First I'll try to explain what my problem is.
After modifying the code to implement more stages, I noticed that some
benchmarks needed 20x more host time: my machine took 20x longer to
simulate a benchmark with 8 stages than with 5 stages.
I understand that the CPI will rise when more stages are added, due to
the increased miss penalty. However, as you have confirmed, it is
impossible for the CPI to be 20x as high.
I wanted to know where the time was spent, so I compiled m5.prof to get
gprof statistics. These showed that the most time-consuming method was
CacheUnit::removeAddrDependency: in a run that took 236 seconds,
removeAddrDependency accounted for 205 of them!
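(In case you want to reproduce the numbers, the profile comes from
something like the following; the ALPHA_SE build directory and the
se.py invocation are from my setup and may differ in yours:

    scons build/ALPHA_SE/m5.prof
    ./build/ALPHA_SE/m5.prof configs/example/se.py -c <benchmark binary>
    gprof build/ALPHA_SE/m5.prof gmon.out > profile.txt
)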
Using some C++ timers I concluded that the find() call takes almost all
of removeAddrDependency's time. I suspected this was because addrList
grows large, and after instrumenting the code myself this was
confirmed: addrList contains only 1 instruction at a time with 5
stages, but holds many more instructions with 8 stages.
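To make explicit why the list size matters, here is a simplified
stand-in for the removal (my own sketch, not the actual M5 code; I'm
assuming addrList[tid] is a std::list of addresses, as in
cache_unit.hh):

    #include <algorithm>
    #include <list>
    #include <stdint.h>

    typedef uint64_t Addr;   // stand-in for M5's Addr type

    // Erasing one address means walking the whole list with std::find:
    // cheap while it holds ~1 entry (5 stages), expensive once hundreds
    // of stale entries pile up (8 stages).
    void removeAddr(std::list<Addr> &addr_list, Addr addr)
    {
        std::list<Addr>::iterator it =
            std::find(addr_list.begin(), addr_list.end(), addr);
        if (it != addr_list.end())
            addr_list.erase(it);
    }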
When simulating the hello benchmark the time difference is not that
large, because of the benchmark's small size. However, we can still see
that addrList holds more instructions with 8 stages (it reaches a
maximum of 580 instructions!).
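(A one-line print of the list size in the dependency calls makes this
easy to watch, e.g. something along these lines in my local tree, not a
patch:

    // e.g. at the end of CacheUnit::setAddrDependency()
    DPRINTF(AddrDep, "addrList[%i] now holds %i addresses\n",
            tid, addrList[tid].size());
)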
If you generate a trace with --trace-flags="InOrderCPUAll,AddrDep"
you will see that CacheUnit::setAddrDependency is called more often
than CacheUnit::removeAddrDependency.
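For reference, a command along these lines should produce the trace
(the build directory, the --inorder switch, and the hello path are from
my setup and may need adjusting in yours):

    ./build/ALPHA_SE/m5.opt --trace-flags="InOrderCPUAll,AddrDep" \
        --trace-file=addrdep.trace \
        configs/example/se.py --inorder \
        -c tests/test-progs/hello/bin/alpha/linux/hello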
Take, for example, instruction 0x12000067c: it is added to addrList, a
little later a squash occurs, and the instruction never gets removed!
This never happens with 5 pipeline stages.
If you want, I can send you the full list of instructions that are
never removed.
Max
ps:
Thanks for mentioning that I should modify BackEndStartStage.
I also tried inserting the extra stages the way you did it in the
9-stage model; it's a lot easier and cleaner. In fact, you only need to
change
InstStage *X = inst->addStage(); to
InstStage *X = inst->addStage(5);
in pipeline_traits.cc to get 8 pipeline stages =)
(plus the corresponding adjustments in pipeline_traits.hh, roughly as
sketched below)
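A minimal sketch of what I mean for the header (names as I have them in
my copy of pipeline_traits.hh; the exact values depend on where the
extra stages go):

    // cpu/inorder/pipeline_traits.hh, 8-stage configuration (sketch)
    const unsigned NumStages = 8;   // was 5 in the default pipeline
    // ...and BackEndStartStage then has to move back by the three
    // stages added in front of it.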
On 03/11/2010 08:58 PM, Korey Sewell wrote:
Hi Max,
I'd be happy to help you look into things, but some more specifics and
perhaps a faulty instruction trace would help expedite the process.
(Further comments below)
The extra stages that I want to add shouldn't do anything since I'm
only interested in the change in CPI they cause.
Did anybody do this before?
There is an example of a 9-stage pipeline model in the inorder tree.
It's a bit outdated as some of the scheduling structures underneath
changed, but the basic premise for adding more pipeline stages is
within the model's capability.
I tried it by modifying cpu/inorder/pipeline_traits.cc and
cpu/inorder/pipeline_traits.hh (see below).
The below looks like it should work, save for also updating the
"BackEndStartStage" variable appropriately. (Also, we hope to get all
the variables in the pipeline_traits file into a parameterizable
format so users can do this from the command line.)
The result of my execution is correct, but it takes 20 times as
much time!!!
20x in real time or 20x in simulated ticks?
How much degradation do you expect? Consider that you've added 3
extra stages before the branch is resolved in the execution unit, so
now your mispredict penalty is lengthened in the pipeline. Depending
on your benchmark this could make a difference, but I agree a 20x
degradation seems extreme.
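For instance, even if (hypothetically) one in every twenty instructions
were a mispredicted branch, three extra cycles of branch resolution
latency would only add 3/20 = 0.15 to the CPI, nowhere near a 20x
slowdown in simulated time.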
I also know there is a code optimization yet to be made that should put
the CPU to sleep earlier if it's waiting for a long event like a cache
miss or an FP op. But that wouldn't affect your simulated ticks, only
the real time you spend waiting for the simulation to finish.
I debugged the code a little bit and found that there is some problem
in the instruction fetch:
In cpu/inorder/resources/cache_unit.cc, addrList accumulates a lot of
instructions because CacheUnit::setAddrDependency is called more often
than CacheUnit::removeAddrDependency.
This happens only for accesses to the instruction cache (and thus
originating from the 0th stage, instruction fetch)!
Is this a direct source of pipeline stalling? I would imagine it
shouldn't be, since the instruction cache is only performing reads, so
there is no need to block even for the same addresses.
Note in the cache_unit.cc::getSlot() code used to allocate an
instruction access to a resource:
"    if (resName == "icache_port" ||
         find(addrList[tid].begin(), addrList[tid].end(), req_addr) ==
         addrList[tid].end()) {
         ...
"
The code should be saying that for the icache_port we don't care about
the address dependency, but for anything else we do.
I'm not sure what the fix would be up front, since I'm also not sure
the problem is completely diagnosed. Can you point to a particular
trace of code that is executing poorly and causing unnecessary stalls?
Maybe you can try this on a simple benchmark to identify problem spots
more easily?
--
- Korey
_______________________________________________
m5-users mailing list
[email protected]
http://m5sim.org/cgi-bin/mailman/listinfo/m5-users