Hi Max,
I'd be happy to help you look into things, but some more specifics and
perhaps a faulty instruction trace would help expedite the process.
(Further comments below)
> The extra stages that I want to add shouldn't do anything since I'm only
> interested in the change in CPI they cause.
> Did anybody do this before?
>
There is an example of a 9-stage pipeline model in the inorder tree. It's a
bit outdated as some of the scheduling structures underneath changed, but
the basic premise for adding more pipeline stages is within the model's
capability.
I tried it by modifying cpu/inorder/pipeline_traits.cc and
> cpu/inorder/pipeline_traits.hh (see below).
>
The below looks like it should work save for also updating the
"BackEndStartStage" variable appropriately. (Also, we hope to get all the
variables in the pipeline_traits file into a parameterizable format so users
can do this off the command line .)
> The result of my execution is correct but however it takes 20 times as
> much time!!!
>
20x in real time or 20x in simulated ticks?
How much degradation do you expect ? Consider that you've added 3 extra
stages before the branch is resolved in the execution unit, so now you're
mispredict penalty is lengthened in the pipeline. Depending on your
benchmark, this could make a difference but I agree a 20x degradation seems
extreme.
I also know there is a code optimization yet to made that should sleep the
CPU earleir if it's waiting for a long-event like a cache miss or a FP opp.
But that wouldn't effect your simulated ticks but your real time waiting for
the simulation to finish.
> I debugged the code a little bit and found that there is some problem in
> the instruction fetch:
> In cpu/inorder/resources/cache_unit.cc addrList accumulates a lot of
> instructions because CacheUnit::setAddrDependency is called more than
> CacheUnit::removeAddrDependency.
> This happens only for the calls to the instruction cache (and thus
> originating from the 0th stage (instruction fetch))!
>
Is this a direct source of pipeline stalling? I would imagine it shouldnt be
since the instruction cache is only performing reads, so there is no need to
block even for the same addresses.
Note in the cache_unit.cc::getSlot() code used to allocate an instruction
access to a resource:
" if (resName == "icache_port" ||
find(addrList[tid].begin(), addrList[tid].end(), req_addr) ==
addrList[tid].end()) {
...
"
The code should be saying that for the icache_port, we dont care about
address dependency but for anything else we do.
I'm not sure what the fix would be up front, since I'm also not sure the
problem is completely diagnosed. Can you point to a particular trace of code
that is executing poorly and causing unnecessary stalls? Maybe you can try
this on a simple benchmark to identify problem spots easier?
--
- Korey
_______________________________________________
m5-users mailing list
[email protected]
http://m5sim.org/cgi-bin/mailman/listinfo/m5-users