Hello,
I tried to run a very simple program in the InOrder environment.
I extended the pipe to 12 stages, and also moved the prediction and fetch
update to the first stage.
j1: subl $2, 1, $2;
addl $1, 4, $1;
bgt $2, j1;
This loop runs for a million iterations.
The cpi for this part is expected to be 1 or very close to, but it is 4.
Looking in the debug flags I see the following:
Cycle n:
9522195: system.cpu.branch_predictor: [tid:0]: [sn:1002] requesting this
resource.
9522195: system.cpu.branch_predictor: [tid:0] [sn:1002] bge
r18,0x120008220 ... PC (0x120008238=>0x12000823c) doing branch prediction
9522195: system.cpu.branch_predictor: [tid:0]: [asid:0] Instruction
(0x120008238=>0x12000823c) predicted target is (0x120008238=>0x120008220).
9522195: system.cpu.branch_predictor: [tid:0]: [sn:1002]: Setting Predicted
PC to (0x120008238=>0x120008220).
9522195: system.cpu.branch_predictor: [tid:0] [sn:1002] pushed onto front of
predHist ...predHist.size(): 1
9522195: system.cpu.branch_predictor: [tid:0]: [sn:1002]: Branch predicted
true.
9522195: system.cpu.branch_predictor: [tid:0]: [sn:1002]: bge Predicted PC
is (0x120008238=>0x120008220).
9522195: system.cpu.branch_predictor[slot:0]:: done with request from
[sn:1002] [tid:0].
9522195: system.cpu.stage0: [tid:0]: [sn:1002] request to
system.cpu.branch_predictor completed.
9522195: system.cpu.branch_predictor: Deallocating [slot:0].
9522195: system.cpu.stage0: [tid:0]: [sn:1002]: sending request to
system.cpu.fetch_seq_unit.
9522195: system.cpu.fetch_seq_unit: Allocating [slot:1] for [tid:0]:
[sn:1002]
9522195: system.cpu.fetch_seq_unit: [tid:0]: [sn:1002] requesting this
resource.
9522195: system.cpu.fetch_seq_unit: [tid:0] Setting up squash to start from
stage 0, after [sn:0].
9522195: system.cpu.fetch_seq_unit[slot:1]:: done with request from
[sn:1002] [tid:0].
9522195: system.cpu.stage0: [tid:0]: [sn:1002] request to
system.cpu.fetch_seq_unit completed.
9522195: system.cpu.fetch_seq_unit: Deallocating [slot:1].
9522195: system.cpu.stage0: [tid:0]: Attempting to send instructions to
stage 1.
9522195: system.cpu.stage0: [tid:0] 1 slots available in next stage buffer.
9522195: system.cpu.stage0: [tid:0]: [sn:1002]: being placed into index 0
of stage buffer 0 queue.
9522195: system.cpu.stage0: Done Processing [tid:0]
9522195: system.cpu.stage0:
Cycle n+1:
9522612: system.cpu.fetch_seq_unit: Allocating [slot:0] for [tid:0]:
[sn:1003]
9522612: system.cpu.fetch_seq_unit: [tid:0]: instruction requesting this
resource.
9522612: system.cpu.fetch_seq_unit: [tid:0]: Current PC is
(0x120008238=>0x120008220)
9522612: system.cpu.fetch_seq_unit: [tid:0]: Assigning [sn:1003] to PC
(0x120008238=>0x120008220)
9522612: system.cpu.fetch_seq_unit[slot:0]:: done with request from
[sn:1003] [tid:0].
I cleaned a lot of unnecessary messeges for clarity. But the Idea is that
the branch command does not update the fetch correctly. The predictor has
problems predicting the second command claiming that the BTB missing an
entry for this command and so the branch is predicted as not taken
(reminder: this is the second consecutive appearance of this branch).
Later on stage 10 (out of 12 in my pipe) the second branch is being flushed
because of miss prediction. And it's happens all the time causing my CPI to
be 4 times it's calculated value.
I located the problem in the bpred_unit.cc file. And after I receive a
prediction from the predictor I advance it making the fetch unit not
fetching the second branch. Now all is well.
I attach my change (It's towards the end):
bool
BPredUnit::predict(DynInstPtr &inst, TheISA::PCState &predPC, ThreadID tid)
{
// See if branch predictor predicts taken.
// If so, get its target addr either from the BTB or the RAS.
// Save off record of branch stuff so the RAS can be fixed
// up once it's done.
using TheISA::MachInst;
int asid = inst->asid;
bool pred_taken = false;
TheISA::PCState target;
++lookups;
DPRINTF(InOrderBPred, "[tid:%i] [sn:%i] %s ... PC %s doing branch "
"prediction\n", tid, inst->seqNum,
inst->staticInst->disassemble(inst->instAddr()),
inst->pcState());
void *bp_history = NULL;
if (inst->isUncondCtrl()) {
DPRINTF(InOrderBPred, "[tid:%i] Unconditional control.\n",
tid);
pred_taken = true;
// Tell the BP there was an unconditional branch.
BPUncond(bp_history);
if (inst->isReturn() && RAS[tid].empty()) {
DPRINTF(InOrderBPred, "[tid:%i] RAS is empty, predicting "
"false.\n", tid);
pred_taken = false;
}
} else {
++condPredicted;
pred_taken = BPLookup(predPC.instAddr(), bp_history);
}
PredictorHistory predict_record(inst->seqNum, predPC, pred_taken,
bp_history, tid);
// Now lookup in the BTB or RAS.
if (pred_taken) {
if (inst->isReturn()) {
++usedRAS;
// If it's a function return call, then look up the address
// in the RAS.
TheISA::PCState rasTop = RAS[tid].top();
target = TheISA::buildRetPC(inst->pcState(), rasTop);
// Record the top entry of the RAS, and its index.
predict_record.usedRAS = true;
predict_record.RASIndex = RAS[tid].topIdx();
predict_record.rasTarget = rasTop;
assert(predict_record.RASIndex < 16);
RAS[tid].pop();
DPRINTF(InOrderBPred, "[tid:%i]: Instruction %s is a return, "
"RAS predicted target: %s, RAS index: %i.\n",
tid, inst->pcState(), target,
predict_record.RASIndex);
} else {
++BTBLookups;
if (inst->isCall()) {
RAS[tid].push(inst->pcState());
// Record that it was a call so that the top RAS entry can
// be popped off if the speculation is incorrect.
predict_record.wasCall = true;
DPRINTF(InOrderBPred, "[tid:%i]: Instruction %s was a call"
", adding %s to the RAS index: %i.\n",
tid, inst->pcState(), predPC,
RAS[tid].topIdx());
}
if (inst->isCall() &&
inst->isUncondCtrl() &&
inst->isDirectCtrl()) {
target = inst->branchTarget();
} else if (BTB.valid(predPC.instAddr(), asid)) {
++BTBHits;
// If it's not a return, use the BTB to get the target addr.
target = BTB.lookup(predPC.instAddr(), asid);
DPRINTF(InOrderBPred, "[tid:%i]: [asid:%i] Instruction %s "
"predicted target is %s.\n",
tid, asid, inst->pcState(), target);
} else {
DPRINTF(InOrderBPred, "[tid:%i]: BTB doesn't have a "
"valid entry, predicting false.\n",tid);
pred_taken = false;
}
}
}
if (pred_taken) {
// Set the PC and the instruction's predicted target.
predPC = target;
TheISA::advancePC(predPC, inst->staticInst); ß
}
DPRINTF(InOrderBPred, "[tid:%i]: [sn:%i]: Setting Predicted PC to
%s.\n",
tid, inst->seqNum, predPC);
predHist[tid].push_front(predict_record);
DPRINTF(InOrderBPred, "[tid:%i] [sn:%i] pushed onto front of predHist "
"...predHist.size(): %i\n",
tid, inst->seqNum, predHist[tid].size());
return pred_taken;
}
Does my story seems right and is my correction is in place?
Thanks,
Yuval.
_______________________________________________
gem5-users mailing list
[email protected]
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users