Thanks for the opinions. I'll look into fixing the O3 model. After sifting through the code, it appears this problem mainly comes from leftover assumptions dating from when only the Alpha ISA was supported.
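To make the intended fix concrete, here is a minimal sketch of the three-PC view Gabe describes in the quoted message, assuming Alpha-style PC/NPC pairs and a fixed 4-byte instruction size. The names PCState, InstPCs, startInst, resolveBranch, mispredicted and memOrderSquashTarget are made up for illustration; they are not the fields of the O3 dyn_inst or the real signature of squashDueToMemOrder().

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical types for illustration only; not the actual O3 classes.
struct PCState {
    uint64_t pc;
    uint64_t npc;
    bool operator==(const PCState &o) const { return pc == o.pc && npc == o.npc; }
};

// Assumption: fixed-width, Alpha-style 4-byte instructions.
constexpr uint64_t kInstBytes = 4;

struct InstPCs {
    PCState before;     // PC of the instruction itself plus its fall-through NPC
    PCState after;      // PC state once the instruction has executed
    PCState predicted;  // what the predictor expects After to be
};

// Populate the three PCs when an instruction is fetched.
inline InstPCs startInst(uint64_t pc, uint64_t predictedTarget) {
    InstPCs s{};
    s.before = {pc, pc + kInstBytes};
    // After starts out as Before walked forward by one instruction.
    s.after = {s.before.npc, s.before.npc + kInstBytes};
    s.predicted = {predictedTarget, predictedTarget + kInstBytes};
    return s;
}

// A resolved branch only ever modifies After.
inline void resolveBranch(InstPCs &s, uint64_t target) {
    s.after = {target, target + kInstBytes};
}

// Ordinary mispredict check: compare Predicted against After.
inline bool mispredicted(const InstPCs &s) { return !(s.predicted == s.after); }

// On a memory-order squash of the instruction itself, fetch restarts at
// Before, so the instruction is refetched and re-predicted from scratch.
inline PCState memOrderSquashTarget(const InstPCs &s) { return s.before; }

// Usage example mirroring the reported TBH scenario (addresses are arbitrary).
inline void exampleTbhSquash() {
    InstPCs tbh = startInst(0x1000, 0x2000);  // TBH at 0x1000, predicted target 0x2000
    // Squashed before the table load returns: redirect to the TBH itself,
    // not to a stale After or Predicted value.
    assert(memOrderSquashTarget(tbh).pc == 0x1000);
    resolveBranch(tbh, 0x2000);               // the table load finally returns
    assert(!mispredicted(tbh));               // the prediction happened to be right
}
```

The point of the sketch is only that the squash target is taken from Before, which never changes as the instruction executes, so it cannot be stale even when the branch target is not yet known.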
Geoff

On Tue, Apr 26, 2011 at 4:01 PM, Gabriel Michael Black <gbl...@eecs.umich.edu> wrote:

> I'm not all that familiar with the ins and outs of O3 despite having to
> dive into it on many occasions. My gut reaction is to fix O3. Making or
> working around/accommodating arbitrary restrictions in the CPU models makes
> things harder for everybody in the long run since things just get that much
> more complicated. It might make this case easier, but Joe grad student
> trying to implement a TBH for some other reason won't know the trick to get
> around the bug, and we'll likely all forget.
>
> Speaking from a position of ignorance (which is very liberating, you should
> try it :-) it sounds like the wrong PC is being used. If the TBH needs to be
> squashed, then fetch should redirect to the PC of the TBH. When the TBH runs
> through the system again, it will predict a PC again, and when it gets to the
> appropriate place the mispredict will be discovered.
>
> You can think of there being three PCs involved with any given instruction:
> the before, the after, and the predicted. Each of these is
> multidimensional. ARM tracks a lot of extra state in the PC, so I'll use
> Alpha as an example:
>
> Before: PC, NPC
> After: PC, NPC
> Predicted: PC, NPC
>
> When an instruction starts, Before is populated with the PC of that
> instruction and the "fall through" next PC. "Fall through" is an ambiguous
> term, but it essentially means the PC that would be moved to if the
> instruction were a nop. After is initially Before walked forward, so After.PC
> = Before.NPC and After.NPC = the fall through of Before.NPC. When the
> instruction makes modifications to the PC, it makes modifications to After.
> Predicted is what O3 thinks After will look like once the instruction runs.
> If Predicted and After are different, After is used and Predicted is thrown
> away, blah blah, branch prediction 101.
>
> If TBH is squashed due to a memory ordering problem, fetch should be
> redirected to Before.
> Then execution will pick up TBH like it was new and go
> through all the normal processing. Before should not reflect the execution
> of TBH. I think After or Predicted are getting mixed in there someplace,
> which is why things are going wrong.
>
> The idea of having three PCs is a conceptual level thing and may not
> reflect what is actually stored in the dyn_inst. Also, the names I've used
> are entirely made up.
>
> O3 can be tricky to modify, but I think fixing the problem is a better
> approach than working around it since that would really just delay the
> inevitable. Maybe not -your- inevitable, but somebody's.
>
> Gabe
>
> Quoting Korey Sewell <ksew...@umich.edu>:
>
>> On #5, is it the case that the branch is always mispredicted on a
>> squashDueToMemOrder()? I would think so because you haven't got the
>> branch back, right?
>>
>> Sounds like there are a couple of issues to tackle:
>> 1) Where to start fetching while you wait for resolution?
>> ---> a: Look in the BTB for a predicted address and, if there is one, use
>> that PC to fetch from.
>> ---------> If it's not in the BTB, you can start fetching down the
>> not-taken path there.
>> ---> b: Stall fetch at that point until the branch resolves. This would be
>> similar to the "trapPending" flag that is used to keep fetch from going
>> down a known wrong path, I think.
>>
>> 2) Micro-op or fix up the branch?
>> - I would say change the branch at that point to always mispredict
>> (maybe set the predictedPC to 0) to prevent using a dated prediction.
>> - If you do the BTB thing I suggested above, then use that as the
>> predictedPC.
>> - Then, once the branch resolves, let the normal mechanisms check against
>> the predictedPC and it should squash always (if you reset the predPC to 0)
>> or squash conditionally if you use the BTB to update your predictedPC.
>> - With regards to the already outstanding squash, as long as that
>> outstanding squash is not the oldest squash then your most recent squash
>> will be used. This may provide a quirky problem with the micro-ops
>> depending on how those get their sequence numbers. However, I'll leave
>> further elaboration on the micro-op option to Gabe/Ali since they are more
>> in tune with any gotchas on that end, but I think you could do this
>> without a micro-op a little cleaner.
>>
>> On Tue, Apr 26, 2011 at 3:12 PM, Geoffrey Blake <bla...@umich.edu> wrote:
>>
>>> I've run into a buggy interaction for the ARM ISA between a TBH (or TBB)
>>> instruction and a dependent memory operation (that gets squashed) in the
>>> O3 model leading to erroneous behavior when diffed against the Atomic
>>> model. The TBH instruction is a table-based branch that has to index into
>>> memory to calculate its branch destination, so it is both a branch and a
>>> memory op. The buggy behavior is as follows:
>>>
>>> 1) Fetch a TBH, predict branch destination
>>> 2) Begin fetching from predicted PC (which happens to be correct in my
>>> buggy run)
>>> 3) Issue younger dependent memory op to LSQ and send request to cache
>>> ahead of TBH, which is waiting on register operands
>>> 4) Issue TBH to LSQ to read memory for branch destination
>>> 5) Memory violation detected with younger instruction, squash for memory
>>> ordering
>>> --- This squash then calls squashDueToMemOrder(...), which redirects the
>>> PC of Fetch to a stale PC value stored in the TBH dyn-inst object, as it
>>> hasn't yet calculated its true PC
>>> 6) Start fetching down wrong path
>>> 7) TBH completes, but since the branch part was predicted correctly, no
>>> additional squash happens in checkMisprediction (which it may not even
>>> check due to the already outstanding squash)
>>>
>>> I see two ways to fix this: either hack up the O3 model to handle this
>>> case of a fused memory-op and branch instruction (recheck to squash when
>>> the TBH finally resolves for the special case of squashing dependent
>>> memory ops causing the fetch to screw up the branch), or split the
>>> instruction into 2 micro-ops (the load and then a dependent branch).
>>> Which one do people think would be the better option? I'm currently
>>> leaning toward micro-coding the instruction.
>>>
>>> Thanks,
>>> Geoff Blake
>>> _______________________________________________
>>> m5-dev mailing list
>>> m5-dev@m5sim.org
>>> http://m5sim.org/mailman/listinfo/m5-dev
>>
>> --
>> - Korey