Hi Nilay,

I think Gabe's point is that if you take the problem he is describing and replace "misprediction" with "interrupt", then what you're seeing is just a different manifestation of the same underlying problem. In a sense an interrupt is just a control-flow misprediction, only much rarer and mostly unavoidable: the pipeline predicts that control will always go to the next sequential instruction for non-control-flow instructions (and to either the next instruction or the target for control-flow instructions), and an interrupt is a situation where that prediction is wrong.

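To make the analogy concrete, here is a toy sketch. It is not gem5 code: every name in it (Inst, HANDLER_PC, resolve, and so on) is made up for illustration, and it ignores microops, speculation depth, and everything else that makes the real O3 squash path hard. It only shows that a mispredicted branch and an interrupt both reduce to "the predicted next PC did not match the actual one", so both end in the same squash-and-redirect step.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

HANDLER_PC = 1000  # hypothetical interrupt vector, chosen arbitrarily


@dataclass
class Inst:
    pc: int
    branch_target: Optional[int] = None  # None for non-control-flow instructions
    taken: bool = False                  # real outcome, known only at execute


def predict_next_pc(inst: Inst, predict_taken: bool = False) -> int:
    """Fetch-time guess: fall through, or the target if we guess 'taken'."""
    if inst.branch_target is not None and predict_taken:
        return inst.branch_target
    return inst.pc + 1


def actual_next_pc(inst: Inst, interrupt_pending: bool = False) -> int:
    """Where control really goes after this instruction."""
    if interrupt_pending:
        # An interrupt overrides whatever the instruction stream wanted.
        return HANDLER_PC
    if inst.branch_target is not None and inst.taken:
        return inst.branch_target
    return inst.pc + 1


def resolve(inst: Inst, predict_taken: bool = False,
            interrupt_pending: bool = False) -> Tuple[bool, int]:
    """Return (squash, redirect_pc); both causes share this one path."""
    predicted = predict_next_pc(inst, predict_taken)
    actual = actual_next_pc(inst, interrupt_pending)
    return predicted != actual, actual
```

In this model, resolve(Inst(pc=10, branch_target=50, taken=True)) and resolve(Inst(pc=10), interrupt_pending=True) both report a squash; only the redirect PC differs.
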
One difference is that in the branch misprediction case, the pipeline immediately refetches along the correct path, while in the interrupt case, the interrupt handler gets executed first, and the problem doesn't show up until you return from the interrupt.

I think Gabe is also saying that if we did a larger restructuring of how squashes are handled we could solve both problems at once (and perhaps other related cases we haven't encountered yet). I agree that that sounds good in theory, but in practice if there's a straightforward fix for the current issue, I'm not opposed to putting that in.

Steve

On Fri, Dec 30, 2011 at 6:54 AM, Nilay Vaish <[email protected]> wrote:

> Gabe, from all the traces that I have been through, I never found your
> explanation to be true. Here is an excerpt from your first email in that
> thread --
>
> ----
> It turns out the problem is that an instruction is started which returns
> from kernel to user level and is microcoded. The instruction is fetched
> from the kernel's address space successfully and starts to execute, along
> the way dropping down to user mode. Some microops later, there's some
> microop control flow which O3 mispredicts. When it squashes the mispredict
> and tries to restart, it first tries to refetch the instruction involved.
> Since it's now at user level and the instruction is on a kernel level only
> page, there's a page fault and things go downhill from there.
> ----
>
> You claim that because of branch misprediction, the instruction needs to
> be refetched. I never saw that happening. In all the cases your fix, in
> which the fetch stage picks the current macroop from the just now squashed
> instruction, worked as expected.
>
> What I saw was that an isSquashAfter microop squashes everything and this
> makes the O3 CPU start on handling interrupt. Returning from the interrupt,
> it faults since the CS register had been overwritten by the sysret
> instruction, but not the instruction pointer.
>
> Again, I think we need to change the behavior of isSquashAfter.
>
> --
> Nilay
>
> On Fri, 30 Dec 2011, Korey Sewell wrote:
>
>> I can't vouch for reading all the emails but I have gone through this
>> whole thread (which dates back to Nov. 29th).
>>
>> Also, I'm not all the way familiar with x86 so maybe this excludes me from
>> understanding the problem at the detailed level, but I think I am starting
>> to get a good grasp of the general squashing problem here (basically
>> maintaining squash state through exception events).
>>
>> My concern is that if you don't literally "fix" the problem first, you can
>> get caught up in the minutia of making this big grand sweeping change and
>> then have no good way to say if "the fix" fixes anything in the first
>> place.
>>
>> If Nilay or anyone could get something to the reviewboard that worked,
>> hack or not, then that would be a good step toward making the "clean"
>> change that I think you're referring to Gabe. We dont have to commit the
>> code, but on a 1st pass working is better then "not working", right? :)
>>
>> (Gabe, I do understand it can be frustrating explaining the same things
>> over/over again.)
>>
>> On Fri, Dec 30, 2011 at 3:48 AM, Gabe Black <[email protected]> wrote:
>>
>>> If you read my emails the problem would already be identified and
>>> understood, because I did that weeks or even months ago and explained it
>>> multiple times. A hack fix is not ok. A hack fix is why this is still
>>> broken in the first place. That's also something I explained in my
>>> emails.
>>>
>>> Gabe
>>>
>>> On 12/30/11 02:50, Korey Sewell wrote:
>>>
>>>> I agree with you Gabe that the squashing mechanism could be cleaned up.
>>>>
>>>> But I'd also suggest that Nilay should understand/identify the problem
>>>> first and then implement a first-pass fix without a big squashing revamp
>>>> (if possible).
>>>>
>>>> That way, when we (nilay, you, me, whoever) gets to revamping the squash
>>>> code, there is at least a set test case and trace we can use to debug
>>>> with..
>>>>
>>>> On Fri, Dec 30, 2011 at 2:30 AM, Gabe Black <[email protected]> wrote:
>>>>
>>>>> What was unclear about this email and the ones before it? Did you not
>>>>> believe me for some reason? You've spent about a month partially
>>>>> rediscovering what I explained in them. I've already said how this
>>>>> needs to be fixed.
>>>>>
>>>>> Gabe
>>
>> --
>> - Korey

_______________________________________________
gem5-dev mailing list
[email protected]
http://m5sim.org/mailman/listinfo/gem5-dev
