Re: [gem5-dev] squashing bug in O3

Steve Reinhardt Sun, 13 Nov 2011 21:14:20 -0800

Thanks for the more detailed explanation... that helped a lot.  Sounds to
me like you're on the right track.


Steve

On Sun, Nov 13, 2011 at 8:20 PM, Gabe Black wrote:

> No, we're not trying to undo anything. An example might help. Let's look
> at a dramatically simplified version of iret, the instruction that
> returns from an interrupt handler. The microops might do the following.
>
> 1. Restore prior privilege level.
> 2. If we were in kernel level, skip to 4.
> 3. Restore user level stack.
> 4. End.
>
> O3 fetches the bytes that go with iret, decodes that to a macroop, and
> starts picking microops out of it. Microop 1 is executed and drops to
> user level. Now microop 2 is executed, and O3 misspeculates that the
> branch is taken (for example). The mispredict is detected, and later
> microops in flight are squashed. O3 then attempts to restart where it
> should have gone, microop 3.
>
> Now, O3 looks at the PC involved and starts fetching the bytes which
> become the macroop which the microops are pulled from. Because microop 1
> successfully completed, the CPU is now at user level, but because the
> iret is on a kernel page, it can't be accessed. The kernel gets a page
> fault.
>
> As I mentioned before, my partially implemented fix is to pass back not
> only the PC but also the macroop that fetch should use, instead of
> making it refetch from memory. The problem is that it's only partially
> implemented, and the way squashes work in O3 makes it really tricky to
> implement it properly, or to tell whether or not it's implemented properly.
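
A toy model of the iret scenario quoted above may make the failure and the proposed fix easier to follow. This is only a sketch in Python, not gem5 code: `Macroop`, `fetch_macroop`, `restart_after_squash`, the privilege constants, and the address 0x1000 are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

KERNEL, USER = 0, 3            # hypothetical privilege levels
IRET_PC = 0x1000               # hypothetical address of iret
KERNEL_ONLY_PCS = {IRET_PC}    # iret lives on a kernel-only page


class PageFault(Exception):
    """Raised when fetch touches a page the current privilege can't access."""


@dataclass(frozen=True)
class Macroop:
    name: str
    pc: int
    microops: Tuple[str, ...]


def fetch_macroop(pc: int, privilege: int) -> Macroop:
    """Fetch and decode from memory, subject to the current privilege level."""
    if pc in KERNEL_ONLY_PCS and privilege != KERNEL:
        raise PageFault(hex(pc))
    return Macroop("iret", pc, ("restore_privilege", "branch_if_kernel",
                                "restore_user_stack", "end"))


def restart_after_squash(pc: int, upc: int, privilege: int,
                         macroop: Optional[Macroop] = None) -> str:
    """Return the microop to resume at after a squash.

    If the squash passed back the macroop, reuse it; otherwise refetch it
    from memory, which is where the bug described above shows up.
    """
    if macroop is None:
        macroop = fetch_macroop(pc, privilege)
    return macroop.microops[upc]


# Fetched at kernel level; microop 1 then drops the CPU to user level.
iret = fetch_macroop(IRET_PC, KERNEL)
privilege = USER

# Microop 2 is mispredicted, so execution must restart at microop 3 (index 2).
try:
    restart_after_squash(IRET_PC, 2, privilege)
    refetch_faulted = False
except PageFault:
    refetch_faulted = True

resumed = restart_after_squash(IRET_PC, 2, privilege, macroop=iret)
```

In this model the restart without the macroop faults, while the restart that carries the macroop back resumes at the third microop without touching memory.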
>
> Gabe
>
>
> On 11/13/11 19:21, Steve Reinhardt wrote:
> > I'd like to understand the issue a little better before commenting on a
> > solution.
> >
> > Gabe, when you say "instruction" in your original description, do you
> > mean micro-op?
> >
> > It seems to me that the fundamental problem is that we're trying to undo
> > the effects of a non-speculative micro-op, correct?  So the solution
> > you're pursuing is that branch mispredictions only roll back to the
> > offending micro-op, and don't force the entire macro-op containing that
> > micro-op to re-execute?
> >
> > Is this predicted control flow entirely internal to the macro-op?  Or is
> > this an RFI where we are integrating the control transfer and the
> > privilege change?  If it is the latter, why does the RFI need to get
> > squashed at all?
> >
> > Steve
> >
> > On Sun, Nov 13, 2011 at 4:34 PM, Gabe Black wrote:
> >
> >> Yes, this is an existing bug and the branch predictor just pokes things
> >> in the right way to expose it. The macroop isn't passed back in this
> >> particular case, and with the code the way it is, it's difficult to even
> >> tell that that's the case, let alone how to fix it. Cleaning things up
> >> won't fix the problem itself, but it will make fixing the actual problem
> >> tractable.
> >>
> >> Gabe
> >>
> >> On 11/13/11 16:16, Ali Saidi wrote:
> >>> I think this bug is latent in the code right now and the branch
> >>> predictor change just runs into it (this patch causes that branch to
> >>> be mispredicted). In any case I think the issue exists today and it's
> >>> just luck that it works currently.
> >>>
> >>> Looking at your list I imagine you should be able to recover most
> >>> things from the dyninst, however I don't know if that is actually the
> >>> case. Accepted that the squashing mechanisms should be cleaned up, I'm
> >>> not sure how that is actually going to solve the problem. Don't we
> >>> currently send back the instruction? With the current instructions
> >>> can't you figure out the macro-op it belongs to?
> >>> Ali
> >>>
> >>>
> >>>
> >>> On Nov 13, 2011, at 5:40 PM, Gabe Black wrote:
> >>>
> >>>> Hey folks. Ali has had a change out for a while ("Fix several Branch
> >>>> Predictor issues") which improves branch predictor performance
> >>>> substantially but breaks X86_FS on O3. It turns out the problem is
> >>>> that an instruction is started which returns from kernel to user
> >>>> level and is microcoded. The instruction is fetched from the kernel's
> >>>> address space successfully and starts to execute, along the way
> >>>> dropping down to user mode. Some microops later, there's some microop
> >>>> control flow which O3 mispredicts. When it squashes the mispredict and
> >>>> tries to restart, it first tries to refetch the instruction involved.
> >>>> Since it's now at user level and the instruction is on a kernel level
> >>>> only page, there's a page fault and things go downhill from there.
> >>>>
> >>>> I partially implemented a solution to this before where O3 reinstates
> >>>> the macroop it had been using when it restarts fetch. The problem here
> >>>> is that the path this kind of squash takes doesn't pass back the right
> >>>> information, and my attempts to fix that have been unsuccessful. The
> >>>> code that handles squashing in O3 is too complex: there's too much
> >>>> going on in all directions, and it's not always clear what effect a
> >>>> change will have in unrelated situations, or which call sites are
> >>>> involved in a particular type of fault.
> >>>>
> >>>> To me, it seems like the first step in fixing this problem is to clean
> >>>> up how squashes are handled in O3 so that they can be made to
> >>>> consistently handle squashes in non-restartable macroops.
> >>>>
> >>>> Without having really dug into the specifics, I think we only need two
> >>>> pieces of information when squashing: a pointer to the guilty
> >>>> instruction and whether execution should start at or after it. It
> >>>> would start at it if the instruction needed to be reexecuted due to a
> >>>> memory dependence violation, for instance, and would start after it
> >>>> for faults, interrupts, or branch mispredicts. Any other information
> >>>> that's needed, like sequence numbers or actual control flow targets,
> >>>> can be retrieved from the instructions where needed without having to
> >>>> split everything out and pass them around individually.
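
The two-field squash interface proposed above could look roughly like the following. This is a minimal sketch in Python rather than O3's actual C++ code, and every name in it (`DynInst`, `SquashRequest`, `restart_at`, `next_pc`, `should_squash`, `restart_pc`) is an assumption made for illustration. It also assumes `next_pc` already holds wherever control really goes after the instruction.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DynInst:
    """Hypothetical stand-in for an in-flight dynamic instruction."""
    seq_num: int
    pc: int
    next_pc: int                   # assumed: resolved successor PC
    macroop: Optional[str] = None  # macroop this microop came from, if any


@dataclass
class SquashRequest:
    """The only two things a squash carries in this sketch."""
    inst: DynInst      # the guilty instruction
    restart_at: bool   # True: re-execute inst; False: resume after it


def should_squash(req: SquashRequest, seq_num: int) -> bool:
    """Decide whether an in-flight instruction is thrown away."""
    if req.restart_at:
        return seq_num >= req.inst.seq_num   # guilty instruction goes too
    return seq_num > req.inst.seq_num        # guilty instruction survives


def restart_pc(req: SquashRequest) -> int:
    """Everything else is read off the instruction, not passed separately."""
    return req.inst.pc if req.restart_at else req.inst.next_pc


# Branch mispredict: keep the branch, restart after it at its real target.
branch = DynInst(seq_num=10, pc=0x40, next_pc=0x80, macroop="iret")
mispredict = SquashRequest(branch, restart_at=False)

# Memory dependence violation: the load itself must be re-executed.
load = DynInst(seq_num=20, pc=0x90, next_pc=0x94)
violation = SquashRequest(load, restart_at=True)
```

Because the request holds the instruction itself, the macroop it belongs to travels along with it, which is the piece of information the restart path needs in the iret case.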
> >>>>
> >>>> Is there any obvious problem with doing things this way? I don't think
> >>>> I'll personally have a lot of time to dedicate to this, at the very
> >>>> least in the short term, but I wanted to get the conversation going so
> >>>> we know what to do when somebody has a chance to do it.
> >>>>
> >>>> Gabe
> >>>> _______________________________________________
> >>>> gem5-dev mailing list
> >>>> [email protected]
> >>>> http://m5sim.org/mailman/listinfo/gem5-dev
> >>>>
