Re: [gem5-dev] Issue with O3 and interrupts

Gabe Black via gem5-dev Fri, 28 Nov 2014 18:32:28 -0800

Ok, I think I understand the problem now. I'll think about it a bit and see
if I can come up with a good way to fix it.


Gabe

On Fri, Nov 28, 2014 at 3:15 PM, Castillo Villar, Emilio via gem5-dev <
gem5-dev@gem5.org> wrote:

> Dear all,
>
> Sorry for my poor explanations. I tried to give as many details as
> possible, but rereading my previous emails I find them somewhat confusing,
> and my English is not yet very good.
> I will try to lay out the problem in a few lines, with references to the
> actual code. This should make it easier to understand.
>
> When an interrupt arrives, the commit stage stops the fetch of new
> instructions and waits until the ROB is empty.
> Once the ROB is empty, the commit stage can process the interrupt by
> calling the interrupt->invoke method
> through the cpu->processInterrupts(cpu->getInterrupts()); call in
> cpu/o3/commit_impl.hh.
>
> The invoke call will end up executing the code at
> X86FaultBase::invoke(ThreadContext * tc, StaticInstPtr inst)
> (arch/x86/faults.cc)
>
> This call does two things. It stores the interrupt vector and the current
> PC in two registers:
>
> tc->setIntReg(INTREG_MICRO(1), vector);
> tc->setIntReg(INTREG_MICRO(7), pc);
>
> It then jumps to the microcode defined in arch/x86/isa/insts/romutil.py by
> setting the uPC:
>
> entry = extern_label_longModeInterrupt;
> ...
> pcState.upc(romMicroPC(entry));
> pcState.nupc(romMicroPC(entry) + 1);
> tc->pcState(pcState);
>
> The code in that file just saves some of the processor state on the
> stack and jumps to the OS interrupt handler using the interrupt vector
> stored in the INTREG_MICRO(1) register.
>
> At this point the simulator is done with the interrupt and discards it, so
> its vector is only accessible in INTREG_MICRO(1).
>
> There is now a very narrow time window in which the register has been set
> to hold the vector as a parameter to the longModeInterrupt routine, but the
> fetch and decode of the routine's first instruction has not completed yet,
> so the ROB remains empty.
>
> If another interrupt arrives at that moment, the commit stage will
> detect an empty ROB and will also invoke that interrupt,
> overwriting those registers and losing the first interrupt vector forever.
> Then the first instruction of that routine gets executed (both interrupts
> jump to the same code, so the uPC remains unchanged) and finds the vector
> of the new interrupt instead of the value stored when it was called.
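
The race described above can be illustrated with a toy model. This is only a Python sketch under stated assumptions, not gem5 code: the class ToyCore, its methods, and the PC value 0x1000 are hypothetical stand-ins for the commit stage, the invoke call, and the INTREG_MICRO(1) / INTREG_MICRO(7) registers.

```python
class ToyCore:
    """Hypothetical stand-in for the O3 commit stage; not gem5 code."""

    def __init__(self):
        self.rob = []          # in-flight instructions
        self.micro_regs = {}   # models INTREG_MICRO(1) and INTREG_MICRO(7)
        self.handled = []      # vectors actually seen by the ROM routine

    def invoke(self, vector, pc):
        # Models the invoke step: the parameters go into micro registers.
        # Redirecting the uPC to the ROM routine is not modeled here.
        self.micro_regs[1] = vector
        self.micro_regs[7] = pc

    def try_interrupt(self, vector, pc):
        # The only guard in this model is "the ROB is empty".
        if self.rob:
            return False
        self.invoke(vector, pc)
        return True

    def fetch_first_microop(self):
        # The routine's first micro-op enters the ROB and reads the vector.
        self.rob.append("first micro-op of the interrupt routine")
        self.handled.append(self.micro_regs[1])


core = ToyCore()
core.try_interrupt(0xF1, pc=0x1000)  # first interrupt is invoked
core.try_interrupt(0xF3, pc=0x1000)  # second arrives before any fetch
core.fetch_first_microop()
print([hex(v) for v in core.handled])  # ['0xf3']
```

In this model the routine runs once and only ever sees 0xf3; the 0xf1 vector is gone because nothing other than the empty ROB guarded the second invoke.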
>
> What I did to solve the issue was to inhibit all interrupts during
> that time window. It was the fastest thing I could come up with ...
>
> However, this is an extremely unusual race that is very unlikely to
> happen. I just had bad luck :).
>
> I hope this information is useful and better explained than before.
>
> Thank you all for your hard work on this wonderful tool!!
>
> Best regards,
>
> Emilio
>
> ________________________________________
> From: gem5-dev [gem5-dev-boun...@gem5.org] on behalf of Gabe Black via
> gem5-dev [gem5-dev@gem5.org]
> Sent: Friday, 28 November 2014 21:33
> To: gem5 Developer List
> Subject: Re: [gem5-dev] Issue with O3 and interrupts
>
> I feel like there should be a simple solution to this problem, but I
> haven't had the time to really walk through your explanation and understand
> it yet.
>
> Gabe
>
> On Fri, Nov 28, 2014 at 9:30 AM, Castillo Villar, Emilio via gem5-dev <
> gem5-dev@gem5.org> wrote:
>
> > Hello,
> >
> > The problem is that the first interrupt calls the invoke function of the
> > X86 fault implementation in arch/x86/faults.cc.
> > This function saves the PC and the interrupt vector in the
> > microarchitectural registers and calls the "longModeInterrupt" code in the
> > microcode ROM (isa/insts/romutil.py) by setting the uPC. This routine saves
> > the PC on the stack and calculates the address of the OS interrupt service
> > routine using the interrupt vector. The simulator then forgets about this
> > fault.
> >
> > If a second interrupt arrives after the first interrupt has set these
> > registers but before the fetch of the first instruction of the
> > longModeInterrupt routine has completed, the O3 CPU will detect an empty
> > ROB and allow the second interrupt to proceed. It will overwrite the
> > registers holding the PC and the interrupt vector, which have not yet had
> > a chance to be saved. The data of the first interrupt is therefore lost,
> > and when the first instruction of the longModeInterrupt code arrives, it
> > sees the status (vector) of the second interrupt.
> >
> > As a workaround, I completely disable interrupts during the time window
> > between the setting of these registers and the execution of the microcode
> > ROM routine.
> >
> > I added a new register; when it is set to 0x1, every interrupt is
> > ignored in the checkInterrupts function of x86/interrupt.cc. This was
> > necessary because the IF bit in the flags register does not disable all
> > interrupts. Then I added a new microop to the x86 architecture that sets
> > this register to 0, and I modified the routine that does all of the above
> > to call this new instruction at the end. This made it work, but it is a
> > somewhat hacky solution, so there may be more elegant ways to solve this
> > issue.
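
The workaround described above can be sketched in the same toy style. Again this is a hypothetical Python model, not the actual patch: GuardedCore, its fields, and the point at which the flag is set to 1 are assumptions made for illustration (the message only states that interrupts are ignored while the register is 0x1 and that a new microop at the end of the routine resets it to 0).

```python
class GuardedCore:
    """Hypothetical model of the workaround; names are illustrative only."""

    def __init__(self):
        self.inhibit = 0        # models the new "ignore interrupts" register
        self.vector_reg = None  # models INTREG_MICRO(1)
        self.pending = []       # interrupts waiting to be taken
        self.serviced = []      # vectors seen by the ROM routine

    def post(self, vector):
        self.pending.append(vector)

    def check_interrupts(self):
        # Nothing is reported to the core while the flag is set.
        return self.inhibit == 0 and bool(self.pending)

    def step_commit(self):
        if self.check_interrupts():
            self.vector_reg = self.pending.pop(0)
            self.inhibit = 1    # assumed: window closes when invoke runs

    def run_rom_routine(self):
        # The routine consumes the vector; its last micro-op clears the flag.
        self.serviced.append(self.vector_reg)
        self.inhibit = 0


core = GuardedCore()
core.post(0xF1)
core.step_commit()       # 0xf1 is invoked, window closed
core.post(0xF3)
core.step_commit()       # 0xf3 is held back while the flag is set
core.run_rom_routine()   # routine sees 0xf1 and reopens the window
core.step_commit()       # now 0xf3 is invoked
core.run_rom_routine()
print([hex(v) for v in core.serviced])  # ['0xf1', '0xf3']
```

The second interrupt is delayed rather than lost, so both vectors reach the routine in arrival order.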
> >
> > Hope this can be helpful.
> >
> > Best regards.
> > ________________________________________
> > From: gem5-dev [gem5-dev-boun...@gem5.org] on behalf of Nilay Vaish via
> > gem5-dev [gem5-dev@gem5.org]
> > Sent: Friday, 28 November 2014 16:03
> > To: Castillo Villar, Emilio via gem5-dev
> > Subject: Re: [gem5-dev] Issue with O3 and interrupts
> >
> > Ok, I have not seen this problem, but I got the description below.  So
> > what's your suggestion on fixing the problem?  Should we add a stack of
> > pending interrupts instead of maintaining one single variable?
> >
> > --
> > Nilay
> >
> > On Wed, 26 Nov 2014, Castillo Villar, Emilio via gem5-dev wrote:
> >
> > > Good evening,
> > >
> > > I am experiencing a strange issue with the O3CPU, X86, and interrupt
> > > handling. I am running in FS mode and one simulation just experienced an
> > > unexpected hang. The simulated machine is spinning on a value that an
> > > interrupt handler writes.
> > >
> > > After some debugging I found that when the APIC sends two interrupts to
> > > the CPU in a very short time window, the first interrupt is
> > > completely ignored. It cannot even commit the first
> > > instruction of the service routine before all its values are replaced by
> > > the next interrupt. After this second interrupt completes, execution goes
> > > back to the application code and the code for the first interrupt is
> > > never executed.
> > >
> > > The problem is that the LAPIC has the vector of the first interrupt in
> > > the ISR register, as it gets restored after the second interrupt
> > > completes. It therefore thinks that the CPU is currently processing that
> > > interrupt, although the CPU went back to executing application code and
> > > will never clear this ISR register.
> > >
> > > The LAPIC uses this ISR value to filter incoming interrupts, and in
> > > several cases it does not forward them to the CPU, leading to unserviced
> > > interrupts and hangs.
> > >
> > > I have seen this behavior in the kernel's native_flush_tlb_others
> > > function when a page fault happens. The core executing it sends an
> > > interrupt to all the other cores in the system and loops, checking that
> > > every core has received the interrupt. When a core receives the
> > > interrupt, it executes the associated handler and writes to a variable,
> > > notifying the sender that the interrupt was processed and the TLB was
> > > flushed.
> > >
> > > The problem is that one of the cores is ignoring this interrupt, which
> > > has a vector value of 0xf0. I found that this core's LAPIC has a value of
> > > 0xf1 in the ISR, filtering every lower vector.
> > >
> > > This 0xf1 vector value was set by an interrupt that never got to execute
> > > because of the problem explained above. Hence the interrupt carrying the
> > > 0xf0 vector value will never be executed and the native_flush_tlb_others
> > > function will not complete.
> > >
> > > I took a trace of the moment when the 0xf1 interrupt gets dropped. The
> > > debug flags used were Exec, Commit, Faults:
> > >
> > > system.cpu00.interrupts: Interrupt 0xf1 sent to core.
> > > 7175754213000: External Interrupt: RIP 0xffffffff8027a0a0: vector 0xf1: #INTR
> > > 7175754213000: system.cpu00.interrupts: NEW IRR 0 NEW ISR f1.
> > >
> > > Now interrupt 0xf3 gets to execute and the first interrupt is dropped
> > > entirely.
> > >
> > > 7175754217000: system.cpu00.interrupts: Got Trigger Interrupt message with vector 0xf3.
> > > 7175754217000: system.cpu00.interrupts: Interrupt is an Fixed.
> > > 7175754220500: system.cpu00.commit: Interrupt detected.
> > > 7175754220500: system.cpu00.interrupts: Interrupt 0xf3 sent to core.
> > > 7175754220500: External Interrupt: RIP 0xffffffff8027a0a0: vector 0xf3: #INTR
> > > 7175754220500: system.cpu00.interrupts: NEW IRR 0 NEW ISR f3.
> > >
> > > Note that the RIP is the same for both interrupts.
> > >
> > > The first committed instruction after this sequence of events is
> > >
> > > 7175754228500: system.cpu00 T0 : @handle_mm_fault+992.32768 : Microcode_ROM : slli   t4, t1, 0x4 : IntAlu :  D=0x0000000000000f30
> > >
> > > which is indeed from the 0xf3 interrupt.
> > > The CPU executes the whole handler and then writes to the APIC EOI register
> > >
> > > 7175754418500: system.cpu00.interrupts: Writing Local APIC register 5 at offset 0xb0 as 0.
> > > 7175754418500: system.cpu00.interrupts: WRITING TO EOI NEW ISRV IS 0xf1
> > >
> > > Here the APIC believes it is servicing the 0xf1 interrupt. However, the
> > > CPU goes back to the code it was executing right before the 0xf1
> > > interrupt and never services it.
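
The APIC-side consequence seen in the trace can also be modeled in a few lines. This Python sketch assumes a simplified rule in which a vector is forwarded only if it is greater than the highest in-service vector, and in which EOI clears the highest in-service vector. ToyLapic and its method names are hypothetical and do not correspond to the real Local APIC model.

```python
class ToyLapic:
    """Hypothetical, simplified in-service (ISR) tracking; not gem5 code."""

    def __init__(self):
        self.in_service = set()

    def isrv(self):
        # Highest vector currently marked in service (0 if none).
        return max(self.in_service, default=0)

    def deliver(self, vector):
        # Assumed rule: forward only vectors above the in-service one.
        if vector <= self.isrv():
            return False
        self.in_service.add(vector)
        return True

    def eoi(self):
        # EOI retires the highest in-service vector and returns the new ISRV.
        if self.in_service:
            self.in_service.remove(self.isrv())
        return self.isrv()


lapic = ToyLapic()
results = [
    lapic.deliver(0xF1),  # sent to the core, but its handler never runs
    lapic.deliver(0xF3),  # sent to the core, handler runs
    lapic.eoi(),          # the 0xf3 handler writes EOI; ISRV falls back to 0xf1
    lapic.deliver(0xF0),  # filtered: 0xf1 still looks "in service"
]
print(results)  # [True, True, 241, False]
```

Because the handler for 0xf1 never runs, no second EOI is ever written, so in this model 0xf1 stays in service and every vector at or below it is filtered indefinitely.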
> > >
> > > I was wondering if someone has found this issue before.
> > >
> > > Thanks a lot for your time.
> > >
> > > Best regards,
> > >
> > > ---------------------------------------
> > >
> > > Emilio Castillo
> > > _______________________________________________
> > > gem5-dev mailing list
> > > gem5-dev@gem5.org
> > > http://m5sim.org/mailman/listinfo/gem5-dev
> > >
