Re: [gem5-dev] Issue with O3 and interrupts

Gabe Black via gem5-dev Mon, 01 Dec 2014 02:45:57 -0800

Not necessarily. An ISA may say it's ok to recognize a higher priority
interrupt before actually executing any instructions of the lower priority
interrupt handler. The interrupt entering mechanism in x86 is very
complicated and can't realistically be handled instantly in the fault
object's invoke method, especially since it can't write to memory to set up
the exception stack. I think the other ISAs might be fine as is, although
I'm not confident enough to guarantee that's true.


Gabe

On Sun, Nov 30, 2014 at 10:38 AM, Nilay Vaish via gem5-dev <
gem5-dev@gem5.org> wrote:

> Should we not go for an architecture independent solution?  I would go for
> the commit stage marking itself as PREPARING_FOR_ISR.  New interrupts would
> not be accepted while in this new state.  Once instructions start getting
> committed, we would go back to some normal state.
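
A minimal sketch of the commit-stage state proposed above, assuming a much simplified model of the commit logic. Apart from PREPARING_FOR_ISR, which is the state name suggested in the message, every identifier here (CommitModel, tryTakeInterrupt, instructionCommitted, vectorSeenByHandler) is hypothetical and is not taken from gem5's O3 code.

```cpp
#include <cstdint>
#include <optional>

// Simplified, hypothetical model of the proposal. Only the state name
// PREPARING_FOR_ISR comes from the thread; nothing here is gem5 code.
enum class CommitState { NORMAL, PREPARING_FOR_ISR };

class CommitModel {
  public:
    // Returns true if the interrupt was accepted and "invoked".
    bool tryTakeInterrupt(std::uint8_t vector, bool robEmpty) {
        if (state != CommitState::NORMAL || !robEmpty)
            return false;            // leave the interrupt pending
        latchedVector = vector;      // stands in for the fault's invoke()
        state = CommitState::PREPARING_FOR_ISR;
        return true;
    }

    // Called once an instruction commits, i.e. the handler entry code
    // has started to make progress.
    void instructionCommitted() { state = CommitState::NORMAL; }

    // The vector the interrupt entry code would find.
    std::optional<std::uint8_t> vectorSeenByHandler() const {
        return latchedVector;
    }

  private:
    CommitState state = CommitState::NORMAL;
    std::optional<std::uint8_t> latchedVector;
};
```

In this model a second call to tryTakeInterrupt() made before instructionCommitted() is refused, so the first vector survives until the entry code has started committing.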
>
> --
> Nilay
>
>
>
> On Fri, 28 Nov 2014, Gabe Black via gem5-dev wrote:
>
>> What I'm thinking right now is that your new register idea isn't that
>> far off. Instead of that, I'd suggest we add a new flag in the flags
>> register somewhere x86 code can't see it. There are already a couple of
>> flags like that called ECF and EZF (emulation carry flag and emulation
>> zero flag) which are used by microcode. The code that enters the
>> interrupt handler can set that flag, and then the code which exits the
>> microcode ROM could clear it when it executes wrflags right at the end.
>> We can talk about that in more detail if necessary.
>>
>> One way or the other, you'd have to make sure that if you took an
>> exception while preparing for the interrupt in the microcode ROM, the
>> new flag would still be cleared somehow. I *think* all the various types
>> of exceptions end up somewhere in the microcode ROM before returning
>> control to x86 code, so as long as all those clear the new flag on the
>> way out it *should* be enough. It's worth double checking that though.
>>
>> As an aside, I also notice that in X86Trap::invoke, it gets the current
>> PC, adjusts it to make it go to the next x86 instruction, and then
>> doesn't do anything with it. I don't remember what I was trying to do
>> there, but as far as I can tell what it's doing now is sort of
>> pointless...
>>
>> Gabe
>>
>> On Fri, Nov 28, 2014 at 6:32 PM, Gabe Black <gabebl...@google.com> wrote:
>>
>>> Ok, I think I understand the problem now. I'll think about it a bit and
>>> see if I can come up with a good way to fix it.
>>>
>>> Gabe
>>>
>>> On Fri, Nov 28, 2014 at 3:15 PM, Castillo Villar, Emilio via gem5-dev <
>>> gem5-dev@gem5.org> wrote:
>>>
>>>> Dear all,
>>>>
>>>> Sorry for my poor explanations. I tried to give as many details as
>>>> possible, but rereading my previous emails I see they are somewhat
>>>> confusing, and my English skills are not yet very good.
>>>> I will try to lay out the problem in a few lines with references to the
>>>> actual code. This should make it easier to understand.
>>>>
>>>> When an interrupt arrives, the commit stage stops the fetch of new
>>>> instructions and waits until the ROB is empty.
>>>> Once the ROB is empty, the commit stage can process the interrupt by
>>>> calling the interrupt->invoke method through the
>>>> cpu->processInterrupts(cpu->getInterrupts()); call in
>>>> cpu/o3/commit_impl.hh.
>>>>
>>>> The invoke call ends up executing the code at
>>>> X86FaultBase::invoke(ThreadContext * tc, StaticInstPtr inst)
>>>> (arch/x86/faults.cc).
>>>>
>>>> This call does two things. It stores the interrupt vector and the
>>>> current PC in two registers:
>>>>
>>>> tc->setIntReg(INTREG_MICRO(1), vector);
>>>> tc->setIntReg(INTREG_MICRO(7), pc);
>>>>
>>>> It then jumps to the code in arch/x86/isa/insts/romutil.py by setting the uPC:
>>>>
>>>> entry = extern_label_longModeInterrupt;
>>>> ...
>>>> pcState.upc(romMicroPC(entry));
>>>> pcState.nupc(romMicroPC(entry) + 1);
>>>> tc->pcState(pcState);
>>>>
>>>> The code in that file just stores some of the processor state on the
>>>> stack and jumps to the OS interrupt handler using the interrupt vector
>>>> stored in the INTREG_MICRO(1) register.
>>>>
>>>> At this point the simulator is done with the interrupt and discards
>>>> it, so its vector is only accessible through INTREG_MICRO(1).
>>>>
>>>> There is now a very narrow time frame in which the register holds the
>>>> vector as a parameter to the longModeInterrupt routine, but the fetch
>>>> and decode of the routine's first instruction have not completed, so
>>>> the ROB remains empty.
>>>>
>>>> If another interrupt arrives at that moment, the commit stage will
>>>> detect an empty ROB and will invoke the new interrupt as well,
>>>> overwriting those registers and losing the first interrupt vector
>>>> forever.
>>>> Then the first instruction of that routine gets executed (both
>>>> interrupts jump to the same code, so the uPC remains unchanged) and
>>>> finds the vector of the new interrupt instead of the value stored when
>>>> the routine was first called.
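
The sequence described above can be replayed with a toy model. This is only a sketch under stated assumptions: ToyThread, toyInvoke, vectorSeenByMicrocode and replayRace are hypothetical names, the register file is a plain array, and only the micro register indices 1 and 7 come from the message. The vectors 0xf1 and 0xf3 and the RIP value are the ones from the trace quoted further down in this thread.

```cpp
#include <array>
#include <cstdint>

// Toy stand-in for the state touched by the x86 fault invoke(). All names
// are hypothetical; only micro register indices 1 and 7 are from the thread.
struct ToyThread {
    std::array<std::uint64_t, 16> microRegs{};  // models INTREG_MICRO(n)
    bool robEmpty = true;                       // nothing fetched yet
};

// Mirrors the two setIntReg() calls: vector to micro reg 1, PC to reg 7.
inline void toyInvoke(ToyThread &t, std::uint8_t vector, std::uint64_t pc) {
    t.microRegs[1] = vector;
    t.microRegs[7] = pc;
}

// What the first microop of the interrupt routine would read as its vector.
inline std::uint64_t vectorSeenByMicrocode(const ToyThread &t) {
    return t.microRegs[1];
}

// Two invokes back to back with no fetch in between.
inline std::uint64_t replayRace() {
    ToyThread t;
    toyInvoke(t, 0xf1, 0xffffffff8027a0a0ULL);      // first interrupt
    // The ROB is still empty, so commit accepts the second interrupt too.
    if (t.robEmpty)
        toyInvoke(t, 0xf3, 0xffffffff8027a0a0ULL);  // second interrupt
    return vectorSeenByMicrocode(t);                // first vector is gone
}
```

In this model replayRace() returns the second vector, because nothing guards the micro registers between the invoke and the first fetch.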
>>>>
>>>> What I did to solve the issue was to inhibit all interrupts during
>>>> that time frame. It was the fastest thing that I could come up with ...
>>>>
>>>> However, this is an extremely weird race that is extremely unlikely to
>>>> happen. I just had bad luck :).
>>>>
>>>> I hope this info is useful and better explained than before.
>>>>
>>>> Thank you all for your hard work on this wonderful tool!!
>>>>
>>>> Best regards,
>>>>
>>>> Emilio
>>>>
>>>> ________________________________________
>>>> From: gem5-dev [gem5-dev-boun...@gem5.org] on behalf of Gabe Black via
>>>> gem5-dev [gem5-dev@gem5.org]
>>>> Sent: Friday, 28 November 2014 21:33
>>>> To: gem5 Developer List
>>>> Subject: Re: [gem5-dev] Issue with O3 and interrupts
>>>>
>>>> I feel like there should be a simple solution to this problem, but I
>>>> haven't had the time to really walk through your explanation and
>>>> understand it yet.
>>>>
>>>> Gabe
>>>>
>>>> On Fri, Nov 28, 2014 at 9:30 AM, Castillo Villar, Emilio via gem5-dev <
>>>> gem5-dev@gem5.org> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> The problem is that the first interrupt calls the invoke function of
>>>>> the x86 fault implementation in arch/x86/faults.cc.
>>>>> This function saves the PC and the interrupt vector in the
>>>>> microarchitectural registers and jumps to the "longModeInterrupt"
>>>>> code in the microcode ROM (isa/insts/romutil.py) by setting the uPC.
>>>>> This routine saves the PC on the stack and calculates the address of
>>>>> the OS interrupt service routine using the interrupt vector. Then the
>>>>> simulator forgets about this fault.
>>>>>
>>>>> If a second interrupt arrives after the first interrupt has set these
>>>>> registers but before the fetch of the first instruction of the
>>>>> longModeInterrupt routine has completed, the O3 CPU will detect an
>>>>> empty ROB and will allow this interrupt to proceed. It will overwrite
>>>>> the registers holding the PC and the interrupt vector, which have not
>>>>> had the chance to be saved.
>>>>> Therefore the first interrupt's data will be lost, and when the first
>>>>> instruction of the longModeInterrupt code arrives, it sees the state
>>>>> (vector) of the second interrupt.
>>>>>
>>>>> I wrote a hack to fix this situation that completely disables
>>>>> interrupts during the time window between the setting of these
>>>>> registers and the microcode ROM execution.
>>>>>
>>>>> I added a new register that, when set to 0x1, causes every interrupt
>>>>> to be ignored in the x86/interrupt.cc checkInterrupts function. This
>>>>> had to be done because the IF bit in the flags register does not
>>>>> disable all interrupts. Then I added a new microop to the x86 arch
>>>>> that sets this register to 0. I modified the routine that does all the
>>>>> above to call this new instruction at the end. This way I made it
>>>>> work. It's a somewhat hacky solution, so there might be more elegant
>>>>> ways to solve this issue.
>>>>>
>>>>> I hope this can be helpful.
>>>>>
>>>>> Best regards.
>>>>> ________________________________________
>>>>> From: gem5-dev [gem5-dev-boun...@gem5.org] on behalf of Nilay Vaish via
>>>>> gem5-dev [gem5-dev@gem5.org]
>>>>> Sent: Friday, 28 November 2014 16:03
>>>>> To: Castillo Villar, Emilio via gem5-dev
>>>>> Subject: Re: [gem5-dev] Issue with O3 and interrupts
>>>>>
>>>>> Ok, I have not seen this problem, but I got the description below.  So
>>>>> what's your suggestion on fixing the problem?  Should we add a stack of
>>>>> pending interrupts instead of maintaining one single variable?
>>>>>
>>>>> --
>>>>> Nilay
>>>>>
>>>>> On Wed, 26 Nov 2014, Castillo Villar, Emilio via gem5-dev wrote:
>>>>>
>>>>>> Good evening,
>>>>>>
>>>>>> I am experiencing a weird issue with the O3CPU, X86 and the interrupt
>>>>>> handling. I am running in FS mode and one simulation just experienced
>>>>>> a weird hang. The simulated machine is spinning on a spinlock over a
>>>>>> value that an interrupt handler writes.
>>>>>>
>>>>>> After some debugging I found that when the APIC sends two interrupts
>>>>>> to the CPU in a very short time window, the first interrupt is
>>>>>> completely ignored. It cannot even commit the first instruction of
>>>>>> the service routine before all its values get replaced by the next
>>>>>> interrupt. After this interrupt completes, execution goes back to the
>>>>>> application code and does not execute the code for the first
>>>>>> interrupt.
>>>>>>
>>>>>> The problem is that the LAPIC has the vector of the first interrupt
>>>>>> in the ISR register, as it gets restored after the second interrupt
>>>>>> completes. Therefore, it thinks that the CPU is currently processing
>>>>>> that interrupt, though the CPU went back to executing application
>>>>>> code and will never clear this ISR register.
>>>>>>
>>>>>> The LAPIC uses this ISR value to filter incoming interrupts and in
>>>>>> several cases it does not forward them to the CPU, leading to
>>>>>> unattended interrupts and hangs.
>>>>>>
>>>>>> I have seen this behavior in the kernel's native_flush_tlb_others
>>>>>> function when a page fault happens. The core in charge of executing
>>>>>> it sends an interrupt to all the other cores in the system and loops
>>>>>> checking that every core has received the interrupt. When each core
>>>>>> receives the interrupt, it just executes the associated handler and
>>>>>> performs a write to a variable, notifying the sender that the
>>>>>> interrupt was processed and the TLB was flushed.
>>>>>>
>>>>>> The problem is that one of the cores is ignoring this interrupt,
>>>>>> which has a vector value of 0xf0. I found that this core's LAPIC has
>>>>>> a value of 0xf1 in the ISR, filtering every lower vector.
>>>>>>
>>>>>> This 0xf1 vector value was set by an interrupt that never got to
>>>>>> execute because of the problem explained before, hence the interrupt
>>>>>> carrying the 0xf0 vector value will never be executed and the
>>>>>> native_flush_tlb_others function will not complete.
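
The stale ISR entry described above can be modelled in a few lines. This is a sketch of the behavior as reported in this message (a vector is forwarded only if it is numerically higher than the highest in-service vector, and an EOI retires the highest in-service vector), not gem5's LAPIC code. ToyLapic and its members are hypothetical names, and the task priority register is ignored.

```cpp
#include <bitset>
#include <cstdint>

// Hypothetical toy LAPIC; models only the in-service register (ISR).
struct ToyLapic {
    std::bitset<256> isr;

    // Highest vector currently marked in service, or -1 if none.
    int highestInService() const {
        for (int v = 255; v >= 0; --v)
            if (isr.test(v))
                return v;
        return -1;
    }

    // Forward a vector to the core only if it beats what is in service.
    bool deliver(std::uint8_t vector) {
        if (static_cast<int>(vector) <= highestInService())
            return false;           // filtered by the ISR
        isr.set(vector);
        return true;
    }

    // End of interrupt: retire the highest in-service vector.
    void eoi() {
        int h = highestInService();
        if (h >= 0)
            isr.reset(h);
    }
};
```

Delivering 0xf1 and then 0xf3, followed by a single eoi(), leaves 0xf1 marked in service in this model, so a later deliver(0xf0) is refused. That is the stuck state the message reports.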
>>>>>>
>>>>>> I just took a trace of the moment when the 0xf1 interrupt gets
>>>>>> dropped. The flags used were Exec, Commit, Faults:
>>>>>>
>>>>>> system.cpu00.interrupts: Interrupt 0xf1 sent to core.
>>>>>> 7175754213000: External Interrupt: RIP 0xffffffff8027a0a0: vector 0xf1: #INTR
>>>>>> 7175754213000: system.cpu00.interrupts: NEW IRR 0 NEW ISR f1.
>>>>>>
>>>>>> Now the interrupt 0xf3 gets to execute and completely displaces the
>>>>>> first interrupt.
>>>>>>
>>>>>> 7175754217000: system.cpu00.interrupts: Got Trigger Interrupt message with vector 0xf3.
>>>>>> 7175754217000: system.cpu00.interrupts: Interrupt is an Fixed.
>>>>>> 7175754220500: system.cpu00.commit: Interrupt detected.
>>>>>> 7175754220500: system.cpu00.interrupts: Interrupt 0xf3 sent to core.
>>>>>> 7175754220500: External Interrupt: RIP 0xffffffff8027a0a0: vector 0xf3: #INTR
>>>>>> 7175754220500: system.cpu00.interrupts: NEW IRR 0 NEW ISR f3.
>>>>>>
>>>>>> It can be seen that the RIP is the same for both interrupts.
>>>>>>
>>>>>> The first committed instruction after this whole sequence of events is
>>>>>>
>>>>>> 7175754228500: system.cpu00 T0 : @handle_mm_fault+992.32768 : Microcode_ROM : slli   t4, t1, 0x4 : IntAlu :  D=0x0000000000000f30
>>>>>>
>>>>>> which indeed is from the 0xf3 interrupt.
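
As a quick check of that reading of the trace: a left shift by 4 multiplies by 16, which is the size of a long-mode IDT gate descriptor, so the result of the slli identifies the vector it was computed from. The sketch below assumes t1 is the micro register holding the vector; idtByteOffset is a hypothetical helper name.

```cpp
#include <cstdint>

// Equivalent of the traced microop "slli t4, t1, 0x4": vector * 16.
constexpr std::uint64_t idtByteOffset(std::uint64_t vector) {
    return vector << 4;
}

// The traced result D=0x...0f30 matches vector 0xf3, not 0xf1.
static_assert(idtByteOffset(0xf3) == 0xf30, "offset for vector 0xf3");
static_assert(idtByteOffset(0xf1) == 0xf10, "offset for vector 0xf1");
```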
>>>>>> The CPU executes the whole handler and then writes to the APIC EOI
>>>>>> register
>>>>>>
>>>>>> 7175754418500: system.cpu00.interrupts: Writing Local APIC register 5 at offset 0xb0 as 0.
>>>>>> 7175754418500: system.cpu00.interrupts: WRITING TO EOI NEW ISRV IS 0xf1
>>>>>>
>>>>>> Here the APIC believes it is servicing the 0xf1 interrupt. However, the
>>>>>> CPU goes back to the code it was executing right before the 0xf1
>>>>>> interrupt, and never services it.
>>>>>>
>>>>>> I was wondering if someone has found this issue before.
>>>>>>
>>>>>> Thanks a lot for your time.
>>>>>>
>>>>>> Best regards,
>>>>>>
>>>>>> ---------------------------------------
>>>>>>
>>>>>> Emilio Castillo
>>>>>> _______________________________________________
>>>>>> gem5-dev mailing list
>>>>>> gem5-dev@gem5.org
>>>>>> http://m5sim.org/mailman/listinfo/gem5-dev