Ok, I think I understand the problem now. I'll think about it a bit and see if I can come up with a good way to fix it.

Gabe

On Fri, Nov 28, 2014 at 3:15 PM, Castillo Villar, Emilio via gem5-dev
<gem5-dev@gem5.org> wrote:

> Dear all,
>
> Sorry for my poor explanations. I tried to give as many details as
> possible, but rereading my previous emails they are somewhat confusing,
> and my English skills are not yet very good.
> I will try to lay out the problem in a few lines with references to the
> actual code. This should make it easier to understand.
>
> When an interrupt arrives, the commit stage stops the fetch of new
> instructions and waits until the ROB is empty.
> Once the ROB is empty, the commit stage can process the interrupt by
> calling the interrupt->invoke method
> through the cpu->processInterrupts(cpu->getInterrupts()); call in
> cpu/o3/commit_impl.hh.
>
> The invoke call ends up executing the code at
> X86FaultBase::invoke(ThreadContext * tc, StaticInstPtr inst)
> (arch/x86/faults.cc).
>
> This call does two things. It stores the interrupt vector and the
> current PC in two registers:
>
>     tc->setIntReg(INTREG_MICRO(1), vector);
>     tc->setIntReg(INTREG_MICRO(7), pc);
>
> And it calls the code in arch/x86/isa/insts/romutil.py by setting the uPC:
>
>     entry = extern_label_longModeInterrupt;
>     ...
>     pcState.upc(romMicroPC(entry));
>     pcState.nupc(romMicroPC(entry) + 1);
>     tc->pcState(pcState);
>
> The code in that file just stores some of the processor status on the
> stack and jumps to the OS interrupt handler using the interrupt vector
> stored in the INTREG_MICRO(1) register.
>
> At this point the simulator is done with the interrupt and discards it;
> its vector is only accessible through INTREG_MICRO(1).
>
> Now there is a very narrow time frame in which we have set that register
> to hold the vector as a parameter to the longModeInterrupt routine, but we
> have not yet completed a fetch & decode of the first instruction, so the
> ROB remains empty.
>
> If another interrupt arrives at that moment, the commit stage will
> detect an empty ROB and will also invoke that interrupt,
> overwriting those registers and losing the first interrupt vector forever.
> Then the first instruction of that routine gets executed (both interrupts
> jump to the same code, so the uPC remains unchanged) and finds the vector
> of the new interrupt instead of the value stored when it was called.
>
> What I did to solve the issue was to inhibit all interrupts during
> that time frame. It was the fastest thing that I could come up with ...
>
> However, this is an extremely weird race that is extremely unlikely to
> happen. I just had bad luck :).
>
> Hope this info is useful and is now better explained than before.
>
> Thank you all for your hard work on this wonderful tool!!
>
> Best regards,
>
> Emilio
>
> ________________________________________
> From: gem5-dev [gem5-dev-boun...@gem5.org] on behalf of Gabe Black via
> gem5-dev [gem5-dev@gem5.org]
> Sent: Friday, November 28, 2014 21:33
> To: gem5 Developer List
> Subject: Re: [gem5-dev] Issue with O3 and interrupts
>
> I feel like there should be a simple solution to this problem, but I
> haven't had the time to really walk through your explanation and understand
> it yet.
>
> Gabe
>
> On Fri, Nov 28, 2014 at 9:30 AM, Castillo Villar, Emilio via gem5-dev
> <gem5-dev@gem5.org> wrote:
>
> > Hello,
> >
> > The problem is that the first interrupt calls the invoke function of the
> > X86 fault implementation in arch/x86/faults.cc.
> > This function saves the PC and the interrupt vector in the micro-arch
> > registers and calls the "longModeInterrupt" code in the microcode ROM
> > (isa/insts/romutil.py) by setting the uPC. This routine saves the PC on
> > the stack and calculates the address of the OS interrupt service routine
> > using the interrupt vector. Then the simulator forgets about this fault.
> >
> > If a second interrupt arrives after the first interrupt has set these
> > registers but before the fetch of the first instruction of the
> > longModeInterrupt routine has completed, the O3 CPU will detect an empty
> > ROB and will allow this interrupt to proceed. It will overwrite the
> > registers holding the PC and the interrupt vector, which had no chance
> > of being saved.
> > Therefore the first interrupt's data will be lost, and when the first
> > instruction of the longModeInterrupt code arrives, it sees the status
> > (vector) of the second interrupt.
> >
> > I did a hack to fix this situation where I completely disable
> > interrupts during the time window between the setting of these registers
> > and the microcode ROM execution.
> >
> > I added a new register; when it is set to 0x1, every single interrupt is
> > ignored in the x86/interrupt.cc checkInterrupts function. This had to be
> > done because setting the IF bit in the flags register does not disable
> > all interrupts. Then I added a new microop to the x86 arch that sets this
> > register to 0. I modified the routine that does all the above to call
> > this new instruction at the end. This way I made it work; it's a somewhat
> > hacky solution, so there may be more elegant ways to solve this issue.
> >
> > Hope this can be helpful.
> >
> > Best regards.
> > ________________________________________
> > From: gem5-dev [gem5-dev-boun...@gem5.org] on behalf of Nilay Vaish via
> > gem5-dev [gem5-dev@gem5.org]
> > Sent: Friday, November 28, 2014 16:03
> > To: Castillo Villar, Emilio via gem5-dev
> > Subject: Re: [gem5-dev] Issue with O3 and interrupts
> >
> > Ok, I have not seen this problem, but I got the description below. So
> > what's your suggestion on fixing the problem? Should we add a stack of
> > pending interrupts instead of maintaining one single variable?
> >
> > --
> > Nilay
> >
> > On Wed, 26 Nov 2014, Castillo Villar, Emilio via gem5-dev wrote:
> >
> > > Good evening,
> > >
> > > I am experiencing a weird issue with the O3CPU, X86 and interrupt
> > > handling. I am running in FS mode and one simulation just experienced a
> > > weird hang. The simulated machine is spinning on a value that
> > > an interrupt handler writes.
> > >
> > > After some debugging I found that when the APIC sends two interrupts to
> > > the CPU in a very short time window, the first interrupt is
> > > completely ignored. It cannot even commit the first
> > > instruction of the service routine before all its values get replaced by
> > > the next interrupt. After this interrupt completes, execution goes
> > > back to the application code and does not execute the code for the first
> > > interrupt.
> > >
> > > The problem is that the LAPIC has the vector of the first interrupt
> > > in the ISR register, as it gets restored after the second interrupt
> > > completes. Therefore, it thinks that the CPU is currently processing
> > > that interrupt, though the CPU went back to executing application code
> > > and will never clear this ISR register.
> > >
> > > The LAPIC uses this ISR value to filter incoming interrupts and, in
> > > several cases, it does not forward them to the CPU, leading to
> > > unserviced interrupts and hangs.
> > >
> > > I have seen this behavior in the kernel's native_flush_tlb_others
> > > function when a page fault happens. The core in charge of executing it
> > > sends an interrupt to all the other cores in the system and loops,
> > > checking that every core receives the interrupt. When each core
> > > receives the interrupt, it just executes the associated handler and
> > > performs a write to a variable, notifying the sender that the interrupt
> > > was processed and the TLB was flushed.
> > >
> > > The problem is that one of the cores is ignoring this interrupt,
> > > which has a vector value of 0xf0. I found that this core's LAPIC has a
> > > value of 0xf1 in the ISR, filtering every lower vector.
> > >
> > > This 0xf1 vector value was set by an interrupt that never got to
> > > execute because of the problem explained before; hence the interrupt
> > > carrying the 0xf0 vector value will never be executed and the
> > > native_flush_tlb_others function will not complete.
> > >
> > > I just took a trace of the moment when the 0xf1 interrupt gets
> > > dropped. The flags used were Exec, Commit, Faults:
> > >
> > > system.cpu00.interrupts: Interrupt 0xf1 sent to core.
> > > 7175754213000: External Interrupt: RIP 0xffffffff8027a0a0: vector 0xf1: #INTR
> > > 7175754213000: system.cpu00.interrupts: NEW IRR 0 NEW ISR f1.
> > >
> > > Now the interrupt 0xf3 gets to execute and completely displaces the
> > > first interrupt.
> > >
> > > 7175754217000: system.cpu00.interrupts: Got Trigger Interrupt message with vector 0xf3.
> > > 7175754217000: system.cpu00.interrupts: Interrupt is an Fixed.
> > > 7175754220500: system.cpu00.commit: Interrupt detected.
> > > 7175754220500: system.cpu00.interrupts: Interrupt 0xf3 sent to core.
> > > 7175754220500: External Interrupt: RIP 0xffffffff8027a0a0: vector 0xf3: #INTR
> > > 7175754220500: system.cpu00.interrupts: NEW IRR 0 NEW ISR f3.
> > >
> > > Note that the RIP is the same for both interrupts.
> > >
> > > The first committed instruction after this whole sequence of events is
> > >
> > > 7175754228500: system.cpu00 T0 : @handle_mm_fault+992.32768 : Microcode_ROM : slli t4, t1, 0x4 : IntAlu : D=0x0000000000000f30
> > >
> > > which is indeed from the 0xf3 interrupt.
> > > The CPU executes the whole handler and then writes to the APIC EOI
> > > register:
> > >
> > > 7175754418500: system.cpu00.interrupts: Writing Local APIC register 5 at offset 0xb0 as 0.
> > > 7175754418500: system.cpu00.interrupts: WRITING TO EOI NEW ISRV IS 0xf1
> > >
> > > Here the APIC believes it is servicing the 0xf1 interrupt. However the
> > > cpu goes back to the code it was executing right before the 0xf1
> > > interrupt, and never services it.
> > >
> > > I was wondering if someone has found this issue before.
> > >
> > > Thanks a lot for your time.
> > >
> > > Best regards,
> > >
> > > ---------------------------------------
> > >
> > > Emilio Castillo
> > > _______________________________________________
> > > gem5-dev mailing list
> > > gem5-dev@gem5.org
> > > http://m5sim.org/mailman/listinfo/gem5-dev
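
The sequence reported in this thread (0xf1 taken, 0xf3 taken before the entry routine's first instruction is fetched, 0xf0 later filtered by the stale ISR entry) can be illustrated with a small toy model. This is only a sketch under stated assumptions and is not gem5 code: the class ToySystem, its fields, and the simulate function are hypothetical names invented for the illustration; micro_vector stands in for INTREG_MICRO(1); the APIC is reduced to "forward a vector only if it is higher than the highest in-service vector", following the description in the thread; and each handler runs to completion without nesting.

```python
class ToySystem:
    """Hypothetical, heavily simplified model of one core plus its local APIC."""

    def __init__(self, inhibit_during_entry):
        # Models the workaround from the thread: block every interrupt
        # until the entry routine has consumed the vector register.
        self.inhibit_during_entry = inhibit_during_entry
        self.irr = set()           # vectors requested, not yet sent to the core
        self.isr = set()           # vectors the APIC believes are in service
        self.micro_vector = None   # stands in for INTREG_MICRO(1)
        self.entry_pending = False # uPC redirected, nothing fetched, ROB empty
        self.handled = []          # vectors whose handler actually ran

    def request(self, vector):
        """An interrupt message reaches the local APIC."""
        self.irr.add(vector)
        self.try_deliver()

    def try_deliver(self):
        """Commit sees an empty ROB and may take the highest pending vector."""
        if not self.irr:
            return
        if self.inhibit_during_entry and self.entry_pending:
            return  # workaround: hold the interrupt back for now
        vector = max(self.irr)
        if self.isr and vector <= max(self.isr):
            return  # filtered: an equal or higher vector is marked in service
        self.irr.remove(vector)
        self.isr.add(vector)
        self.micro_vector = vector  # overwrites any vector not yet consumed
        self.entry_pending = True

    def run_entry_and_handler(self):
        """The entry routine reads the vector, the handler runs, then EOI."""
        if not self.entry_pending:
            return
        self.entry_pending = False
        self.handled.append(self.micro_vector)
        self.isr.remove(max(self.isr))  # EOI clears the highest in-service vector
        self.try_deliver()


def simulate(inhibit_during_entry):
    system = ToySystem(inhibit_during_entry)
    system.request(0xF1)
    system.request(0xF3)  # arrives before the entry routine is fetched
    while system.entry_pending:
        system.run_entry_and_handler()
    system.request(0xF0)  # the later TLB-flush interrupt
    while system.entry_pending:
        system.run_entry_and_handler()
    return system.handled, sorted(system.isr)


if __name__ == "__main__":
    for inhibit in (False, True):
        handled, in_service = simulate(inhibit)
        print(inhibit, [hex(v) for v in handled], [hex(v) for v in in_service])
```

Without the inhibit flag, the model runs only the 0xf3 handler and leaves 0xf1 marked in service, so 0xf0 is never forwarded. With the flag set, all three vectors are handled and the ISR ends up empty. A stack of pending vectors, as suggested earlier in the thread, would be a different design and is not modeled here.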