So, you're right that it has to do with the ROM. I don't know if that's
coincidence or not. The problem area looks like this:
5123508733500: system.cpu T0 : @iret_label.54 : IRET_PROT : srli t7, t4, 0x4 : IntAlu : D=0x0000000000003b6c
5123508734000: system.cpu T0 : @iret_label.55 : IRET_PROT : xor t7, t7, t5 : IntAlu : D=0x0000000000003b6c
5123508734000: system.cpu T0 : @iret_label.56 : IRET_PROT : andi t0, t7, 0x3 : IntAlu : D=0x00000000000000a4
5123508734000: system.cpu T0 : @iret_label.57 : IRET_PROT : br 0x3a : No_OpClass :
5123508734000: system.cpu T0 : @iret_label.58 : IRET_PROT : wrflags t16, t0, t3 : IntAlu : D=0x0000000000000228
5123508772000: system.cpu T0 : @iret_label.32768 : Microcode_ROM : slli t4, t1, 0x4 : IntAlu : D=0x0000000000000340
5123508772000: system.cpu T0 : @iret_label.32769 : Microcode_ROM : ld t2, IDTR:[t4 + 0x8] : MemRead : D=0x00000000ffffffff A=0xffffffff807cb348
5123508772000: system.cpu T0 : @iret_label.32770 : Microcode_ROM : ld t4, IDTR:[t4] : MemRead : D=0x80208e000010d5cc A=0xffffffff807cb340
5123508772000: system.cpu T0 : @iret_label.32771 : Microcode_ROM : chks , t4b, 0x3 : IntAlu :
5123508772000: system.cpu T0 : @iret_label.32772 : Microcode_ROM : srli t10, t4, 0x10 : IntAlu : D=0x000080208e000010
Hopefully email doesn't mangle that too horribly. Basically, you can see
IRET_PROT (iret in protected mode) going along returning control at the
end of an interrupt generated by the UART, and right when it updates the
flags to say that interrupts are allowed again (I think, and that may be
coincidence) we jump back into the microcode ROM to start vectoring to
the next interrupt. From the implementation of IRET_PROT:
    br label("skipSegmentSquashing"), flags=(CEZF,)

    # The attribute register needs to keep track of more info before this will
    # work the way it needs to.
    #    FOR (seg = ES, DS, FS, GS)
    #        IF ((seg.attr.dpl < cpl && ((seg.attr.type = 'data')
    #              || (seg.attr.type = 'non-conforming-code')))
    #        {
    #            seg = NULL
    #        }
    #    }

skipSegmentSquashing:

    # Ignore this for now.
    #RFLAGS.v = temp_RFLAGS
    wrflags t0, t3
    #  VIF,VIP,IOPL only changed if (old_CPL = 0)
    #  IF only changed if (old_CPL <= old_RFLAGS.IOPL)
    #  VM unchanged
    #  RF cleared

    #RIP = temp_RIP
    wrip t0, t1, dataSize=ssz
};
You can see that there's a br, a wrflags, and then a wrip. The wrip is
getting skipped, which means first that the branch back to where the
interrupt happened never happens, and second that the final microop
never executes, so the regular PC never rolls forward (npc->pc, which
would enact the branch). Then later that same iret happens, and
because the interrupt was effectively entered one instruction too early,
instead of returning to wherever the kernel proper was when the
interrupt came in, it actually returns to itself (the iret). It then
attempts to iret again, but since it ended up there by accident and
there's no record of where to go, "Bad Things" (tm) happen. This
particular "Bad Thing" (tm) isn't implemented in gem5, so it panics and
dies.
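
To make that failure mode concrete, here's a toy Python model of it. This
is not gem5 code: run_iret and KERNEL_PC are made up for illustration, and
only IRET_PC comes from the trace. It just shows that if the interrupt is
recognized after wrflags but before wrip, the PC that gets saved is the
iret's own.

```python
# Toy model, not gem5 code. Function and variable names are hypothetical.
IRET_PC = 0xffffffff80209ae8    # PC of the iret macro-op, from the trace
KERNEL_PC = 0xffffffff80123456  # made-up address the iret should return to


def run_iret(interrupt_before_wrip):
    """Return the PC that the next interrupt entry would save."""
    pc = IRET_PC
    for uop in ("br", "wrflags", "wrip"):
        if uop == "wrip":
            if interrupt_before_wrip:
                # Interrupt recognized right after wrflags: wrip is
                # squashed, so the PC still points at the iret itself.
                return pc
            # wrip redirects control back to the interrupted code.
            pc = KERNEL_PC
    return pc


saved_ok = run_iret(interrupt_before_wrip=False)
saved_bad = run_iret(interrupt_before_wrip=True)
print(hex(saved_ok), hex(saved_bad))
```

In the bad case the saved return address is the iret itself, which is
the "returns to itself" behavior described above.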
If we look at the C++ that implements IRET_PROT, specifically the last
couple microops:
microops[57] = new MicroBranchFlags(machInst, macrocodeBlock,
        (1ULL << StaticInst::IsMicroop) |
        (1ULL << StaticInst::IsDelayedCommit),
        label_skipSegmentSquashing, ConditionTests::EZF);
microops[58] = new Wrflags(machInst, macrocodeBlock,
        (1ULL << StaticInst::IsMicroop) |
        (1ULL << StaticInst::IsDelayedCommit),
        InstRegIndex(NUM_INTREGS+0), InstRegIndex(NUM_INTREGS+3),
        InstRegIndex(NUM_INTREGS), env.dataSize, 0);
microops[59] = new Wrip(machInst, macrocodeBlock,
        (1ULL << StaticInst::IsMicroop) |
        (1ULL << StaticInst::IsLastMicroop) |
        (1ULL << StaticInst::IsSerializing) |
        (1ULL << StaticInst::IsSerializeAfter),
        InstRegIndex(NUM_INTREGS+0), InstRegIndex(NUM_INTREGS+1),
        InstRegIndex(NUM_INTREGS), env.stackSize, 0);
You'll see those same three microops, br, wrflags, and wrip. The first
two have IsDelayedCommit set as part of their flags, and the third
doesn't. This is as it should be.
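
The "as it should be" rule can be written down as a small check. This is
only a sketch in Python: the Microop class and is_atomic_sequence are
hypothetical names, and the flag strings just mirror the StaticInst flags
in the C++ above. It treats a microop sequence as well formed if every
microop except the last has IsDelayedCommit and the last one has
IsLastMicroop without IsDelayedCommit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Microop:
    # Hypothetical stand-in for a microop and its StaticInst flags.
    name: str
    flags: frozenset


def is_atomic_sequence(microops):
    """All but the last microop delay commit; the last one ends the macro-op."""
    *body, last = microops
    return (all("IsDelayedCommit" in u.flags for u in body)
            and "IsLastMicroop" in last.flags
            and "IsDelayedCommit" not in last.flags)


# The last three microops of IRET_PROT, with the flags from the C++ above.
tail = [
    Microop("br", frozenset({"IsMicroop", "IsDelayedCommit"})),
    Microop("wrflags", frozenset({"IsMicroop", "IsDelayedCommit"})),
    Microop("wrip", frozenset({"IsMicroop", "IsLastMicroop",
                               "IsSerializing", "IsSerializeAfter"})),
]

print(is_atomic_sequence(tail))
```

Dropping the wrip from the list makes the check fail, since the sequence
would then end on a microop that still has IsDelayedCommit set.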
Now, if we look at what O3 is doing using trace flags:
5123508734000: system.cpu.fetch: [tid:0]: Instruction PC 0xffffffff80209ae8 (58) created [sn:862856833].
5123508734000: system.cpu.fetch: [tid:0]: Instruction is: IRET_PROT : wrflags t16, t0, t3
5123508734000: global: DynInst: [sn:862856834] Instruction created. Instcount for system.cpu = 1306
5123508734000: system.cpu.fetch: [tid:0]: Instruction PC 0xffffffff80209ae8 (59) created [sn:862856834].
5123508734000: system.cpu.fetch: [tid:0]: Instruction is: IRET_PROT : wrip , t0, t1
5123508734000: system.cpu.BPredUnit: BranchPred: [tid:0]: Branch predictor predicted 0 for PC (0xffffffff80209ae8=>0xffffffff80209aea).(59=>60)
5123508734000: system.cpu.BPredUnit: BranchPred: [tid:0]: [sn:862856834] Creating prediction history for PC (0xffffffff80209ae8=>0xffffffff80209aea).(59=>60)
5123508734000: system.cpu.BPredUnit: BranchPred: [tid:0]: [sn:862856834]: History entry added.predHist.size(): 2
5123508734000: system.cpu.fetch: [tid:0]: [sn:862856834]:Branch predicted to be not taken.
5123508734000: system.cpu.fetch: [tid:0]: [sn:862856834] Branch predicted to go to (0xffffffff80209aea=>0xffffffff80209af2).(0=>1).
5123508763500: system.cpu.rename: [tid:0]: Sending instructions to IEW.
5123508763500: system.cpu.rename: [tid:0]: Removing [sn:862856834] PC:(0xffffffff80209ae8=>0xffffffff80209aea).(59=>60) from rename skidBuffer
5123508763500: system.cpu.rename: [tid:0]: Processing instruction [sn:862856834] with PC (0xffffffff80209ae8=>0xffffffff80209aea).(59=>60).
5123508764000: system.cpu.commit: Interrupt detected.
5123508764000: system.cpu: Interrupt External Interrupt being handled
5123508764000: global: RegFile: Setting int register 23 to 0x34
5123508764000: global: RegFile: Setting int register 140 to 0xffffffff80209ae8
5123508764000: system.cpu.commit: Generating trap event for [tid:0]
5123508764000: system.cpu.commit: Getting instructions from Rename stage.
5123508764000: system.cpu.commit: Instruction PC (0xffffffff80209ae8=>0xffffffff80209aea).(59=>60) [sn:862856834] [tid:0] was squashed, skipping.
So looking at that now in light of your explanation, our friend
sn:862856834 is fetched, decoded, and renamed but has not been entered
into the ROB by the time its predecessor, the wrflags, is committed. That's
probably because wrflags is serialize after (I think) and would force
wrip to wait at decode until it finished. So the scheme of waiting for
the ROB to empty seems to be flawed, and we really need to wait until
-all- of the instructions make it through, not just the ones that have
made it to the ROB.
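
Here's a toy Python model of the two gating conditions, to show the
difference. It is not O3 code and not a proposed patch: PipeState,
rob_empty_check, and full_drain_check are hypothetical names, and the
state is my reading of the trace at the tick where commit detects the
interrupt.

```python
from dataclasses import dataclass, field


@dataclass
class PipeState:
    # Hypothetical snapshot of the pipeline; not gem5's O3 data structures.
    rob: list = field(default_factory=list)      # instructions in the ROB
    pre_rob: list = field(default_factory=list)  # fetched/decoded/renamed only
    last_committed_delayed: bool = False         # last commit had IsDelayedCommit


def rob_empty_check(s):
    """The current scheme as I understand it: go once the ROB is empty."""
    return not s.rob


def full_drain_check(s):
    """Stricter: nothing in flight anywhere and not in the middle of a macro-op."""
    return not s.rob and not s.pre_rob and not s.last_committed_delayed


# wrflags has committed and wrip ([sn:862856834]) is still in rename.
at_interrupt = PipeState(rob=[], pre_rob=["wrip"], last_committed_delayed=True)
# wrip has made it all the way through and committed.
after_wrip = PipeState()

print(rob_empty_check(at_interrupt), full_drain_check(at_interrupt))
```

With the state from the trace, the ROB-only check lets the interrupt in,
while the stricter one holds it off until wrip has committed.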
I know I included a lot of information here, but does this all make
sense? Do you think that accurately describes the problem? Any plans to
fix it :-D? Not having to fix this myself would be extra awesome, as
would the fix-ee :-).
Gabe
On 07/06/11 07:26, Geoffrey Blake wrote:
> From looking at the fetch and commit code, commit should only start
> processing the interrupt after the pipeline has drained itself completely.
> Fetch as it is currently implemented will stop issuing instructions when it
> reaches an instruction 'boundary' as Ali has said thus allowing the
> interrupt to be serviced as the pipe will drain.
>
> Either x86's ISA description is not fully correct in defining
> non-interruptible micro-code sequences, --or--, we may be seeing a race
> condition where fetch stalls while issuing micro-ops (something to do with
> fetching from the ROM? I don't understand how x86 works at all) that can't
> be interrupted and the rob drains in the meantime causing commit to start
> squashing in preparation for the interrupt when it shouldn't.
>
> Geoff
>
> On Wed, Jul 6, 2011 at 8:58 AM, Ali Saidi <[email protected]> wrote:
>
>> Geoff can comment as well, but my guess here is that fetch doesn't
>> understand what an instruction boundary is in x86, possibly while
>> instructions are coming out of the micro-code rom. There is a check in fetch
>> that keeps issuing instructions after it receives an interrupt until it
>> reaches a delayedCommit instruction, at which point it stops. If this same
>> check wasn't properly implemented for the micro-code fetch part, that could
>> be the issue.
>>
>> Ali
>>
>>
>>
>> On Jul 6, 2011, at 5:28 AM, Gabe Black wrote:
>>
>>> I've tracked down the next problem in O3 to an interrupt happening
>>> almost (but not quite) at the end of an instruction. The microop the
>>> interrupt is after has the delayedCommit bit set on it, but commit
>>> doesn't check that at all before it recognizes an interrupt. It relies
>>> on fetch, although I don't really understand how that would work. Geoff
>>> Blake committed a changeset which touched that code, so hopefully he can
>>> explain? Please :-)
>>>
>>> I modified commit to keep track of whether or not the last committed
>>> instruction had delayedCommit set and to delay interrupts, and while
>>> that seemed to work in the sense that the simulation got farther,
>>> something was out of whack and instructions accumulated somewhere until
>>> they exceeded a threshold and O3 died. If you (Geoff?) want to try it
>>> for yourself, I can send you a series of patches (most already out for
>>> review) that will let you get to the point of failure.
>>>
>>> Without actually knowing exactly what's not working properly, I can't
>>> think of any reason this wouldn't affect ARM as well. I guess x86 is
>>> just a lot more microcoded and either happened on the right combination
>>> of circumstances or just the wrong spot so that the problem caused a
>>> crash.
>>> Gabe
>>> _______________________________________________
>>> gem5-dev mailing list
>>> [email protected]
>>> http://m5sim.org/mailman/listinfo/gem5-dev
>>>
>>