On 07/25/18 13:35, Dong, Eric wrote:

>> From: Laszlo Ersek [mailto:[email protected]]
>> Sent: Wednesday, July 25, 2018 6:14 PM
>> To: Dong, Eric <[email protected]>; [email protected]
>> Cc: Ni, Ruiyu <[email protected]>

>> On 07/25/18 05:50, Dong, Eric wrote:

>>> But some AP may wake up later and it failed to return from the
>>> procedure.
>>
>> Ah! So that explains another symptom I've since seen as well --
>> although *very* rarely. Namely, if an AP wakes up *after*
>> PiSmmCpuDxeSmm moves on, thinking that all APs are finished, the AP
>> can execute garbage in "no man's land" -- and that crashes the guest.
>> Basically, QEMU/KVM pause the guest with "emulation failure", and
>> QEMU dumps the VCPU register state to the standard error on the host
>> side. In particular, the register state indicates that the crashed
>> VCPU is *not* in SMM.
>
> Where to get these QEMU dumps info?

QEMU simply writes the register dump to stderr. If you use QEMU together
with libvirt, then QEMU's stderr is saved in:

  /var/log/libvirt/qemu/$DOMAIN.log

Here's an example:

  KVM internal error. Suberror: 1
  emulation failure
  RAX=0000000000000000 RBX=0000000000000005 RCX=000000007eab9828 
RDX=0000000000000002
  RSI=000000000009f000 RDI=000000007eef2ae0 RBP=000000007eef2e40 
RSP=000000007eef2a08
  R8 =0000000000000002 R9 =0000000000000000 R10=0000000000000000 
R11=0000000000000000
  R12=0000000000000000 R13=0000000000000000 R14=0000000000000000 
R15=0000000000000000
  RIP=000000007ea9cdb2 RFL=00010046 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
  ES =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
  CS =0038 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
  SS =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
  DS =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
  FS =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
  GS =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
  LDT=0000 0000000000000000 0000ffff 00008200 DPL=0 LDT
  TR =0000 0000000000000000 0000ffff 00008b00 DPL=0 TSS64-busy
  GDT=     000000007ebed998 00000047
  IDT=     000000007e237280 00000fff
  CR0=c0010033 CR2=0000000000000000 CR3=000000007ec01000 CR4=00000668
  DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 
DR3=0000000000000000
  DR6=00000000ffff0ff0 DR7=0000000000000400
  EFER=0000000000000500
  Code=89 d0 5f c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc f3 90 <c3> cc cc 
cc cc cc cc cc cc cc cc cc cc cc 53 89 c8 89 d1 50 0f a2 4c 8b 54 24 38 4d 85 d2

(Anyway, this is just for general discussion now -- your v3 series works
perfectly for me.)

> I'm not clear who calls StartAllAPs function when I send this mail, I
> just based on the debug message found StartAllAps trigged right after
> PiSmmCpuDxeSmm driver. but I think this not blocks my patches for this
> issue. So I sent the patches and back to dig it more.

Sure, I think your analysis was spot-on, regardless of how you produced
it.

The register dumps / emulation failures that I mentioned in my previous
email in this thread are not a new issue with your v3 posting. The v3
series is fine. Instead, these failures were a *further* (but less
frequent) symptom with the *original* series, and I encountered them
separately from the boot hangs that I reported at first.

> Now I found it's PiSmmIpl driver calls gDS->SetMemorySpaceAttributes
> in SmmIplDxeDispatchEventNotify function which finally calls the
> StartAllAps.
> The Ap procedure is SetMtrrsFromBuffer function in CpuDxe driver.  In
> SetMtrrsFromBuffer function, it just call MtrrSetAllMtrrs to update
> the MTRR, and it use local variable to save the MTRR settings.
>
> When the hang issue raised, in the time the AP begin to run the
> procedure, the BSP has leave current function. So the saved MTRR
> setting in the local variable is not valid anymore. So the
> MtrrSetAllMtrrs function use an invalid Mtrr Setttings to set mtrrs
> and never exit the procedure. Do you think it's reasonable?

Absolutely. Basically, if an AP is allowed to continue running the user
procedure after the BSP in MpInitLib thinks that the AP is finished,
anything at all can happen -- the escapes control, it may access memory
that has been freed, and perhaps it might even execute code that has
been freed and overwritten. The actual symptoms can vary wildly.

> I found till the ExitBootService point, the AP still in busy state. I
> check the ApWakeupFunction function, found between the AP procedure
> and set AP state to Idle, no other possible code exist. So I think the
> AP should not return back from procedure. I try to add debug message
> to know where the AP is stopped at. But after I add debug message in
> MtrrSetAllMtrrs, the issue can't reproduce anymore. I tried about 30
> times. Do you have other message to get to know where the AP is?

No, I don't have any other ideas at hand to deduce where the AP was
stuck -- as far as I'm concerned, it had just wandered off into the
weeds :)

I think your idea to check the AP state at the time of the
ExitBootServices handler hang was great.

So, again, in this thread I didn't mean to report a new issue, I just
meant to report another -- less frequent -- symptom of the same "ABA"
issue.

Everything is fine with your v3 posting as much as I can tell (just
please update the commit messages before you push).

Thanks!
Laszlo
_______________________________________________
edk2-devel mailing list
[email protected]
https://lists.01.org/mailman/listinfo/edk2-devel

Reply via email to