Re: [edk2] [Patch V2] UefiCpuPkg/MpInitLib: Remove redundant parameter.

2018-07-25 Thread Laszlo Ersek
On 07/25/18 13:35, Dong, Eric wrote:

>> From: Laszlo Ersek [mailto:ler...@redhat.com]
>> Sent: Wednesday, July 25, 2018 6:14 PM
>> To: Dong, Eric ; edk2-devel@lists.01.org
>> Cc: Ni, Ruiyu 

>> On 07/25/18 05:50, Dong, Eric wrote:

>>> But some AP may wake up later and it failed to return from the
>>> procedure.
>>
>> Ah! So that explains another symptom I've since seen as well --
>> although *very* rarely. Namely, if an AP wakes up *after*
>> PiSmmCpuDxeSmm moves on, thinking that all APs are finished, the AP
>> can execute garbage in "no man's land" -- and that crashes the guest.
>> Basically, QEMU/KVM pause the guest with "emulation failure", and
>> QEMU dumps the VCPU register state to the standard error on the host
>> side. In particular, the register state indicates that the crashed
>> VCPU is *not* in SMM.
>
> Where to get these QEMU dumps info?

QEMU simply writes the register dump to stderr. If you use QEMU together
with libvirt, then QEMU's stderr is saved in:

  /var/log/libvirt/qemu/$DOMAIN.log

Here's an example:

  KVM internal error. Suberror: 1
  emulation failure
  RAX= RBX=0005 RCX=7eab9828 
RDX=0002
  RSI=0009f000 RDI=7eef2ae0 RBP=7eef2e40 
RSP=7eef2a08
  R8 =0002 R9 = R10= 
R11=
  R12= R13= R14= 
R15=
  RIP=7ea9cdb2 RFL=00010046 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
  ES =0030   00c09300 DPL=0 DS   [-WA]
  CS =0038   00a09b00 DPL=0 CS64 [-RA]
  SS =0030   00c09300 DPL=0 DS   [-WA]
  DS =0030   00c09300 DPL=0 DS   [-WA]
  FS =0030   00c09300 DPL=0 DS   [-WA]
  GS =0030   00c09300 DPL=0 DS   [-WA]
  LDT=   8200 DPL=0 LDT
  TR =   8b00 DPL=0 TSS64-busy
  GDT= 7ebed998 0047
  IDT= 7e237280 0fff
  CR0=c0010033 CR2= CR3=7ec01000 CR4=0668
  DR0= DR1= DR2= 
DR3=
  DR6=0ff0 DR7=0400
  EFER=0500
  Code=89 d0 5f c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc f3 90  cc cc 
cc cc cc cc cc cc cc cc cc cc cc 53 89 c8 89 d1 50 0f a2 4c 8b 54 24 38 4d 85 d2

(Anyway, this is just for general discussion now -- your v3 series works
perfectly for me.)

> I'm not clear who calls StartAllAPs function when I send this mail, I
> just based on the debug message found StartAllAps trigged right after
> PiSmmCpuDxeSmm driver. but I think this not blocks my patches for this
> issue. So I sent the patches and back to dig it more.

Sure, I think your analysis was spot-on, regardless of how you produced
it.

The register dumps / emulation failures that I mentioned in my previous
email in this thread are not a new issue with your v3 posting. The v3
series is fine. Instead, these failures were a *further* (but less
frequent) symptom with the *original* series, and I encountered them
separately from the boot hangs that I reported at first.

> Now I found it's PiSmmIpl driver calls gDS->SetMemorySpaceAttributes
> in SmmIplDxeDispatchEventNotify function which finally calls the
> StartAllAps.
> The Ap procedure is SetMtrrsFromBuffer function in CpuDxe driver.  In
> SetMtrrsFromBuffer function, it just call MtrrSetAllMtrrs to update
> the MTRR, and it use local variable to save the MTRR settings.
>
> When the hang issue raised, in the time the AP begin to run the
> procedure, the BSP has leave current function. So the saved MTRR
> setting in the local variable is not valid anymore. So the
> MtrrSetAllMtrrs function use an invalid Mtrr Setttings to set mtrrs
> and never exit the procedure. Do you think it's reasonable?

Absolutely. Basically, if an AP is allowed to continue running the user
procedure after the BSP in MpInitLib thinks that the AP is finished,
anything at all can happen -- the escapes control, it may access memory
that has been freed, and perhaps it might even execute code that has
been freed and overwritten. The actual symptoms can vary wildly.

> I found till the ExitBootService point, the AP still in busy state. I
> check the ApWakeupFunction function, found between the AP procedure
> and set AP state to Idle, no other possible code exist. So I think the
> AP should not return back from procedure. I try to add debug message
> to know where the AP is stopped at. But after I add debug message in
> MtrrSetAllMtrrs, the issue can't reproduce anymore. I tried about 30
> times. Do you have other message to get to know where the AP is?

No, I don't have any other ideas at hand to deduce where the AP was
stuck -- as far as I'm concerned, it had just wandered off into the
weeds :)

I think your idea to check the AP state at the time of the

Re: [edk2] [Patch V2] UefiCpuPkg/MpInitLib: Remove redundant parameter.

2018-07-25 Thread Dong, Eric
Hi Laszlo,


> -Original Message-
> From: Laszlo Ersek [mailto:ler...@redhat.com]
> Sent: Wednesday, July 25, 2018 6:14 PM
> To: Dong, Eric ; edk2-devel@lists.01.org
> Cc: Ni, Ruiyu 
> Subject: Re: [edk2] [Patch V2] UefiCpuPkg/MpInitLib: Remove redundant
> parameter.
> 
> On 07/25/18 05:50, Dong, Eric wrote:
> > Hi Laszlo,
> >
> > I have root cause this issue, the AP hangs in the procedure when
> > PiSmmCpuDxeSmm driver start up trigged this issue.
> >
> > When PiSmmCpuDxeSmm driver start up, it will call StartAllAps to set
> > memory attribute.  In StartAllAps function, after call WakeUpAp to
> > start Aps, it calls CheckAllAps to wait all Aps finished the task. In
> > CheckAllAps function, it detect AP state to know whether the AP has
> > finished its task. In old code, it check whether the AP state is
> > CpuStateFinished to know whether AP has finished tasks. This state is
> > only set by AP when it truly finished task. In new logic,
> > CpuStateFinished been replace with CpuStateIdle. And CpuStateIdle is
> > also the begin state of the AP. AP will change state from CpuStateIdle
> > to CpuStateBusy when it start execute the procedure. And after it
> > finished the procedure, it will change state back to CpuStateIdle.
> >
> > So when the hang issue raised, AP state is not been changed to
> > CpuStateBusy when BSP calls CheckAllAps to check whether the AP has
> > finished its task. So the state for the AP still in CpuStateIdle, but
> > BSP think AP has finished its task. In this case, BSP think all the
> > Aps has finished their tasks and it continues boot.
> 
> Awesome analysis!
> 
> So, this looks like an "inverse" variant of the classic "ABA problem":
> 
> https://en.wikipedia.org/wiki/ABA_problem

Yes, it's an ABA problem. New patch keep the CpuStateReady to distinguish from 
the CpuStateIdle state.

> 
> > But some AP may wake up later and it failed to return from the
> > procedure.
> 
> Ah! So that explains another symptom I've since seen as well -- although
> *very* rarely. Namely, if an AP wakes up *after* PiSmmCpuDxeSmm moves
> on, thinking that all APs are finished, the AP can execute garbage in "no 
> man's
> land" -- and that crashes the guest. Basically, QEMU/KVM pause the guest
> with "emulation failure", and QEMU dumps the VCPU register state to the
> standard error on the host side. In particular, the register state indicates 
> that
> the crashed VCPU is *not* in SMM.

Where to get these QEMU dumps info?

I'm not clear who calls StartAllAPs function when I send this mail, I just 
based on the debug message found StartAllAps trigged right after PiSmmCpuDxeSmm 
driver. but I think this not blocks my patches for this issue. So I sent the 
patches and back to dig it more.

Now I found it's PiSmmIpl driver calls gDS->SetMemorySpaceAttributes in 
SmmIplDxeDispatchEventNotify function which finally calls the StartAllAps.
The Ap procedure is SetMtrrsFromBuffer function in CpuDxe driver.  In 
SetMtrrsFromBuffer function, it just call MtrrSetAllMtrrs to update the MTRR, 
and it use local variable to save the MTRR settings. 

When the hang issue raised, in the time the AP begin to run the procedure, the 
BSP has leave current function. So the saved MTRR setting in the local variable 
is not valid anymore. So the MtrrSetAllMtrrs function use an invalid Mtrr 
Setttings to set mtrrs and never exit the procedure. Do you think it's 
reasonable?

I found till the ExitBootService point, the AP still in busy state. I check the 
ApWakeupFunction function, found between the AP procedure and set AP state to 
Idle, no other possible code exist. So I think the AP should not return back 
from procedure. I try to add debug message to know where the AP is stopped at. 
But after I add debug message in MtrrSetAllMtrrs, the issue can't reproduce 
anymore. I tried about 30 times. Do you have other message to get to know where 
the AP is?

Thanks,
Eric

> 
> When I first encountered this symptom now, while playing some more with
> your patches, it reminded me of earlier problems with MpInitLib. And now
> your analysis makes perfect sense of this additional symptom!
> 
> > In this case, the AP state keeps at CpuStateBusy. So later in
> > ChangeApLoopCallback function, because this AP state still in
> > CpuStateBusy, this AP will not trig the procedure. But BSP wait all
> > APs to trig the procedure(BSP wait the Aps to reduce the
> > mNumberToFinish value in procedure to continue boot) to continue the
> > boot, so the hang occurred.
> 
> This completes the explanation.
> 
> > I think we should keep a middle state to let us know whether the AP
> > truly finished its task. I will send  another serial patch for this
> > issue. Please help to check the new patches.
> 
> Yes, I'll test them too.
> 
> Thanks!
> Laszlo
___
edk2-devel mailing list
edk2-devel@lists.01.org
https://lists.01.org/mailman/listinfo/edk2-devel


Re: [edk2] [Patch V2] UefiCpuPkg/MpInitLib: Remove redundant parameter.

2018-07-25 Thread Laszlo Ersek
On 07/25/18 05:50, Dong, Eric wrote:
> Hi Laszlo,
>
> I have root cause this issue, the AP hangs in the procedure when
> PiSmmCpuDxeSmm driver start up trigged this issue.
>
> When PiSmmCpuDxeSmm driver start up, it will call StartAllAps to set
> memory attribute.  In StartAllAps function, after call WakeUpAp to
> start Aps, it calls CheckAllAps to wait all Aps finished the task. In
> CheckAllAps function, it detect AP state to know whether the AP has
> finished its task. In old code, it check whether the AP state is
> CpuStateFinished to know whether AP has finished tasks. This state is
> only set by AP when it truly finished task. In new logic,
> CpuStateFinished been replace with CpuStateIdle. And CpuStateIdle is
> also the begin state of the AP. AP will change state from CpuStateIdle
> to CpuStateBusy when it start execute the procedure. And after it
> finished the procedure, it will change state back to CpuStateIdle.
>
> So when the hang issue raised, AP state is not been changed to
> CpuStateBusy when BSP calls CheckAllAps to check whether the AP has
> finished its task. So the state for the AP still in CpuStateIdle, but
> BSP think AP has finished its task. In this case, BSP think all the
> Aps has finished their tasks and it continues boot.

Awesome analysis!

So, this looks like an "inverse" variant of the classic "ABA problem":

https://en.wikipedia.org/wiki/ABA_problem

> But some AP may wake up later and it failed to return from the
> procedure.

Ah! So that explains another symptom I've since seen as well -- although
*very* rarely. Namely, if an AP wakes up *after* PiSmmCpuDxeSmm moves
on, thinking that all APs are finished, the AP can execute garbage in
"no man's land" -- and that crashes the guest. Basically, QEMU/KVM pause
the guest with "emulation failure", and QEMU dumps the VCPU register
state to the standard error on the host side. In particular, the
register state indicates that the crashed VCPU is *not* in SMM.

When I first encountered this symptom now, while playing some more with
your patches, it reminded me of earlier problems with MpInitLib. And now
your analysis makes perfect sense of this additional symptom!

> In this case, the AP state keeps at CpuStateBusy. So later in
> ChangeApLoopCallback function, because this AP state still in
> CpuStateBusy, this AP will not trig the procedure. But BSP wait all
> APs to trig the procedure(BSP wait the Aps to reduce the
> mNumberToFinish value in procedure to continue boot) to continue the
> boot, so the hang occurred.

This completes the explanation.

> I think we should keep a middle state to let us know whether the AP
> truly finished its task. I will send  another serial patch for this
> issue. Please help to check the new patches.

Yes, I'll test them too.

Thanks!
Laszlo
___
edk2-devel mailing list
edk2-devel@lists.01.org
https://lists.01.org/mailman/listinfo/edk2-devel


Re: [edk2] [Patch V2] UefiCpuPkg/MpInitLib: Remove redundant parameter.

2018-07-24 Thread Dong, Eric
Hi Laszlo,

I have root cause this issue, the AP hangs in the procedure when PiSmmCpuDxeSmm 
driver start up trigged this issue.

When PiSmmCpuDxeSmm driver start up, it will call StartAllAps to set memory 
attribute.  In StartAllAps function, after call WakeUpAp to start Aps, it calls 
CheckAllAps to wait all Aps finished the task. In CheckAllAps function, it 
detect AP state to know whether the AP has finished its task. In old code, it 
check whether the AP state is CpuStateFinished to know whether AP has finished 
tasks. This state is only set by AP when it truly finished task. In new logic, 
CpuStateFinished been replace with CpuStateIdle. And CpuStateIdle is also the 
begin state of the AP. AP will change state from CpuStateIdle to CpuStateBusy 
when it start execute the procedure. And after it finished the procedure, it 
will change state back to CpuStateIdle.

So when the hang issue raised, AP state is not been changed to CpuStateBusy 
when BSP calls CheckAllAps to check whether the AP has finished its task. So 
the state for the AP still in CpuStateIdle, but BSP think AP has finished its 
task. In this case, BSP think all the Aps has finished their tasks and it 
continues boot. But some AP may wake up later and it failed to return from the 
procedure. In this case, the AP state keeps at CpuStateBusy. So later in 
ChangeApLoopCallback function, because this AP state still in CpuStateBusy, 
this AP will not trig the procedure. But BSP wait all APs to trig the 
procedure(BSP wait the Aps to reduce the mNumberToFinish value in procedure to 
continue boot) to continue the boot, so the hang occurred.

I think we should keep a middle state to let us know whether the AP truly 
finished its task. I will send  another serial patch for this issue. Please 
help to check the new patches.

Thanks,
Eric

> -Original Message-
> From: edk2-devel [mailto:edk2-devel-boun...@lists.01.org] On Behalf Of
> Laszlo Ersek
> Sent: Saturday, July 21, 2018 12:30 AM
> To: Dong, Eric ; edk2-devel@lists.01.org
> Cc: Ni, Ruiyu 
> Subject: Re: [edk2] [Patch V2] UefiCpuPkg/MpInitLib: Remove redundant
> parameter.
> 
> On 07/20/18 08:53, Dong, Eric wrote:
> >> -Original Message- From: Laszlo Ersek
> >> [mailto:ler...@redhat.com]
> 
> >> Therefore, please upgrade the host to Fedora 26. In Fedora 26, QEMU
> >> 2.9 is shipped:
> >>
> >> https://koji.fedoraproject.org/koji/buildinfo?buildID=986762
> >>
> >> ... It's even better if you can upgrade to Fedora 27, as Fedora 27 is
> >> the oldest Fedora release still supported at this point. The
> >> following article describes the recommended upgrade method:
> >>
> >> https://fedoraproject.org/wiki/DNF_system_upgrade
> >>
> >
> > I updated the system to fedora 28, but it failed to boot. :(  so I
> > borrowed an exited fedora 27 DVD and installed it. With this OS, I can
> > reproduce this issue now. I found this issue is an random issue, I
> > booted 5 times and met the issue.  I'm checking the issue.
> 
> Awesome!
> 
> (I'm not happy about the problem itself, of course, but I'm *very* thankful
> that you took the time to install a Linux box, for testing with
> KVM!!!)
> 
> Laszlo
> ___
> edk2-devel mailing list
> edk2-devel@lists.01.org
> https://lists.01.org/mailman/listinfo/edk2-devel
___
edk2-devel mailing list
edk2-devel@lists.01.org
https://lists.01.org/mailman/listinfo/edk2-devel


Re: [edk2] [Patch V2] UefiCpuPkg/MpInitLib: Remove redundant parameter.

2018-07-20 Thread Laszlo Ersek
On 07/20/18 08:53, Dong, Eric wrote:
>> -Original Message- From: Laszlo Ersek
>> [mailto:ler...@redhat.com]

>> Therefore, please upgrade the host to Fedora 26. In Fedora 26, QEMU
>> 2.9 is shipped:
>> 
>> https://koji.fedoraproject.org/koji/buildinfo?buildID=986762
>> 
>> ... It's even better if you can upgrade to Fedora 27, as Fedora 27
>> is the oldest Fedora release still supported at this point. The
>> following article describes the recommended upgrade method:
>> 
>> https://fedoraproject.org/wiki/DNF_system_upgrade
>> 
> 
> I updated the system to fedora 28, but it failed to boot. :(  so I
> borrowed an exited fedora 27 DVD and installed it. With this OS, I
> can reproduce this issue now. I found this issue is an random issue,
> I booted 5 times and met the issue.  I'm checking the issue.

Awesome!

(I'm not happy about the problem itself, of course, but I'm *very*
thankful that you took the time to install a Linux box, for testing with
KVM!!!)

Laszlo
___
edk2-devel mailing list
edk2-devel@lists.01.org
https://lists.01.org/mailman/listinfo/edk2-devel


Re: [edk2] [Patch V2] UefiCpuPkg/MpInitLib: Remove redundant parameter.

2018-07-20 Thread Dong, Eric
Hi Laszlo,


> -Original Message-
> From: Laszlo Ersek [mailto:ler...@redhat.com]
> Sent: Friday, July 20, 2018 1:01 AM
> To: Dong, Eric ; edk2-devel@lists.01.org
> Cc: Ni, Ruiyu 
> Subject: Re: [edk2] [Patch V2] UefiCpuPkg/MpInitLib: Remove redundant
> parameter.
> 
> Hi Eric,
> 
> apologies about the delay.
> 
> On 07/18/18 14:59, Dong, Eric wrote:
> > Hi Laszlo,
> >
> > I finally succeed to setup the OVMF platform which can verify the boot
> > failure issue.  But on my platform, if I use image build with below
> > command (I assume it is used to enable SMM), the system can't boot to
> > OS (host OS is fedora 25 and guest OS is Ubuntu 18.04). It hang at OS
> > boot phase after ExitBootService point (I can see the console log
> > which should been printed at ExitBootService point, so I think hang
> > should after this point).
> > build -a IA32 -a X64 -p OvmfPkg/OvmfPkgIa32X64.dsc -t VS2015x86 -b
> > NOOPT -D SMM_REQUIRE -D SECURE_BOOT_ENABLE -D TLS_ENABLE
> >
> > If I use below command to build the image, the system can boot to OS.
> > build -a IA32 -a X64 -p OvmfPkg\OvmfPkgIa32X64.dsc -t VS2015x86 -b
> > NOOPT
> >
> > Does my OVMF environment still has problem?
> >
> >
> > When do the above test, I don't include my two patches.
> 
> Yes, I think this host environment is still problematic. Namely, the latest
> QEMU version shipped in Fedora 25 is QEMU-2.7:
> 
>   https://koji.fedoraproject.org/koji/buildinfo?buildID=918114
> 
> and QEMU-2.7 does not have a feature that is important for SMM stability.
> This feature is called "SMI broadcast".
> 
> In OVMF, the "OvmfPkg/SmmControl2Dxe" runtime driver implements
> EFI_SMM_CONTROL2_PROTOCOL (which is a runtime protocol). The Trigger()
> member function raises an SMI, by writing to IO port 0xB2 (ICH9_APM_CNT).
> 
> Originally, QEMU would raise the SMI synchronously only on the sole VCPU
> that called Trigger(). Then, the edk2 SMM driver stack would have to pull the
> other processors explicitly into SMM (via APIC accesses, if I remember
> correctly). This was extremely slow (the processor first raising the SMI would
> wait for a long time for the other processors to show up in SMM, before it
> would decide to pull them in with APIC writes). Also when we switched the
> edk2 SMM sync mode to "relaxed", the results remained very unstable. We
> decided that edk2 supported the "traditional" SMM sync mode much better,
> and so we implemented "SMI broadcast" in QEMU, to satisfy that sync mode.
> 
> (My memories are a bit fuzzy at this point; you can read more in the following
> RH Bugzilla entries:
> 
>   https://bugzilla.redhat.com/show_bug.cgi?id=1412327 [QEMU]
>   https://bugzilla.redhat.com/show_bug.cgi?id=1412313 [OVMF])
> 
> The idea of "SMI broadcast" is that, regardless of which VCPU triggers the
> SMI, QEMU raises the SMI immediately on all VCPUs. This made a
> *huge* difference for the performance and the stability of the edk2 SMM
> driver stack, used in OVMF and on QEMU/KVM.
> 
> Now, in order to be able to use old OVMF on new QEMU and vice versa, this
> feature is runtime-negotiated between "OvmfPkg/SmmControl2Dxe" and
> QEMU. (The feature is not enabled by default, and without "SMI broadcast",
> the "relaxed" sync method is slightly less broken than the "tradiational"
> method, so OVMF defaults to that. With the feature enabled, the "traditional"
> mode is better -- that config is the absolute best of all four possible
> combinations.)
> 
> More precisely, on the QEMU side, the feature is not tied to a QEMU release,
> but to Q35 *machine type versions*. Therefore, in order to benefit from the
> feature, you need all of the following:
> 
> - a recent enough OVMF,
> - a recent enough QEMU release,
> - a recent enough Q35 machine type, specified on the QEMU command line.
> 
> The particular minimum machine type is "pc-q35-2.9" (which is clearly only
> provided by QEMU-2.9 and later). The machine type requirement is
> automatically satisfied if you use QEMU-2.9+, and just request the "q35"
> machine type. (Without an explicit machtype version number, the highest one
> supported by the QEMU release will be picked.)
> 
> The lack of this feature in your environment is confirmed by your OVMF
> log:
> 
> > NegotiateSmiFeatures: SMI feature negotiation unavailable
> 
> If the feature is available, you will see the following two messages
> instead:
> 
>   NegotiateSmiFeatures: using SMI broadcast
>   [...]
>   AppendFwCfgBootScript: SMI feature negot

Re: [edk2] [Patch V2] UefiCpuPkg/MpInitLib: Remove redundant parameter.

2018-07-19 Thread Laszlo Ersek
Hi Eric,

apologies about the delay.

On 07/18/18 14:59, Dong, Eric wrote:
> Hi Laszlo,
>
> I finally succeed to setup the OVMF platform which can verify the boot
> failure issue.  But on my platform, if I use image build with below
> command (I assume it is used to enable SMM), the system can't boot to
> OS (host OS is fedora 25 and guest OS is Ubuntu 18.04). It hang at OS
> boot phase after ExitBootService point (I can see the console log
> which should been printed at ExitBootService point, so I think hang
> should after this point).
>   build -a IA32 -a X64 -p OvmfPkg/OvmfPkgIa32X64.dsc -t VS2015x86 -b 
> NOOPT -D SMM_REQUIRE -D SECURE_BOOT_ENABLE -D TLS_ENABLE
>
> If I use below command to build the image, the system can boot to OS.
>   build -a IA32 -a X64 -p OvmfPkg\OvmfPkgIa32X64.dsc -t VS2015x86 -b NOOPT
>
> Does my OVMF environment still has problem?
>
>
> When do the above test, I don't include my two patches.

Yes, I think this host environment is still problematic. Namely, the
latest QEMU version shipped in Fedora 25 is QEMU-2.7:

  https://koji.fedoraproject.org/koji/buildinfo?buildID=918114

and QEMU-2.7 does not have a feature that is important for SMM
stability. This feature is called "SMI broadcast".

In OVMF, the "OvmfPkg/SmmControl2Dxe" runtime driver implements
EFI_SMM_CONTROL2_PROTOCOL (which is a runtime protocol). The Trigger()
member function raises an SMI, by writing to IO port 0xB2
(ICH9_APM_CNT).

Originally, QEMU would raise the SMI synchronously only on the sole VCPU
that called Trigger(). Then, the edk2 SMM driver stack would have to
pull the other processors explicitly into SMM (via APIC accesses, if I
remember correctly). This was extremely slow (the processor first
raising the SMI would wait for a long time for the other processors to
show up in SMM, before it would decide to pull them in with APIC
writes). Also when we switched the edk2 SMM sync mode to "relaxed", the
results remained very unstable. We decided that edk2 supported the
"traditional" SMM sync mode much better, and so we implemented "SMI
broadcast" in QEMU, to satisfy that sync mode.

(My memories are a bit fuzzy at this point; you can read more in the
following RH Bugzilla entries:

  https://bugzilla.redhat.com/show_bug.cgi?id=1412327 [QEMU]
  https://bugzilla.redhat.com/show_bug.cgi?id=1412313 [OVMF])

The idea of "SMI broadcast" is that, regardless of which VCPU triggers
the SMI, QEMU raises the SMI immediately on all VCPUs. This made a
*huge* difference for the performance and the stability of the edk2 SMM
driver stack, used in OVMF and on QEMU/KVM.

Now, in order to be able to use old OVMF on new QEMU and vice versa,
this feature is runtime-negotiated between "OvmfPkg/SmmControl2Dxe" and
QEMU. (The feature is not enabled by default, and without "SMI
broadcast", the "relaxed" sync method is slightly less broken than the
"tradiational" method, so OVMF defaults to that. With the feature
enabled, the "traditional" mode is better -- that config is the absolute
best of all four possible combinations.)

More precisely, on the QEMU side, the feature is not tied to a QEMU
release, but to Q35 *machine type versions*. Therefore, in order to
benefit from the feature, you need all of the following:

- a recent enough OVMF,
- a recent enough QEMU release,
- a recent enough Q35 machine type, specified on the QEMU command line.

The particular minimum machine type is "pc-q35-2.9" (which is clearly
only provided by QEMU-2.9 and later). The machine type requirement is
automatically satisfied if you use QEMU-2.9+, and just request the "q35"
machine type. (Without an explicit machtype version number, the highest
one supported by the QEMU release will be picked.)

The lack of this feature in your environment is confirmed by your OVMF
log:

> NegotiateSmiFeatures: SMI feature negotiation unavailable

If the feature is available, you will see the following two messages
instead:

  NegotiateSmiFeatures: using SMI broadcast
  [...]
  AppendFwCfgBootScript: SMI feature negotiation boot script saved

(The second message only appears if you have S3 enabled -- at S3 resume,
the feature has to be re-enabled, so SmmControl2Dxe saves a boot script
fragment for that.)

Therefore, please upgrade the host to Fedora 26. In Fedora 26, QEMU 2.9
is shipped:

  https://koji.fedoraproject.org/koji/buildinfo?buildID=986762

... It's even better if you can upgrade to Fedora 27, as Fedora 27 is
the oldest Fedora release still supported at this point. The following
article describes the recommended upgrade method:

  https://fedoraproject.org/wiki/DNF_system_upgrade

> Then I include my patches and build the image with SMM enabled, I
> found I can't reproduce the issue you met. I can find the
> "MpInitChangeApLoopCallback done!" message in the console log.
> Attached the console log.

Yes, I can see "MpInitChangeApLoopCallback() done" in the log.

> Can you help to verify the OVMF image build from my side?

Your firmware image (SHA1: 

Re: [edk2] [Patch V2] UefiCpuPkg/MpInitLib: Remove redundant parameter.

2018-06-29 Thread Laszlo Ersek
Hi Eric,

On 06/29/18 05:20, Eric Dong wrote:
> Parameter StartCount duplicates with RunningCount. After this change,
> RunningCount means the running AP count.
>
> V2 changes: Remove volatile for RunningCount.
>
> Done Test:
> 1.PI SCT Test
> 2.Boot OS / S3
>
> Cc: Ruiyu Ni 
> Cc: Jeff Fan 
> Contributed-under: TianoCore Contribution Agreement 1.1
> Signed-off-by: Eric Dong 
> ---
>  UefiCpuPkg/Library/MpInitLib/MpLib.c | 11 +--
>  UefiCpuPkg/Library/MpInitLib/MpLib.h |  3 +--
>  2 files changed, 6 insertions(+), 8 deletions(-)
>
> diff --git a/UefiCpuPkg/Library/MpInitLib/MpLib.c 
> b/UefiCpuPkg/Library/MpInitLib/MpLib.c
> index 3945771764..52c9679099 100644
> --- a/UefiCpuPkg/Library/MpInitLib/MpLib.c
> +++ b/UefiCpuPkg/Library/MpInitLib/MpLib.c
> @@ -1400,7 +1400,7 @@ CheckAllAPs (
>  // value of state after setting the it to CpuStateFinished, so BSP can 
> safely make use of its value.
>  //
>  if (GetApState(CpuData) != CpuStateBusy) {
> -  CpuMpData->RunningCount ++;
> +  CpuMpData->RunningCount --;
>CpuMpData->CpuData[ProcessorNumber].Waiting = FALSE;
>
>//
> @@ -1425,7 +1425,7 @@ CheckAllAPs (
>//
>// If all APs finish, return EFI_SUCCESS.
>//
> -  if (CpuMpData->RunningCount == CpuMpData->StartCount) {
> +  if (CpuMpData->RunningCount == 0) {
>  return EFI_SUCCESS;
>}
>
> @@ -1442,7 +1442,7 @@ CheckAllAPs (
>  //
>  if (CpuMpData->FailedCpuList != NULL) {
>*CpuMpData->FailedCpuList =
> - AllocatePool ((CpuMpData->StartCount - CpuMpData->FinishedCount + 
> 1) * sizeof (UINTN));
> + AllocatePool ((CpuMpData->RunningCount + 1) * sizeof (UINTN));
>ASSERT (*CpuMpData->FailedCpuList != NULL);
>  }
>  ListIndex = 0;
> @@ -2121,7 +2121,7 @@ StartupAllAPsWorker (
>  return EFI_NOT_STARTED;
>}
>
> -  CpuMpData->StartCount = 0;
> +  CpuMpData->RunningCount = 0;
>for (ProcessorNumber = 0; ProcessorNumber < ProcessorCount; 
> ProcessorNumber++) {
>  CpuData = >CpuData[ProcessorNumber];
>  CpuData->Waiting = FALSE;
> @@ -2131,7 +2131,7 @@ StartupAllAPsWorker (
>  // Mark this processor as responsible for current calling.
>  //
>  CpuData->Waiting = TRUE;
> -CpuMpData->StartCount++;
> +CpuMpData->RunningCount++;
>}
>  }
>}
> @@ -2140,7 +2140,6 @@ StartupAllAPsWorker (
>CpuMpData->ProcArguments = ProcedureArgument;
>CpuMpData->SingleThread  = SingleThread;
>CpuMpData->FinishedCount = 0;
> -  CpuMpData->RunningCount  = 0;
>CpuMpData->FailedCpuList = FailedCpuList;
>CpuMpData->ExpectedTime  = CalculateTimeout (
> TimeoutInMicroseconds,
> diff --git a/UefiCpuPkg/Library/MpInitLib/MpLib.h 
> b/UefiCpuPkg/Library/MpInitLib/MpLib.h
> index 90c09fb8fb..ad62acf766 100644
> --- a/UefiCpuPkg/Library/MpInitLib/MpLib.h
> +++ b/UefiCpuPkg/Library/MpInitLib/MpLib.h
> @@ -210,9 +210,8 @@ struct _CPU_MP_DATA {
>UINTN  BackupBuffer;
>UINTN  BackupBufferSize;
>
> -  volatile UINT32StartCount;
>volatile UINT32FinishedCount;
> -  volatile UINT32RunningCount;
> +  UINT32 RunningCount;
>BOOLEANSingleThread;
>EFI_AP_PROCEDURE   Procedure;
>VOID   *ProcArguments;
>

I got confused by the way you sent out this patch.

First I thought that you meant it separately (stand-alone). I tried to
test it, but it didn't apply. Also my intent was to ask you whether you
wanted to send a new version of

  [edk2] [Patch 1/2] UefiCpuPkg/MpInitLib: Not use disabled AP.


However, upon seeing that this patch wouldn't apply, I'm now thinking
that you would like to preserve

  [edk2] [Patch 1/2] UefiCpuPkg/MpInitLib: Not use disabled AP.

without any changes, and your intent is to only update

  [edk2] [Patch 2/2] UefiCpuPkg/MpInitLib: Remove redundant parameter.

to version 2.


In such cases, please do not post an independent patch. Instead, pick
one of the following:

- Repost the entire series as v2, and mark in the cover letter that
  patch #1 is unchanged from the v1 posting.

- Alternatively, post the patch in response to the *original* v1 thread
  (using --in-reply-to= with git-send-email), and use the subject

  [edk2] [Patch v2 2/2] UefiCpuPkg/MpInitLib: Remove redundant
parameter.

Either of these approaches will let reviewers know that you intend the
two patches to go together (and in what order), and that only patch #2
has been updated.


So... Assuming this was indeed your intent, I applied the following two
patches locally, for testing:

  [edk2] [Patch 1_2] UefiCpuPkg_MpInitLib: Not use disabled AP.
  Message-Id: <20180628112920.5296-1-eric.d...@intel.com>
  20180628112920.5296-1-eric.dong@intel.com">http://mid.mail-archive.com/20180628112920.5296-1-eric.dong@intel.com

  [edk2] 

[edk2] [Patch V2] UefiCpuPkg/MpInitLib: Remove redundant parameter.

2018-06-28 Thread Eric Dong
Parameter StartCount duplicates with RunningCount. After this change, 
RunningCount means the running AP count.

V2 changes: Remove volatile for RunningCount.

Done Test:
1.PI SCT Test
2.Boot OS / S3

Cc: Ruiyu Ni 
Cc: Jeff Fan 
Contributed-under: TianoCore Contribution Agreement 1.1
Signed-off-by: Eric Dong 
---
 UefiCpuPkg/Library/MpInitLib/MpLib.c | 11 +--
 UefiCpuPkg/Library/MpInitLib/MpLib.h |  3 +--
 2 files changed, 6 insertions(+), 8 deletions(-)

diff --git a/UefiCpuPkg/Library/MpInitLib/MpLib.c 
b/UefiCpuPkg/Library/MpInitLib/MpLib.c
index 3945771764..52c9679099 100644
--- a/UefiCpuPkg/Library/MpInitLib/MpLib.c
+++ b/UefiCpuPkg/Library/MpInitLib/MpLib.c
@@ -1400,7 +1400,7 @@ CheckAllAPs (
 // value of state after setting the it to CpuStateFinished, so BSP can 
safely make use of its value.
 //
 if (GetApState(CpuData) != CpuStateBusy) {
-  CpuMpData->RunningCount ++;
+  CpuMpData->RunningCount --;
   CpuMpData->CpuData[ProcessorNumber].Waiting = FALSE;
 
   //
@@ -1425,7 +1425,7 @@ CheckAllAPs (
   //
   // If all APs finish, return EFI_SUCCESS.
   //
-  if (CpuMpData->RunningCount == CpuMpData->StartCount) {
+  if (CpuMpData->RunningCount == 0) {
 return EFI_SUCCESS;
   }
 
@@ -1442,7 +1442,7 @@ CheckAllAPs (
 //
 if (CpuMpData->FailedCpuList != NULL) {
   *CpuMpData->FailedCpuList =
- AllocatePool ((CpuMpData->StartCount - CpuMpData->FinishedCount + 1) 
* sizeof (UINTN));
+ AllocatePool ((CpuMpData->RunningCount + 1) * sizeof (UINTN));
   ASSERT (*CpuMpData->FailedCpuList != NULL);
 }
 ListIndex = 0;
@@ -2121,7 +2121,7 @@ StartupAllAPsWorker (
 return EFI_NOT_STARTED;
   }
 
-  CpuMpData->StartCount = 0;
+  CpuMpData->RunningCount = 0;
   for (ProcessorNumber = 0; ProcessorNumber < ProcessorCount; 
ProcessorNumber++) {
 CpuData = >CpuData[ProcessorNumber];
 CpuData->Waiting = FALSE;
@@ -2131,7 +2131,7 @@ StartupAllAPsWorker (
 // Mark this processor as responsible for current calling.
 //
 CpuData->Waiting = TRUE;
-CpuMpData->StartCount++;
+CpuMpData->RunningCount++;
   }
 }
   }
@@ -2140,7 +2140,6 @@ StartupAllAPsWorker (
   CpuMpData->ProcArguments = ProcedureArgument;
   CpuMpData->SingleThread  = SingleThread;
   CpuMpData->FinishedCount = 0;
-  CpuMpData->RunningCount  = 0;
   CpuMpData->FailedCpuList = FailedCpuList;
   CpuMpData->ExpectedTime  = CalculateTimeout (
TimeoutInMicroseconds,
diff --git a/UefiCpuPkg/Library/MpInitLib/MpLib.h 
b/UefiCpuPkg/Library/MpInitLib/MpLib.h
index 90c09fb8fb..ad62acf766 100644
--- a/UefiCpuPkg/Library/MpInitLib/MpLib.h
+++ b/UefiCpuPkg/Library/MpInitLib/MpLib.h
@@ -210,9 +210,8 @@ struct _CPU_MP_DATA {
   UINTN  BackupBuffer;
   UINTN  BackupBufferSize;
 
-  volatile UINT32StartCount;
   volatile UINT32FinishedCount;
-  volatile UINT32RunningCount;
+  UINT32 RunningCount;
   BOOLEANSingleThread;
   EFI_AP_PROCEDURE   Procedure;
   VOID   *ProcArguments;
-- 
2.15.0.windows.1

___
edk2-devel mailing list
edk2-devel@lists.01.org
https://lists.01.org/mailman/listinfo/edk2-devel