Hi Laszlo,

> -----Original Message-----
> From: Laszlo Ersek [mailto:[email protected]]
> Sent: Friday, July 20, 2018 1:01 AM
> To: Dong, Eric <[email protected]>; [email protected]
> Cc: Ni, Ruiyu <[email protected]>
> Subject: Re: [edk2] [Patch V2] UefiCpuPkg/MpInitLib: Remove redundant
> parameter.
> 
> Hi Eric,
> 
> apologies about the delay.
> 
> On 07/18/18 14:59, Dong, Eric wrote:
> > Hi Laszlo,
> >
> > I finally succeeded in setting up an OVMF platform that can verify the
> > boot failure issue.  But on my platform, if I use an image built with
> > the command below (I assume it is used to enable SMM), the system can't
> > boot to the OS (host OS is Fedora 25 and guest OS is Ubuntu 18.04). It
> > hangs in the OS boot phase after the ExitBootServices point (I can see
> > the console log that should be printed at ExitBootServices, so I think
> > the hang occurs after this point).
> >     build -a IA32 -a X64 -p OvmfPkg/OvmfPkgIa32X64.dsc -t VS2015x86 -b
> > NOOPT -D SMM_REQUIRE -D SECURE_BOOT_ENABLE -D TLS_ENABLE
> >
> > If I use the command below to build the image, the system can boot to OS.
> >     build -a IA32 -a X64 -p OvmfPkg\OvmfPkgIa32X64.dsc -t VS2015x86 -b
> > NOOPT
> >
> > Does my OVMF environment still have a problem?
> >
> >
> > When I did the above test, I did not include my two patches.
> 
> Yes, I think this host environment is still problematic. Namely, the latest
> QEMU version shipped in Fedora 25 is QEMU-2.7:
> 
>   https://koji.fedoraproject.org/koji/buildinfo?buildID=918114
> 
> and QEMU-2.7 does not have a feature that is important for SMM stability.
> This feature is called "SMI broadcast".
> 
> In OVMF, the "OvmfPkg/SmmControl2Dxe" runtime driver implements
> EFI_SMM_CONTROL2_PROTOCOL (which is a runtime protocol). The Trigger()
> member function raises an SMI, by writing to IO port 0xB2 (ICH9_APM_CNT).
> 
> Originally, QEMU would raise the SMI synchronously only on the sole VCPU
> that called Trigger(). Then, the edk2 SMM driver stack would have to pull the
> other processors explicitly into SMM (via APIC accesses, if I remember
> correctly). This was extremely slow (the processor first raising the SMI would
> wait for a long time for the other processors to show up in SMM, before it
> would decide to pull them in with APIC writes). Also when we switched the
> edk2 SMM sync mode to "relaxed", the results remained very unstable. We
> decided that edk2 supported the "traditional" SMM sync mode much better,
> and so we implemented "SMI broadcast" in QEMU, to satisfy that sync mode.
> 
> (My memories are a bit fuzzy at this point; you can read more in the following
> RH Bugzilla entries:
> 
>   https://bugzilla.redhat.com/show_bug.cgi?id=1412327 [QEMU]
>   https://bugzilla.redhat.com/show_bug.cgi?id=1412313 [OVMF])
> 
> The idea of "SMI broadcast" is that, regardless of which VCPU triggers the
> SMI, QEMU raises the SMI immediately on all VCPUs. This made a
> *huge* difference for the performance and the stability of the edk2 SMM
> driver stack, used in OVMF and on QEMU/KVM.
> 
> Now, in order to be able to use old OVMF on new QEMU and vice versa, this
> feature is runtime-negotiated between "OvmfPkg/SmmControl2Dxe" and
> QEMU. (The feature is not enabled by default, and without "SMI broadcast",
> the "relaxed" sync method is slightly less broken than the "traditional"
> method, so OVMF defaults to that. With the feature enabled, the "traditional"
> mode is better -- that config is the absolute best of all four possible
> combinations.)
> 
> More precisely, on the QEMU side, the feature is not tied to a QEMU release,
> but to Q35 *machine type versions*. Therefore, in order to benefit from the
> feature, you need all of the following:
> 
> - a recent enough OVMF,
> - a recent enough QEMU release,
> - a recent enough Q35 machine type, specified on the QEMU command line.
> 
> The particular minimum machine type is "pc-q35-2.9" (which is clearly only
> provided by QEMU-2.9 and later). The machine type requirement is
> automatically satisfied if you use QEMU-2.9+, and just request the "q35"
> machine type. (Without an explicit machtype version number, the highest one
> supported by the QEMU release will be picked.)
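
As a rough illustration of the machine type rule above, here is a minimal Python sketch. The function name and its parameters are hypothetical and not part of QEMU or OVMF; the only facts taken from the text are the "pc-q35-2.9" minimum and the behavior of the unversioned "q35" alias. It checks the machine type only, so a recent enough OVMF is still required separately.

```python
import re

# Minimum Q35 machine type version that offers SMI broadcast, per the text above.
MIN_Q35_VERSION = (2, 9)


def q35_offers_smi_broadcast(machine_type, qemu_version=None):
    """Return True if the machine type is new enough for SMI broadcast.

    machine_type: e.g. "pc-q35-2.9", or the unversioned alias "q35".
    qemu_version: (major, minor) of the QEMU binary. Needed only for the
                  bare "q35" alias, which resolves to the newest machine
                  type that the QEMU release supports.
    """
    match = re.fullmatch(r"pc-q35-(\d+)\.(\d+)", machine_type)
    if match:
        version = (int(match.group(1)), int(match.group(2)))
        return version >= MIN_Q35_VERSION
    if machine_type == "q35":
        if qemu_version is None:
            raise ValueError("the bare 'q35' alias needs the QEMU version")
        return tuple(qemu_version) >= MIN_Q35_VERSION
    raise ValueError(f"not a Q35 machine type: {machine_type!r}")
```

For example, q35_offers_smi_broadcast("q35", (2, 7)) is False, which matches the Fedora 25 situation described earlier.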
> 
> The lack of this feature in your environment is confirmed by your OVMF
> log:
> 
> > NegotiateSmiFeatures: SMI feature negotiation unavailable
> 
> If the feature is available, you will see the following two messages
> instead:
> 
>   NegotiateSmiFeatures: using SMI broadcast
>   [...]
>   AppendFwCfgBootScript: SMI feature negotiation boot script saved
> 
> (The second message only appears if you have S3 enabled -- at S3 resume, the
> feature has to be re-enabled, so SmmControl2Dxe saves a boot script
> fragment for that.)
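
A quick way to check a captured OVMF debug log for these messages could look like the sketch below. The function name and the return values are hypothetical; the message strings are the ones quoted above.

```python
def smi_negotiation_status(log_text):
    """Classify an OVMF debug log by its NegotiateSmiFeatures message.

    Returns "broadcast" if SMI broadcast was negotiated, "unavailable" if
    the negotiation interface was missing, and "unknown" if neither of the
    two messages appears in the log.
    """
    if "NegotiateSmiFeatures: using SMI broadcast" in log_text:
        return "broadcast"
    if "NegotiateSmiFeatures: SMI feature negotiation unavailable" in log_text:
        return "unavailable"
    return "unknown"
```

The S3 boot script message is deliberately not checked, since it only shows up when S3 is enabled.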
> 
> Therefore, please upgrade the host to Fedora 26. In Fedora 26, QEMU 2.9 is
> shipped:
> 
>   https://koji.fedoraproject.org/koji/buildinfo?buildID=986762
> 
> ... It's even better if you can upgrade to Fedora 27, as Fedora 27 is the
> oldest Fedora release still supported at this point. The following article
> describes the recommended upgrade method:
> 
>   https://fedoraproject.org/wiki/DNF_system_upgrade
> 

I updated the system to Fedora 28, but it failed to boot. :(  So I borrowed an
existing Fedora 27 DVD and installed that instead. With this OS, I can now
reproduce the issue. It turns out to be intermittent: I booted 5 times and hit
the issue.  I'm looking into it.

> > Then I included my patches and built the image with SMM enabled; I
> > found I can't reproduce the issue you met. I can find the
> > "MpInitChangeApLoopCallback done!" message in the console log.
> > The console log is attached.
> 
> Yes, I can see "MpInitChangeApLoopCallback() done" in the log.
> 
> > Can you help verify the OVMF image built on my side?
> 
> Your firmware image (SHA1: a11169ef30ab4d0182dbe2c3fc072b0b2e98c06a)
> reproduces the same issue that I reported, on my end. Out of 10 consecutive
> attempts, it only succeeded in booting the OS 3 times (attempts #1, #8 and
> #10). In the failed cases, the log always ends like this:
> 
>   MpInitChangeApLoopCallback :: Processor 8, Enabled Processor 8!
>   RelocateApLoop :: Processor 2 Enter... MwaitSupport = 0!
>   RelocateApLoop :: Processor 3 Enter... MwaitSupport = 0!
>   RelocateApLoop :: Processor 4 Enter... MwaitSupport = 0!
>   RelocateApLoop :: Processor 5 Enter... MwaitSupport = 0!
>   RelocateApLoop :: Processor 6 Enter... MwaitSupport = 0!
>   RelocateApLoop :: Processor 1 Enter... MwaitSupport = 0!
>   <HANG>
> 
> That is, one of the APs fails to show up. It always changes which one is
> missing; for example, another failure:
> 
>   MpInitChangeApLoopCallback :: Processor 8, Enabled Processor 8!
>   RelocateApLoop :: Processor 2 Enter... MwaitSupport = 0!
>   RelocateApLoop :: Processor 7 Enter... MwaitSupport = 0!
>   RelocateApLoop :: Processor 4 Enter... MwaitSupport = 0!
>   RelocateApLoop :: Processor 6 Enter... MwaitSupport = 0!
>   RelocateApLoop :: Processor 3 Enter... MwaitSupport = 0!
>   RelocateApLoop :: Processor 5 Enter... MwaitSupport = 0!
>   <HANG>
> 
> My laptop that I use for testing has 1 socket, 4 cores, and 2 threads.
> This is the same VCPU configuration that I use for the guest (hence the
> 1 BSP + 7 AP config seen above). I got the idea that perhaps the host was
> slightly over-subscribed (= more VCPU work than the physical processors can
> serve in "near real time"), and so I changed the guest config to 1 socket, 2
> cores, and 2 threads (= 1 BSP + 3 APs).
> Unfortunately, the issue reproduced in this config as well, at the 4th
> try:
> 
>   MpInitChangeApLoopCallback :: Processor 4, Enabled Processor 4!
>   RelocateApLoop :: Processor 2 Enter... MwaitSupport = 0!
>   RelocateApLoop :: Processor 1 Enter... MwaitSupport = 0!
>   <HANG>
> 
> Just to be sure, I tested a fresh build (without the patches); that booted
> the OS fine (10 out of 10).
> 
> I think something in the code is sensitive to timing, or lacks some kind of
> synchronization. One of the APs may sometimes be missed. I guess it's
> possible that the SMI broadcast feature, when enabled, helps expose the
> problem.
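
To see which AP is missing from a failed run without reading the log by eye, something like the following Python sketch may help. It is only a sketch: the function name is hypothetical, and it assumes that the BSP has index 0 and the APs are numbered 1 to N-1, where N is the count printed in the "Enabled Processor" message. That numbering is inferred from the logs above and is not confirmed.

```python
import re

# Patterns for the two debug messages quoted in the logs above.
ENABLED_RE = re.compile(
    r"MpInitChangeApLoopCallback :: Processor (\d+), Enabled Processor (\d+)!"
)
ENTER_RE = re.compile(r"RelocateApLoop :: Processor (\d+) Enter")


def missing_aps(log_text):
    """Return the sorted indices of APs that never logged 'Enter'.

    Assumption: the BSP is processor 0 and the APs are 1..N-1, where N is
    the enabled processor count reported by MpInitChangeApLoopCallback.
    """
    enabled = ENABLED_RE.search(log_text)
    if enabled is None:
        raise ValueError("no MpInitChangeApLoopCallback line found")
    total = int(enabled.group(2))
    entered = {int(n) for n in ENTER_RE.findall(log_text)}
    return sorted(set(range(1, total)) - entered)
```

Under that assumption, the first failed log above (processors 2, 3, 4, 5, 6 and 1 entering, out of 8) would report processor 7 as missing.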
> 

Thanks for the information.  I'm investigating this issue and will get back to
you once I have found the root cause.

> Thanks,
> Laszlo
_______________________________________________
edk2-devel mailing list
[email protected]
https://lists.01.org/mailman/listinfo/edk2-devel
