Hi Laszlo,

> -----Original Message-----
> From: Laszlo Ersek [mailto:[email protected]]
> Sent: Friday, July 20, 2018 1:01 AM
> To: Dong, Eric <[email protected]>; [email protected]
> Cc: Ni, Ruiyu <[email protected]>
> Subject: Re: [edk2] [Patch V2] UefiCpuPkg/MpInitLib: Remove redundant
> parameter.
>
> Hi Eric,
>
> apologies about the delay.
>
> On 07/18/18 14:59, Dong, Eric wrote:
> > Hi Laszlo,
> >
> > I finally succeed to setup the OVMF platform which can verify the boot
> > failure issue. But on my platform, if I use image build with below
> > command (I assume it is used to enable SMM), the system can't boot to
> > OS (host OS is fedora 25 and guest OS is Ubuntu 18.04). It hang at OS
> > boot phase after ExitBootService point (I can see the console log
> > which should been printed at ExitBootService point, so I think hang
> > should after this point).
> > build -a IA32 -a X64 -p OvmfPkg/OvmfPkgIa32X64.dsc -t VS2015x86 -b
> > NOOPT -D SMM_REQUIRE -D SECURE_BOOT_ENABLE -D TLS_ENABLE
> >
> > If I use below command to build the image, the system can boot to OS.
> > build -a IA32 -a X64 -p OvmfPkg\OvmfPkgIa32X64.dsc -t VS2015x86 -b
> > NOOPT
> >
> > Does my OVMF environment still has problem?
> >
> >
> > When do the above test, I don't include my two patches.
>
> Yes, I think this host environment is still problematic. Namely, the latest
> QEMU version shipped in Fedora 25 is QEMU-2.7:
>
> https://koji.fedoraproject.org/koji/buildinfo?buildID=918114
>
> and QEMU-2.7 does not have a feature that is important for SMM stability.
> This feature is called "SMI broadcast".
>
> In OVMF, the "OvmfPkg/SmmControl2Dxe" runtime driver implements
> EFI_SMM_CONTROL2_PROTOCOL (which is a runtime protocol). The Trigger()
> member function raises an SMI, by writing to IO port 0xB2 (ICH9_APM_CNT).
>
> Originally, QEMU would raise the SMI synchronously only on the sole VCPU
> that called Trigger().
> Then, the edk2 SMM driver stack would have to pull the
> other processors explicitly into SMM (via APIC accesses, if I remember
> correctly). This was extremely slow (the processor first raising the SMI would
> wait for a long time for the other processors to show up in SMM, before it
> would decide to pull them in with APIC writes). Also when we switched the
> edk2 SMM sync mode to "relaxed", the results remained very unstable. We
> decided that edk2 supported the "traditional" SMM sync mode much better,
> and so we implemented "SMI broadcast" in QEMU, to satisfy that sync mode.
>
> (My memories are a bit fuzzy at this point; you can read more in the following
> RH Bugzilla entries:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1412327 [QEMU]
> https://bugzilla.redhat.com/show_bug.cgi?id=1412313 [OVMF])
>
> The idea of "SMI broadcast" is that, regardless of which VCPU triggers the
> SMI, QEMU raises the SMI immediately on all VCPUs. This made a
> *huge* difference for the performance and the stability of the edk2 SMM
> driver stack, used in OVMF and on QEMU/KVM.
>
> Now, in order to be able to use old OVMF on new QEMU and vice versa, this
> feature is runtime-negotiated between "OvmfPkg/SmmControl2Dxe" and
> QEMU. (The feature is not enabled by default, and without "SMI broadcast",
> the "relaxed" sync method is slightly less broken than the "traditional"
> method, so OVMF defaults to that. With the feature enabled, the "traditional"
> mode is better -- that config is the absolute best of all four possible
> combinations.)
>
> More precisely, on the QEMU side, the feature is not tied to a QEMU release,
> but to Q35 *machine type versions*. Therefore, in order to benefit from the
> feature, you need all of the following:
>
> - a recent enough OVMF,
> - a recent enough QEMU release,
> - a recent enough Q35 machine type, specified on the QEMU command line.
>
> The particular minimum machine type is "pc-q35-2.9" (which is clearly only
> provided by QEMU-2.9 and later). The machine type requirement is
> automatically satisfied if you use QEMU-2.9+, and just request the "q35"
> machine type. (Without an explicit machtype version number, the highest one
> supported by the QEMU release will be picked.)
>
> The lack of this feature in your environment is confirmed by your OVMF
> log:
>
> > NegotiateSmiFeatures: SMI feature negotiation unavailable
>
> If the feature is available, you will see the following two messages
> instead:
>
> NegotiateSmiFeatures: using SMI broadcast
> [...]
> AppendFwCfgBootScript: SMI feature negotiation boot script saved
>
> (The second message only appears if you have S3 enabled -- at S3 resume, the
> feature has to be re-enabled, so SmmControl2Dxe saves a boot script
> fragment for that.)
>
> Therefore, please upgrade the host to Fedora 26. In Fedora 26, QEMU 2.9 is
> shipped:
>
> https://koji.fedoraproject.org/koji/buildinfo?buildID=986762
>
> ... It's even better if you can upgrade to Fedora 27, as Fedora 27 is the
> oldest Fedora release still supported at this point. The following article
> describes the recommended upgrade method:
>
> https://fedoraproject.org/wiki/DNF_system_upgrade
>

I updated the system to Fedora 28, but it failed to boot. :( So I borrowed an
existing Fedora 27 DVD and installed that instead. With this OS, I can
reproduce the issue now. The issue is intermittent: I booted 5 times and hit
it. I'm looking into it.

> > Then I include my patches and build the image with SMM enabled, I
> > found I can't reproduce the issue you met. I can find the
> > "MpInitChangeApLoopCallback done!" message in the console log.
> > Attached the console log.
>
> Yes, I can see "MpInitChangeApLoopCallback() done" in the log.
>
> > Can you help to verify the OVMF image build from my side?
>
> Your firmware image (SHA1: a11169ef30ab4d0182dbe2c3fc072b0b2e98c06a)
> reproduces the same issue that I reported, on my end. Out of 10 subsequent
> attempts, it only succeeded to boot the OS 3 times (attempts #1, #8 and #10).
> In the failed cases, the log always ends like this:
>
> MpInitChangeApLoopCallback :: Processor 8, Enabled Processor 8!
> RelocateApLoop :: Processor 2 Enter... MwaitSupport = 0!
> RelocateApLoop :: Processor 3 Enter... MwaitSupport = 0!
> RelocateApLoop :: Processor 4 Enter... MwaitSupport = 0!
> RelocateApLoop :: Processor 5 Enter... MwaitSupport = 0!
> RelocateApLoop :: Processor 6 Enter... MwaitSupport = 0!
> RelocateApLoop :: Processor 1 Enter... MwaitSupport = 0!
> <HANG>
>
> That is, one of the APs fails to show up. It always changes which one is
> missing; for example, another failure:
>
> MpInitChangeApLoopCallback :: Processor 8, Enabled Processor 8!
> RelocateApLoop :: Processor 2 Enter... MwaitSupport = 0!
> RelocateApLoop :: Processor 7 Enter... MwaitSupport = 0!
> RelocateApLoop :: Processor 4 Enter... MwaitSupport = 0!
> RelocateApLoop :: Processor 6 Enter... MwaitSupport = 0!
> RelocateApLoop :: Processor 3 Enter... MwaitSupport = 0!
> RelocateApLoop :: Processor 5 Enter... MwaitSupport = 0!
> <HANG>
>
> My laptop that I use for testing has 1 socket, 4 cores, and 2 threads.
> This is the same VCPU configuration that I use for the guest (hence the
> 1 BSP + 7 AP config seen above). I got the idea that perhaps the host was
> slightly over-subscribed (= more VCPU work than the physical processors can
> serve in "near real time"), and so I changed the guest config to 1 socket, 2
> cores, and 2 threads (= 1 BSP + 3 APs).
> Unfortunately, the issue reproduced in this config as well, at the 4th
> try:
>
> MpInitChangeApLoopCallback :: Processor 4, Enabled Processor 4!
> RelocateApLoop :: Processor 2 Enter... MwaitSupport = 0!
> RelocateApLoop :: Processor 1 Enter... MwaitSupport = 0!
> <HANG>
>
> Just to be sure, I tested a fresh build (without the patches); that booted
> the OS fine (10 out of 10).
>
> I think something in the code is sensitive to timing, or lacks some kind of
> synchronization. One of the APs may sometimes be missed. I guess it's
> possible that the SMI broadcast feature, when enabled, helps expose the
> problem.
>

Thanks, this is useful information. I'm investigating this issue and will get
back to you once I have found the root cause.

> Thanks,
> Laszlo
_______________________________________________
edk2-devel mailing list
[email protected]
https://lists.01.org/mailman/listinfo/edk2-devel
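
For triaging console logs like the failing ones quoted above, a small script can report which AP never printed its "Enter..." line. This is only a sketch under assumptions: the two message formats are copied from the quoted debug output (they come from the debug build under test, not necessarily from upstream MpInitLib), the helper name find_missing_aps is hypothetical, and the numbering (BSP = 0, APs = 1..N-1, where N is the count in the "Enabled Processor N" line) is inferred from the three quoted logs rather than confirmed from the code.

```python
import re

# Message formats copied from the debug log lines quoted in this thread.
# Assumption: they are emitted by the debug build under test.
TOTAL_RE = re.compile(
    r"MpInitChangeApLoopCallback :: Processor (\d+), Enabled Processor (\d+)!"
)
ENTER_RE = re.compile(r"RelocateApLoop :: Processor (\d+) Enter\.\.\.")


def find_missing_aps(log_text):
    """Return the sorted AP numbers that never logged 'Enter...'.

    Assumption (inferred from the quoted logs, not from the source code):
    the BSP is processor 0 and the APs are numbered 1..N-1, where N is the
    enabled processor count printed by MpInitChangeApLoopCallback.
    """
    total = None
    seen = set()
    for line in log_text.splitlines():
        match = TOTAL_RE.search(line)
        if match:
            total = int(match.group(2))
            seen.clear()  # a new callback line starts a new round of APs
            continue
        match = ENTER_RE.search(line)
        if match:
            seen.add(int(match.group(1)))
    if total is None:
        raise ValueError("no MpInitChangeApLoopCallback line found in log")
    return sorted(set(range(1, total)) - seen)


if __name__ == "__main__":
    import sys

    # Usage: python find_missing_aps.py ovmf-debug.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        print("APs that never entered RelocateApLoop:",
              find_missing_aps(handle.read()))
```

Running it over each failed boot's log makes it easy to confirm that the missing AP changes from run to run, which is what points toward a timing or synchronization problem rather than one specific processor misbehaving.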

