On 01/18/17 11:03, Igor Mammedov wrote:
> On Tue, 17 Jan 2017 19:53:21 +0100
> Laszlo Ersek <ler...@redhat.com> wrote:
>
>> On 01/17/17 15:20, Igor Mammedov wrote:
>>> On Tue, 17 Jan 2017 14:46:05 +0100
>>> Laszlo Ersek <ler...@redhat.com> wrote:
>>>
>>>> On 01/17/17 14:20, Igor Mammedov wrote:
>>>>> On Tue, 17 Jan 2017 13:06:13 +0100
>>>>> Laszlo Ersek <ler...@redhat.com> wrote:
>>>>>
>>>>>> On 01/17/17 12:21, Igor Mammedov wrote:
>>>>>>> On Mon, 16 Jan 2017 21:46:55 +0100
>>>>>>> Laszlo Ersek <ler...@redhat.com> wrote:
>>>>>>>
>>>>>>>> On 01/13/17 14:09, Igor Mammedov wrote:
>>>>>>>>> On Thu, 12 Jan 2017 19:24:45 +0100
>>>>>>>>> Laszlo Ersek <ler...@redhat.com> wrote:
>>>>>>>>>
>>>>>>>>>> The generic edk2 SMM infrastructure prefers
>>>>>>>>>> EFI_SMM_CONTROL2_PROTOCOL.Trigger() to inject an SMI on each
>>>>>>>>>> processor. If Trigger() only brings the current processor into
>>>>>>>>>> SMM, then edk2 handles it in the following ways:
>>>>>>>>>>
>>>>>>>>>> (1) If Trigger() is executed by the BSP (which is guaranteed
>>>>>>>>>>     before ExitBootServices(), but is not necessarily true at
>>>>>>>>>>     runtime), then:
>>>>>>>>>>
>>>>>>>>>>     (a) If edk2 has been configured for "traditional" SMM
>>>>>>>>>>         synchronization, then the BSP sends directed SMIs to the
>>>>>>>>>>         APs with APIC delivery, bringing them into SMM
>>>>>>>>>>         individually. Then the BSP runs the SMI handler /
>>>>>>>>>>         dispatcher.
>>>>>>>>>>
>>>>>>>>>>     (b) If edk2 has been configured for "relaxed" SMM
>>>>>>>>>>         synchronization, then the APs that are not already in SMM
>>>>>>>>>>         are not brought in, and the BSP runs the SMI handler /
>>>>>>>>>>         dispatcher.
>>>>>>>>>>
>>>>>>>>>> (2) If Trigger() is executed by an AP (which is possible after
>>>>>>>>>>     ExitBootServices(), and can be forced e.g.
>>>>>>>>>>     by "taskset -c 1
>>>>>>>>>>     efibootmgr"), then the AP in question brings in the BSP with
>>>>>>>>>>     a directed SMI, and the BSP runs the SMI handler / dispatcher.
>>>>>>>>>>
>>>>>>>>>> The smaller problem with (1a) and (2) is that the BSP and AP
>>>>>>>>>> synchronization is slow. For example, the "taskset -c 1 efibootmgr"
>>>>>>>>>> command from (2) can take more than 3 seconds to complete, because
>>>>>>>>>> efibootmgr accesses non-volatile UEFI variables intensively.
>>>>>>>>>>
>>>>>>>>>> The larger problem is that QEMU's current behavior diverges from
>>>>>>>>>> the behavior usually seen on physical hardware, and that keeps
>>>>>>>>>> exposing obscure corner cases, race conditions and other
>>>>>>>>>> instabilities in edk2, which generally expects / prefers a
>>>>>>>>>> software SMI to affect all CPUs at once.
>>>>>>>>>>
>>>>>>>>>> Therefore introduce the "broadcast SMI" feature that causes QEMU
>>>>>>>>>> to inject the SMI on all VCPUs.
>>>>>>>>>>
>>>>>>>>>> While the original posting of this patch
>>>>>>>>>> <http://lists.nongnu.org/archive/html/qemu-devel/2015-10/msg05658.html>
>>>>>>>>>> only intended to speed up (2), based on our recent "stress
>>>>>>>>>> testing" of SMM this patch actually provides functional
>>>>>>>>>> improvements.
>>>>>>>>>>
>>>>>>>>>> Cc: "Michael S. Tsirkin" <m...@redhat.com>
>>>>>>>>>> Cc: Gerd Hoffmann <kra...@redhat.com>
>>>>>>>>>> Cc: Igor Mammedov <imamm...@redhat.com>
>>>>>>>>>> Cc: Paolo Bonzini <pbonz...@redhat.com>
>>>>>>>>>> Signed-off-by: Laszlo Ersek <ler...@redhat.com>
>>>>>>>>>> Reviewed-by: Michael S.
>>>>>>>>>> Tsirkin <m...@redhat.com>
>>>>>>>>>> ---
>>>>>>>>>>
>>>>>>>>>> Notes:
>>>>>>>>>>     v6:
>>>>>>>>>>     - no changes, pick up Michael's R-b
>>>>>>>>>>
>>>>>>>>>>     v5:
>>>>>>>>>>     - replace the ICH9_LPC_SMI_F_BROADCAST bit value with the
>>>>>>>>>>       ICH9_LPC_SMI_F_BROADCAST_BIT bit position (necessary for
>>>>>>>>>>       DEFINE_PROP_BIT() in the next patch)
>>>>>>>>>>
>>>>>>>>>>  include/hw/i386/ich9.h |  3 +++
>>>>>>>>>>  hw/isa/lpc_ich9.c      | 10 +++++++++-
>>>>>>>>>>  2 files changed, 12 insertions(+), 1 deletion(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/include/hw/i386/ich9.h b/include/hw/i386/ich9.h
>>>>>>>>>> index da1118727146..18dcca7ebcbf 100644
>>>>>>>>>> --- a/include/hw/i386/ich9.h
>>>>>>>>>> +++ b/include/hw/i386/ich9.h
>>>>>>>>>> @@ -250,4 +250,7 @@ Object *ich9_lpc_find(void);
>>>>>>>>>>  #define ICH9_SMB_HST_D1 0x06
>>>>>>>>>>  #define ICH9_SMB_HOST_BLOCK_DB 0x07
>>>>>>>>>>
>>>>>>>>>> +/* bit positions used in fw_cfg SMI feature negotiation */
>>>>>>>>>> +#define ICH9_LPC_SMI_F_BROADCAST_BIT 0
>>>>>>>>>> +
>>>>>>>>>>  #endif /* HW_ICH9_H */
>>>>>>>>>> diff --git a/hw/isa/lpc_ich9.c b/hw/isa/lpc_ich9.c
>>>>>>>>>> index 376b7801a42c..ced6f803a4f2 100644
>>>>>>>>>> --- a/hw/isa/lpc_ich9.c
>>>>>>>>>> +++ b/hw/isa/lpc_ich9.c
>>>>>>>>>> @@ -437,7 +437,15 @@ static void ich9_apm_ctrl_changed(uint32_t val, void *arg)
>>>>>>>>>>
>>>>>>>>>>      /* SMI_EN = PMBASE + 30. SMI control and enable register */
>>>>>>>>>>      if (lpc->pm.smi_en & ICH9_PMIO_SMI_EN_APMC_EN) {
>>>>>>>>>> -        cpu_interrupt(current_cpu, CPU_INTERRUPT_SMI);
>>>>>>>>>> +        if (lpc->smi_negotiated_features &
>>>>>>>>>> +            (UINT64_C(1) << ICH9_LPC_SMI_F_BROADCAST_BIT)) {
>>>>>>>>>> +            CPUState *cs;
>>>>>>>>>> +            CPU_FOREACH(cs) {
>>>>>>>>>> +                cpu_interrupt(cs, CPU_INTERRUPT_SMI);
>>>>>>>>>> +            }
>>>>>>>>> Shouldn't CPUs with default SMI base be excluded from broadcast?
>>>>>>>>>
>>>>>>>>
>>>>>>>> I don't think so; as far as I recall, in edk2 the initial SMM entry
>>>>>>>> code -- before SMBASE relocation -- accommodates all VCPUs entering
>>>>>>>> the same code area for execution. The VCPUs are told apart from each
>>>>>>>> other by Initial APIC ID (that is, "who am I"), which is used as an
>>>>>>>> index or search criterion into a global shared array. Once they find
>>>>>>>> their respective private data blocks, the VCPUs don't step on each
>>>>>>>> other's toes.
>>>>>>> That's what I'm not sure of.
>>>>>>> If a broadcast SMI is sent before relocation, all CPUs will use
>>>>>>> the same SMBASE and as a result save their state in the same
>>>>>>> memory (even if it's the state after RESET/POWER ON), racing with and
>>>>>>> overwriting each other's saved state
>>>>>>
>>>>>> Good point!
>>>>>>
>>>>>>> and then on exit from SMI they all will restore
>>>>>>> whatever state that race produced, so it seems a plainly wrong thing
>>>>>>> to do.
>>>>>>>
>>>>>>> Are you sure that edk2 does broadcast for SMI initialization, or does
>>>>>>> it use directed SMIs?
>>>>>>
>>>>>> Hmmm, you are right. The SmmRelocateBases() function in
>>>>>> "UefiCpuPkg/PiSmmCpuDxeSmm/PiSmmCpuDxeSmm.c" first relocates SMBASE for
>>>>>> each individual AP in succession, by sending SMI IPIs to them, one by
>>>>>> one. Then it does the same for the BSP, by sending itself a similar SMI
>>>>>> IPI.
>>>>>>
>>>>>> The above QEMU code is only exercised much later, after the SMBASE
>>>>>> relocation, with the SMM stack up and running. It is used when a
>>>>>> (preferably broadcast) SMI is triggered for firmware services (for
>>>>>> example, the UEFI variable services).
>>>>>>
>>>>>> So, you are right.
>>>>>>
>>>>>> Considering edk2 only, the difference shouldn't matter -- when this
>>>>>> code is invoked (via an IO port write), the SMBASEs will all have been
>>>>>> relocated.
>>>>>>
>>>>>> Also, I've been informed that on real hardware, writing to APM_CNT
>>>>>> triggers an SMI on all CPUs, regardless of the individual SMBASE
>>>>>> values. So I believe the above code should be closest to real
>>>>>> hardware, and the pre-patch code was only written in unicast form for
>>>>>> SeaBIOS's sake.
>>>>>>
>>>>>> I think the code above is safe. If the guest injects an SMI via
>>>>>> APM_CNT after negotiating SMI broadcast, it should not have left any
>>>>>> VCPUs without an SMI handler by that time. edk2 / OVMF conforms to
>>>>>> this. In the future we can tweak this further, if necessary; we have
>>>>>> 63 more bits, and we'll be able to reject invalid combinations of
>>>>>> bits.
>>>>>>
>>>>>> Do you feel that we should skip VCPUs whose SMBASE has not been
>>>>>> changed from the default?
>>>>>>
>>>>>> If so, doesn't that run the risk of missing a VCPU that has an actual
>>>>>> SMI handler installed, and it just happens to be placed at the default
>>>>>> SMBASE location?
>>>>>>
>>>>>> ... I wouldn't mind skipping VCPUs with apparently unrelocated
>>>>>> SMBASEs, but I think (a) that's not what real hardware does, and (b)
>>>>>> it is risky if a VCPU's actual SMI handler has been installed while
>>>>>> keeping the default SMBASE value.
>>>>>>
>>>>>> What do you think?
>>>>> (a) probably doesn't consider hotplug, so it's totally fine for the
>>>>> case where firmware is the only one who wakes up/initializes CPUs.
>>>>> However, consider 2 CPUs hotplugged and then a broadcast SMI happening
>>>>> to serve some other task (if there isn't an unpark 'feature').
>>>>> Even if real hw does it, it's behavior not defined by the SDM, with
>>>>> possibly bad consequences.
>>>>>
>>>>> (b) alone is risky, so I'd skip based on a combination:
>>>>>
>>>>>     if (default SMBASE && CPU is in reset state)
>>>>>         skip;
>>>>>
>>>>> that should be safe, and it leaves open the possibility to avoid
>>>>> adding an unpark 'feature' to the CPU.
>>>>
>>>> The above attributes ("SMBASE" and "CPU is in reset state") are
>>>> specific to x86 VCPUs. Should I use some dynamic QOM casting for this?
>>>> Like the X86_CPU() macro?
>>>>
>>>> Can I assume here that the macro will never return NULL? (I think so,
>>>> LPC is only used with x86.)
>>> yep, that should work.
>>>
>>>>
>>>> ... I guess SMBASE can be accessed as
>>>>
>>>>     X86_CPU(cs)->env.smbase
>>> preferred style:
>>>     X86CPU *cpu = X86_CPU(cs);
>>>     cpu->...
>>>
>>>>
>>>> How do I figure out if the CPU is in reset state ("waiting for first
>>>> INIT") though? Note that this state should be distinguished from simply
>>>> "halted" (i.e., after a HLT). After a HLT, the SMI should be injected
>>>> and delivered; see commit a9bad65d2c1f. Edk2 regularly uses HLT to send
>>>> APs to sleep.
>>>
>>> how about using the EIP reset value?
>>>     x86_cpu_reset()
>>>         ...
>>>         env->eip = 0xfff0;
>>
>> Okay, so I tried this. It doesn't work.
>>
>> This is the code I wrote (extracting a new ich9_apm_broadcast_smi()
>> function from ich9_apm_ctrl_changed(), due to size reasons):
>>
>>> static void ich9_apm_broadcast_smi(void)
>>> {
>>>     CPUState *cs;
>>>
>>>     cpu_synchronize_all_states(); /* <--------- mark this */
> it probably should be executed after pause_all_vcpus(),
> maybe it will help.
>
>>>     CPU_FOREACH(cs) {
>>>         X86CPU *cpu = X86_CPU(cs);
>>>         CPUX86State *env = &cpu->env;
>>>
>>>         if (env->smbase == 0x30000 && env->eip == 0xfff0) {
>>>             CPUClass *k = CPU_GET_CLASS(cs);
>>>             uint64_t cpu_arch_id = k->get_arch_id(cs);
>>>
>>>             trace_ich9_apm_broadcast_smi_skip(cpu_arch_id);
>>>             continue;
>>>         }
>>>
>>>         cpu_interrupt(cs, CPU_INTERRUPT_SMI);
>>>     }
>>> }
>>
> [...]
>> (b) If I add the cpu_synchronize_all_states() call before the loop, then
>> the incorrect filter matches go away; QEMU sees the KVM VCPU states
>> correctly, and the SMI is broadcast.
>>
>> However, in this case, the boot slows to a *crawl*. It's unbearable.
>> All
>> VCPUs are pegged at 100%, and you can easily follow the OVMF debug log
>> with the naked eye, almost line by line.
>> I didn't expect that cpu_synchronize_all_states() would be so costly,
>> but it makes the idea of checking VCPU state in the APM_CNT handler a
>> non-starter.
> I wonder where the bottleneck is that slows down the guest so much.
> What is the actual cost of calling cpu_synchronize_all_states()?
>
> Maybe calling pause_all_vcpus() before cpu_synchronize_all_states()
> would help.
If I prepend just a pause_all_vcpus() function call, at the top, then the
VM freezes (quite understandably) when the first SMI is raised via
APM_CNT.

If I, instead, bracket the function body with pause_all_vcpus() and
resume_all_vcpus(), like this:

    static void ich9_apm_broadcast_smi(void)
    {
        CPUState *cs;

        pause_all_vcpus();
        cpu_synchronize_all_states();
        CPU_FOREACH(cs) {
            X86CPU *cpu = X86_CPU(cs);
            CPUX86State *env = &cpu->env;

            if (env->smbase == 0x30000 && env->eip == 0xfff0) {
                CPUClass *k = CPU_GET_CLASS(cs);
                uint64_t cpu_arch_id = k->get_arch_id(cs);

                trace_ich9_apm_broadcast_smi_skip(cpu_arch_id);
                continue;
            }

            cpu_interrupt(cs, CPU_INTERRUPT_SMI);
        }
        resume_all_vcpus();
    }

then I see the following symptoms:

- rather than 4 VCPUs being pegged at 100%, only one VCPU is pegged at
  100%

- the boot process, as observed from the OVMF log, is just as slow
  (= crawling) as without the VCPU pausing/resuming (i.e., like with
  cpu_synchronize_all_states() only); so ultimately the pausing/resuming
  doesn't help.

Thanks
Laszlo