On 01/18/17 11:03, Igor Mammedov wrote:
> On Tue, 17 Jan 2017 19:53:21 +0100
> Laszlo Ersek <ler...@redhat.com> wrote:
>
>> On 01/17/17 15:20, Igor Mammedov wrote:
>>> On Tue, 17 Jan 2017 14:46:05 +0100
>>> Laszlo Ersek <ler...@redhat.com> wrote:
>>>
>>>> On 01/17/17 14:20, Igor Mammedov wrote:
>>>>> On Tue, 17 Jan 2017 13:06:13 +0100
>>>>> Laszlo Ersek <ler...@redhat.com> wrote:
>>>>>
>>>>>> On 01/17/17 12:21, Igor Mammedov wrote:
>>>>>>> On Mon, 16 Jan 2017 21:46:55 +0100
>>>>>>> Laszlo Ersek <ler...@redhat.com> wrote:
>>>>>>>
>>>>>>>> On 01/13/17 14:09, Igor Mammedov wrote:
>>>>>>>>> On Thu, 12 Jan 2017 19:24:45 +0100
>>>>>>>>> Laszlo Ersek <ler...@redhat.com> wrote:
>>>>>>>>>
>>>>>>>>>> The generic edk2 SMM infrastructure prefers
>>>>>>>>>> EFI_SMM_CONTROL2_PROTOCOL.Trigger() to inject an SMI on each
>>>>>>>>>> processor. If Trigger() only brings the current processor into
>>>>>>>>>> SMM, then edk2 handles it in the following ways:
>>>>>>>>>>
>>>>>>>>>> (1) If Trigger() is executed by the BSP (which is guaranteed
>>>>>>>>>>     before ExitBootServices(), but is not necessarily true at
>>>>>>>>>>     runtime), then:
>>>>>>>>>>
>>>>>>>>>>     (a) If edk2 has been configured for "traditional" SMM
>>>>>>>>>>         synchronization, then the BSP sends directed SMIs to the
>>>>>>>>>>         APs with APIC delivery, bringing them into SMM
>>>>>>>>>>         individually. Then the BSP runs the SMI handler /
>>>>>>>>>>         dispatcher.
>>>>>>>>>>
>>>>>>>>>>     (b) If edk2 has been configured for "relaxed" SMM
>>>>>>>>>>         synchronization, then the APs that are not already in SMM
>>>>>>>>>>         are not brought in, and the BSP runs the SMI handler /
>>>>>>>>>>         dispatcher.
>>>>>>>>>>
>>>>>>>>>> (2) If Trigger() is executed by an AP (which is possible after
>>>>>>>>>>     ExitBootServices(), and can be forced e.g.
>>>>>>>>>>     by "taskset -c 1
>>>>>>>>>>     efibootmgr"), then the AP in question brings in the BSP with
>>>>>>>>>>     a directed SMI, and the BSP runs the SMI handler / dispatcher.
>>>>>>>>>>
>>>>>>>>>> The smaller problem with (1a) and (2) is that the BSP and AP
>>>>>>>>>> synchronization is slow. For example, the "taskset -c 1 efibootmgr"
>>>>>>>>>> command from (2) can take more than 3 seconds to complete, because
>>>>>>>>>> efibootmgr accesses non-volatile UEFI variables intensively.
>>>>>>>>>>
>>>>>>>>>> The larger problem is that QEMU's current behavior diverges from
>>>>>>>>>> the behavior usually seen on physical hardware, and that keeps
>>>>>>>>>> exposing obscure corner cases, race conditions and other
>>>>>>>>>> instabilities in edk2, which generally expects / prefers a
>>>>>>>>>> software SMI to affect all CPUs at once.
>>>>>>>>>>
>>>>>>>>>> Therefore introduce the "broadcast SMI" feature that causes QEMU
>>>>>>>>>> to inject the SMI on all VCPUs.
>>>>>>>>>>
>>>>>>>>>> While the original posting of this patch
>>>>>>>>>> <http://lists.nongnu.org/archive/html/qemu-devel/2015-10/msg05658.html>
>>>>>>>>>> only intended to speed up (2), based on our recent "stress
>>>>>>>>>> testing" of SMM this patch actually provides functional
>>>>>>>>>> improvements.
>>>>>>>>>>
>>>>>>>>>> Cc: "Michael S. Tsirkin" <m...@redhat.com>
>>>>>>>>>> Cc: Gerd Hoffmann <kra...@redhat.com>
>>>>>>>>>> Cc: Igor Mammedov <imamm...@redhat.com>
>>>>>>>>>> Cc: Paolo Bonzini <pbonz...@redhat.com>
>>>>>>>>>> Signed-off-by: Laszlo Ersek <ler...@redhat.com>
>>>>>>>>>> Reviewed-by: Michael S.
>>>>>>>>>> Tsirkin <m...@redhat.com>
>>>>>>>>>> ---
>>>>>>>>>>
>>>>>>>>>> Notes:
>>>>>>>>>>     v6:
>>>>>>>>>>     - no changes, pick up Michael's R-b
>>>>>>>>>>
>>>>>>>>>>     v5:
>>>>>>>>>>     - replace the ICH9_LPC_SMI_F_BROADCAST bit value with the
>>>>>>>>>>       ICH9_LPC_SMI_F_BROADCAST_BIT bit position (necessary for
>>>>>>>>>>       DEFINE_PROP_BIT() in the next patch)
>>>>>>>>>>
>>>>>>>>>>  include/hw/i386/ich9.h |  3 +++
>>>>>>>>>>  hw/isa/lpc_ich9.c      | 10 +++++++++-
>>>>>>>>>>  2 files changed, 12 insertions(+), 1 deletion(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/include/hw/i386/ich9.h b/include/hw/i386/ich9.h
>>>>>>>>>> index da1118727146..18dcca7ebcbf 100644
>>>>>>>>>> --- a/include/hw/i386/ich9.h
>>>>>>>>>> +++ b/include/hw/i386/ich9.h
>>>>>>>>>> @@ -250,4 +250,7 @@ Object *ich9_lpc_find(void);
>>>>>>>>>>  #define ICH9_SMB_HST_D1 0x06
>>>>>>>>>>  #define ICH9_SMB_HOST_BLOCK_DB 0x07
>>>>>>>>>>
>>>>>>>>>> +/* bit positions used in fw_cfg SMI feature negotiation */
>>>>>>>>>> +#define ICH9_LPC_SMI_F_BROADCAST_BIT 0
>>>>>>>>>> +
>>>>>>>>>>  #endif /* HW_ICH9_H */
>>>>>>>>>> diff --git a/hw/isa/lpc_ich9.c b/hw/isa/lpc_ich9.c
>>>>>>>>>> index 376b7801a42c..ced6f803a4f2 100644
>>>>>>>>>> --- a/hw/isa/lpc_ich9.c
>>>>>>>>>> +++ b/hw/isa/lpc_ich9.c
>>>>>>>>>> @@ -437,7 +437,15 @@ static void ich9_apm_ctrl_changed(uint32_t val, void *arg)
>>>>>>>>>>
>>>>>>>>>>      /* SMI_EN = PMBASE + 30. SMI control and enable register */
>>>>>>>>>>      if (lpc->pm.smi_en & ICH9_PMIO_SMI_EN_APMC_EN) {
>>>>>>>>>> -        cpu_interrupt(current_cpu, CPU_INTERRUPT_SMI);
>>>>>>>>>> +        if (lpc->smi_negotiated_features &
>>>>>>>>>> +            (UINT64_C(1) << ICH9_LPC_SMI_F_BROADCAST_BIT)) {
>>>>>>>>>> +            CPUState *cs;
>>>>>>>>>> +            CPU_FOREACH(cs) {
>>>>>>>>>> +                cpu_interrupt(cs, CPU_INTERRUPT_SMI);
>>>>>>>>>> +            }
>>>>>>>>> Shouldn't CPUs with default SMI base be excluded from broadcast?
>>>>>>>>>
>>>>>>>>
>>>>>>>> I don't think so; as far as I recall, in edk2 the initial SMM entry
>>>>>>>> code -- before SMBASE relocation -- accommodates all VCPUs entering
>>>>>>>> the same code area for execution. The VCPUs are told apart from each
>>>>>>>> other by Initial APIC ID (that is, "who am I"), which is used as an
>>>>>>>> index or search criterion into a global shared array. Once they find
>>>>>>>> their respective private data blocks, the VCPUs don't step on each
>>>>>>>> other's toes.
>>>>>>> That's what I'm not sure of.
>>>>>>> If a broadcast SMI is sent before relocation, all CPUs will use
>>>>>>> the same SMBASE and as a result save their state in the same
>>>>>>> memory (even if it's the state after RESET/POWER ON), racing with and
>>>>>>> overwriting each other's saved state
>>>>>>
>>>>>> Good point!
>>>>>>
>>>>>>> and then on exit from SMI they all will restore
>>>>>>> whatever state that race produced, so it seems a plainly wrong thing
>>>>>>> to do.
>>>>>>>
>>>>>>> Are you sure that edk2 does broadcast for SMI initialization, or does
>>>>>>> it use directed SMIs?
>>>>>>
>>>>>> Hmmm, you are right. The SmmRelocateBases() function in
>>>>>> "UefiCpuPkg/PiSmmCpuDxeSmm/PiSmmCpuDxeSmm.c" first relocates SMBASE for
>>>>>> each individual AP in succession, by sending SMI IPIs to them, one by
>>>>>> one. Then it does the same for the BSP, by sending itself a similar SMI
>>>>>> IPI.
>>>>>>
>>>>>> The above QEMU code is only exercised much later, after the SMBASE
>>>>>> relocation, with the SMM stack up and running. It is used when a
>>>>>> (preferably broadcast) SMI is triggered for firmware services (for
>>>>>> example, the UEFI variable services).
>>>>>>
>>>>>> So, you are right.
>>>>>>
>>>>>> Considering edk2 only, the difference shouldn't matter -- when this
>>>>>> code is invoked (via an IO port write), the SMBASEs will all have been
>>>>>> relocated.
>>>>>>
>>>>>> Also, I've been informed that on real hardware, writing to APM_CNT
>>>>>> triggers an SMI on all CPUs, regardless of the individual SMBASE
>>>>>> values. So I believe the above code should be closest to real
>>>>>> hardware, and the pre-patch code was only written in unicast form for
>>>>>> SeaBIOS's sake.
>>>>>>
>>>>>> I think the code above is safe. If the guest injects an SMI via
>>>>>> APM_CNT after negotiating SMI broadcast, it should not have left any
>>>>>> VCPUs without an SMI handler by that time. edk2 / OVMF conforms to
>>>>>> this. In the future we can tweak this further, if necessary; we have
>>>>>> 63 more bits, and we'll be able to reject invalid combinations of
>>>>>> bits.
>>>>>>
>>>>>> Do you feel that we should skip VCPUs whose SMBASE has not been
>>>>>> changed from the default?
>>>>>>
>>>>>> If so, doesn't that run the risk of missing a VCPU that has an actual
>>>>>> SMI handler installed, and it just happens to be placed at the default
>>>>>> SMBASE location?
>>>>>>
>>>>>> ... I wouldn't mind skipping VCPUs with apparently unrelocated
>>>>>> SMBASEs, but I think (a) that's not what real hardware does, and (b)
>>>>>> it is risky if a VCPU's actual SMI handler has been installed while
>>>>>> keeping the default SMBASE value.
>>>>>>
>>>>>> What do you think?
>>>>> (a) probably doesn't consider hotplug, so it's totally fine for the
>>>>> case where firmware is the only one who wakes up/initializes CPUs.
>>>>> However, consider 2 CPUs hotplugged and then a broadcast SMI happening
>>>>> to serve some other task (if there isn't an unpark 'feature').
>>>>> Even if real hw does it, it's behavior not defined by the SDM, with
>>>>> possibly bad consequences.
>>>>>
>>>>> (b) alone is risky, so I'd skip based on a combination:
>>>>>
>>>>>     if (default SMBASE && CPU is in reset state)
>>>>>         skip;
>>>>>
>>>>> that should be safe, and it leaves open the possibility to avoid
>>>>> adding an unpark 'feature' to the CPU.
>>>>
>>>> The above attributes ("SMBASE" and "CPU is in reset state") are
>>>> specific to x86 VCPUs. Should I use some dynamic QOM casting for this?
>>>> Like the X86_CPU() macro?
>>>>
>>>> Can I assume here that the macro will never return NULL? (I think so,
>>>> LPC is only used with x86.)
>>> yep, that should work.
>>>
>>>>
>>>> ... I guess SMBASE can be accessed as
>>>>
>>>>     X86_CPU(cs)->env.smbase
>>> preferred style:
>>>     X86CPU *cpu = X86_CPU(cs);
>>>     cpu->...
>>>
>>>>
>>>> How do I figure out if the CPU is in reset state ("waiting for first
>>>> INIT") though? Note that this state should be distinguished from simply
>>>> "halted" (i.e., after a HLT). After a HLT, the SMI should be injected
>>>> and delivered; see commit a9bad65d2c1f. Edk2 regularly uses HLT to send
>>>> APs to sleep.
>>>
>>> how about using the EIP reset value?
>>>     x86_cpu_reset()
>>>         ...
>>>         env->eip = 0xfff0;
>>
>> Okay, so I tried this. It doesn't work.
>>
>> This is the code I wrote (extracting a new ich9_apm_broadcast_smi()
>> function from ich9_apm_ctrl_changed(), due to size reasons):
>>
>>> static void ich9_apm_broadcast_smi(void)
>>> {
>>>     CPUState *cs;
>>>
>>>     cpu_synchronize_all_states(); /* <--------- mark this */
> it probably should be executed after pause_all_vcpus(),
> maybe it will help.
>
>>>     CPU_FOREACH(cs) {
>>>         X86CPU *cpu = X86_CPU(cs);
>>>         CPUX86State *env = &cpu->env;
>>>
>>>         if (env->smbase == 0x30000 && env->eip == 0xfff0) {
>>>             CPUClass *k = CPU_GET_CLASS(cs);
>>>             uint64_t cpu_arch_id = k->get_arch_id(cs);
>>>
>>>             trace_ich9_apm_broadcast_smi_skip(cpu_arch_id);
>>>             continue;
>>>         }
>>>
>>>         cpu_interrupt(cs, CPU_INTERRUPT_SMI);
>>>     }
>>> }
>>
> [...]
>> (b) If I add the cpu_synchronize_all_states() call before the loop, then
>> the incorrect filter matches go away; QEMU sees the KVM VCPU states
>> correctly, and the SMI is broadcast.
>>
>> However, in this case, the boot slows to a *crawl*. It's unbearable.
>> All
>> VCPUs are pegged at 100%, and you can easily follow the OVMF debug log
>> with the naked eye, almost line by line.
>> I didn't expect that cpu_synchronize_all_states() would be so costly,
>> but it makes the idea of checking VCPU state in the APM_CNT handler a
>> non-starter.
> I wonder where the bottleneck is that slows down the guest so much.
> What is the actual cost of calling cpu_synchronize_all_states()?
>
> Maybe calling pause_all_vcpus() before cpu_synchronize_all_states()
> would help.
If I prepend just a pause_all_vcpus() function call, at the top, then the
VM freezes (quite understandably) when the first SMI is raised via
APM_CNT.

If I, instead, bracket the function body with pause_all_vcpus() and
resume_all_vcpus(), like this:

    static void ich9_apm_broadcast_smi(void)
    {
        CPUState *cs;

        pause_all_vcpus();
        cpu_synchronize_all_states();
        CPU_FOREACH(cs) {
            X86CPU *cpu = X86_CPU(cs);
            CPUX86State *env = &cpu->env;

            if (env->smbase == 0x30000 && env->eip == 0xfff0) {
                CPUClass *k = CPU_GET_CLASS(cs);
                uint64_t cpu_arch_id = k->get_arch_id(cs);

                trace_ich9_apm_broadcast_smi_skip(cpu_arch_id);
                continue;
            }

            cpu_interrupt(cs, CPU_INTERRUPT_SMI);
        }
        resume_all_vcpus();
    }

then I see the following symptoms:

- rather than 4 VCPUs being pegged at 100%, only one VCPU is pegged at
  100%

- the boot process, as observed from the OVMF log, is just as slow
  (= crawling) as without the VCPU pausing/resuming (i.e., like with
  cpu_synchronize_all_states() only); so ultimately the pausing/resuming
  doesn't help.

Thanks
Laszlo