On 10/21/15 14:10, Laszlo Ersek wrote:
> On 10/15/15 00:25, Laszlo Ersek wrote:
> 
>> Test environment and results:
>>
>> Host kernel:
>> - latest RHEL-7 development kernel (3.10.0-323.el7), with Paolo's
>>   following patches backported by yours truly:
>>   - KVM: x86: clean up kvm_arch_vcpu_runnable
>>   - KVM: x86: fix SMI to halted VCPU
>>
>> QEMU:
>> - current upstream (c49d3411faae), with Paolo's patch applied:
>>   - target-i386: allow any alignment for SMBASE
>>
>> Below, the meaning of "bitness=32" is:
>> * qemu-system-i386
>> * -cpu coreduo,-nx
>>
>> Whereas "bitness=64" means:
>> * qemu-system-x86_64
>> * no special -cpu flag
>>
>> For variable access verification, "efibootmgr" is invoked (without
>> options) at the guest OS (Fedlet 20141209) root prompt.
>>
>>   bitness  accel  VCPUs  result
>>   -------  -----  -----  -----------------------------------------------
>>   32       KVM    1      Fedlet 20141209 boots, S3 works, variables work
>>
>>   32       KVM    2      stuck in SMBASE relocation, APIC IDs look valid
> 
> Alright, so I've dug into this. It's very interesting.
> 
> First, here's the debug patch for edk2:
> 
> -------------
> diff --git a/UefiCpuPkg/PiSmmCpuDxeSmm/PiSmmCpuDxeSmm.c 
> b/UefiCpuPkg/PiSmmCpuDxeSmm/PiSmmCpuDxeSmm.c
> index 0e39173..bcfa075 100644
> --- a/UefiCpuPkg/PiSmmCpuDxeSmm/PiSmmCpuDxeSmm.c
> +++ b/UefiCpuPkg/PiSmmCpuDxeSmm/PiSmmCpuDxeSmm.c
> @@ -442,11 +442,15 @@ SmmRelocateBases (
>    for (Index = 0; Index < mNumberOfCpus; Index++) {
>      mRebased[Index] = FALSE;
>      if (ApicId != (UINT32)gSmmCpuPrivate->ProcessorInfo[Index].ProcessorId) {
> +      DEBUG ((EFI_D_VERBOSE, "%a: sending SMI IPI to APIC ID 0x%Lx\n",
> +        __FUNCTION__, gSmmCpuPrivate->ProcessorInfo[Index].ProcessorId));
>        SendSmiIpi ((UINT32)gSmmCpuPrivate->ProcessorInfo[Index].ProcessorId);
>        //
>        // Wait for this AP to finish its 1st SMI
>        //
>        while (!mRebased[Index]);
> +      DEBUG ((EFI_D_VERBOSE, "%a: APIC ID 0x%Lx has processed its first 
> SMI\n",
> +        __FUNCTION__, gSmmCpuPrivate->ProcessorInfo[Index].ProcessorId));
>      } else {
>        //
>        // BSP will be Relocated later
> -------------
> 
> As one can expect, the first message appears in the log:
> 
> ------------
> SMRAM TileSize = 00000800
> CPU[000]  APIC ID=0000  SMBASE=7FFC1000  SaveState=7FFD0C00  Size=00000400
> CPU[001]  APIC ID=0001  SMBASE=7FFC1800  SaveState=7FFD1400  Size=00000400
> SmmRelocateBases: sending SMI IPI to APIC ID 0x1
> ------------
> 
> but the second message doesn't; the (!mRebased[Index]) condition
> never evaluates to false, so the loop is never exited.
> 
> Second, I sought to analyze the KVM trace very carefully, against the
> SendSmiIpi() source code in edk2, and against the KVM source code.
> Here comes the kicker: KVM interprets the APIC ICR (high, low) writes
> correctly, injects the SMI, VCPU#1 wakes and enters SMM (!), then
> leaves SMM with a relocated SMBASE field (!!!).
> 
> *However*, according to the KVM trace, the relocated SMBASE field is
> *wrong* -- the value being reported below, 0x7ffc1000, corresponds to
> CPU#0 above!
> 
> ------------
>  qemu-system-i38-22085 [000] 13634.057590: kvm_enter_smm:        vcpu 1: 
> leaving SMM, smbase 0x7ffc1000
> ------------
> 
> Then VCPU#1 goes on to do various things (I'm too lazy to analyze all
> those trace entries), but ultimately it reaches a HLT. And the busy
> wait in SmmRelocateBases() never completes, because vcpu #1 seems to
> have looked at VCPU#0's area.
> 
> Given that this works with TCG, I *guess* it is either a KVM bug, or
> some visibility race. I'll have to look at more.

I added more debug messages to edk2, and compared the logs generated
when running on KVM against those generated when running on TCG. The
point where things go sideways in edk2 is this:

The SmmInitHandler() function
[UefiCpuPkg/PiSmmCpuDxeSmm/PiSmmCpuDxeSmm.c] has the following leading
comment:

"C function for SMI handler. To change all processor's SMMBase Register."

It is called from the assembly language entry point code, in System
Management Mode, for the purposes of initial SMBASE relocation.

In this function, the processor that executes the function identifies
itself by calling GetApicId(). The retrieved APIC ID is located within
the ProcessorInfo array (in order to map the APIC ID to a CPU index),
and then the rest of the relocation code uses this CPU index, for poking
into the SMBASE and SaveState "tiles".

Now, for VCPU#1, GetApicId() returns APIC ID 0x01 on TCG, and 0x00 on
KVM. Things go downhill from there.

The GetApicId() function
[UefiCpuPkg/Library/BaseXApicLib/BaseXApicLib.c] does two (interesting)
things: it checks the initial APIC ID first (with a CPUID instruction),
then -- because the former is under 0x100 -- it does:

    //
    // If the initial local APIC ID is less 0x100, read APIC ID from
    // XAPIC_ID_OFFSET, otherwise return the initial local APIC ID.
    //
    ApicId = ReadLocalApicReg (XAPIC_ID_OFFSET);
    ApicId >>= 24;

Clearly this condition holds for both TCG and KVM, so the difference is
within

  ReadLocalApicReg (XAPIC_ID_OFFSET)

(that register is reprogrammable).

I made an effort to track the path on which the APIC ID register is set
in QEMU. It's a dizzying chain of callbacks and property setters that
bifurcates at some point between TCG and KVM. The common part seems to be

pc_new_cpu() [hw/i386/pc.c]
  object_property_set_int("apic-id")
    x86_cpuid_set_apic_id() [target-i386/cpu.c]
      // sets cpu->apic_id

x86_cpu_realizefn() [target-i386/cpu.c]
  x86_cpu_apic_create()
    // if running on TCG:
    // create a device of type "apic", and:
    qdev_prop_set_uint8(cpu->apic_state, "id", cpu->apic_id);

    // if running on KVM:
    // create a device of type "kvm-apic", and do the same

Now, on TCG, reading the APIC ID register (for device "apic") happens in:

apic_mem_readl() [hw/intc/apic.c]
  val = s->id << 24

Whereas on KVM, the same occurs in:

kvm_apic_mem_read() [hw/i386/kvm/apic.c]
  return ~(uint64_t)0;

However, such reads don't seem to reach QEMU (and the above read stub);
they are handled within KVM (I don't know where the distinction is made
in KVM).

(Side note: I tried kernel_irqchip=off as well, but it drives QEMU into
such a loop that I cannot even stop QEMU with SIGTERM -- only SIGKILL
works.)

Also, KVM-specific initialization exists too:

kvm_init_vcpu() [kvm-all.c]
  kvm_arch_vcpu_id() [target-i386/kvm.c]
    // notice, we're using cpu->apic_id
    return cpu->apic_id

  kvm_vm_ioctl(KVM_CREATE_VCPU, ...)

In other words, the initial KVM_CREATE_VCPU ioctl does get the right
APIC ID. In the kernel,

kvm_vm_ioctl() [virt/kvm/kvm_main.c]
  kvm_arch_vcpu_create() [arch/x86/kvm/x86.c]
    vmx_create_vcpu() [arch/x86/kvm/vmx.c]
      kvm_vcpu_init() [virt/kvm/kvm_main.c]
        vcpu->vcpu_id = id;

        kvm_arch_vcpu_init() [arch/x86/kvm/x86.c]
          kvm_create_lapic() [arch/x86/kvm/lapic.c]
            kvm_lapic_reset()
              kvm_apic_set_id(apic, vcpu->vcpu_id)
                apic_set_reg(APIC_ID)
                  *((u32 *) (apic->regs + reg_off)) = val;

So, wherever KVM decides not to hand off the later APIC ID reads to
userspace, it *could* provide the right APIC ID, but it doesn't, in
spite of the following read call tree:

apic_mmio_read() [arch/x86/kvm/lapic.c]
  apic_reg_read()
    __apic_read()
      kvm_apic_id()
        kvm_apic_get_reg(APIC_ID) [arch/x86/kvm/lapic.h]
         return *((u32 *) (apic->regs + reg_off));

We have initialization and read paths that look correct, why doesn't it
work? I'm thinking because of problems with reset. Namely, the following
upstream kernel commit has not been backported to RHEL-7 yet (as of
3.10.0-325.el7):

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=d28bc9dd25ce

Note the following:

> "The state of the local APIC following an INIT reset is the same as it
> is after a power-up or hardware reset, except that the APIC ID and
> arbitration ID registers are not affected." [10.4.7.3: Local APIC
> State After an INIT Reset ("Wait-for-SIPI" State)]

This upstream kernel commit makes the APIC ID register survive untouched
on INIT. Still, without that patch in place, the APIC ID register should
be re-set to its *initial* value, which should be correct...

Hm... Maybe scratch all of the above? Looking more closely at the trace:

---------
kvm_enter_smm:    vcpu 1: entering SMM, smbase 0x30000
...
kvm_entry:        vcpu 1
kvm_exit:         reason EPT_VIOLATION rip 0x7ffdcf46 info 181 0
kvm_page_fault:   address fee00020 error_code 181
kvm_exit:         reason EXTERNAL_INTERRUPT rip 0x7ffd2df1 info 0
                    800000ef
...
kvm_enter_smm:    vcpu 1: leaving SMM, smbase 0x7ffc1000
...
kvm_entry:        vcpu 1
kvm_exit:         reason APIC_ACCESS rip 0x7fd8cd09 info 20 0
kvm_emulate_insn: 0:7fd8cd09: 8b 1b
kvm_apic:         apic_read APIC_ID = 0x1000000
kvm_mmio:         mmio read len 4 gpa 0xfee00020 val 0x1000000
kvm_entry:        vcpu 1
---------

It looks like the APIC_ID read (at mmio 0xfee00020)  *within* SMM fails
with EPT_VIOLATION, without apic_read actually happening. However, soon
after SMM is left, a similar APIC ID read occurs, and *that* succeeds.

... I disabled EPT, but the issue persists... :(

Thanks
Laszlo
_______________________________________________
edk2-devel mailing list
[email protected]
https://lists.01.org/mailman/listinfo/edk2-devel

Reply via email to