On 7/5/19 4:04 PM, Jan Kiszka wrote:
> On 05.07.19 15:54, Ralf Ramsauer wrote:
>>
>>
>> On 7/5/19 2:36 PM, Jan Kiszka wrote:
>>> On 05.07.19 14:34, Ralf Ramsauer wrote:
>>>>
>>>>
>>>> On 7/5/19 8:55 AM, Jan Kiszka wrote:
>>>>> On 04.07.19 22:56, Ralf Ramsauer wrote:
>>>>>> On 7/4/19 5:24 PM, Jan Kiszka wrote:
>>>>>>> On 04.07.19 17:18, Ralf Ramsauer wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 7/4/19 4:39 PM, Jan Kiszka wrote:
>>>>>>>>> On 04.07.19 15:43, Ralf Ramsauer wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> we have some trouble starting non-root Linux on an AMD box. I
>>>>>>>>>> already tried to narrow things down, but it raised several
>>>>>>>>>> questions.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> The main problem is that non-root Linux tries to write to LVT0,
>>>>>>>>>> and Jailhouse crashes with:
>>>>>>>>>>
>>>>>>>>>> FATAL: Setting invalid LVT delivery mode (reg 35, value 00000700)
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Turns out that, in comparison to Intel x86, we don't trap on APIC
>>>>>>>>>> reads; we only intercept APIC writes on AMD (cf. svm.c:338). I
>>>>>>>>>> thought this would be the cause of this bug, as that's an obvious
>>>>>>>>>> difference between Intel and AMD: on VMX, we do trap xAPIC reads
>>>>>>>>>> and writes. However, VMX works slightly differently in this
>>>>>>>>>> regard (side note: [1]).
>>>>>>>>>>
>>>>>>>>>> xAPIC reads on AMD systems don't trap to the hypervisor, so I
>>>>>>>>>> intercepted reads (by removing the present bit of the XAPIC_PAGE
>>>>>>>>>> of the guest), and forwarded the traps to the apic dispatcher
>>>>>>>>>> (adjusted VMEXIT_NPF).
>>>>>>>>>>
>>>>>>>>>> I can confirm that we now trap reads as well as writes. But the
>>>>>>>>>> non-root Linux still crashes with the same error.
>>>>>>>>>>
>>>>>>>>>> Digging a bit deeper, I found out that xAPIC reads are forwarded
>>>>>>>>>> directly to the hardware when they are intercepted. So this
>>>>>>>>>> explains why the bug still remains. This raised another question
>>>>>>>>>> regarding xAPIC handling on Intel:
>>>>>>>>>>
>>>>>>>>>> On AMD, we don't intercept xAPIC reads. On Intel, we do, as we
>>>>>>>>>> follow the strategy mentioned in [1]… But why?
>>>>>>>>>
>>>>>>>>> It accelerates write dispatching at least. I never did the
>>>>>>>>> comparison if using a different access scheme would be beneficial
>>>>>>>>> because xAPIC is practically dead on Intel.
>>>>>>>>
>>>>>>>> Hmm... The change and benchmark should be pretty easy. Once a
>>>>>>>> bunch of other issues is solved, I'll maybe have a look at this.
>>>>>>>>
>>>>>>>
>>>>>>> As I said: you will optimize a legacy code path, not practically
>>>>>>> relevant. If that will simplify the code, though, I might still be
>>>>>>> interested :).
>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Wouldn't it be more performant to just trap on xAPIC writes on
>>>>>>>>>> Intel? This could be done by switching from APIC_ACCESS
>>>>>>>>>> interception to simple write-only trap & emulate (page faults).
>>>>>>>>>>
>>>>>>>>>> However, back to the initial issue. Looks like the difference
>>>>>>>>>> between Intel and AMD boot is as follows.
>>>>>>>>>>
>>>>>>>>>> AMD:
>>>>>>>>>> [ 0.325578] Switched APIC routing to physical flat.
>>>>>>>>>> [ 0.366464] enabled ExtINT on CPU#0
>>>>>>>>>>
>>>>>>>>>> Intel:
>>>>>>>>>> [ 0.099486] Switched APIC routing to physical flat.
>>>>>>>>>> [ 0.113000] masked ExtINT on CPU#0
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> This is why the above-mentioned Jailhouse crash occurs. I tried
>>>>>>>>>> to find out why Linux takes this decision on AMD. Our victim is
>>>>>>>>>> in apic.c:1587.
>>>>>>>>>>
>>>>>>>>>> On Intel, apic_read(LVT0) & APIC_LVT_MASKED results in 65536, on
>>>>>>>>>> AMD it is 0. This is why we take a different path.
>>>>>>>>>>
>>>>>>>>>> Now the question is simple -- why? :-)
>>>>>>>>>>
>>>>>>>>>> Are we just lacking ExtINT delivery mode in Jailhouse, or is
>>>>>>>>>> anything else odd?
>>>>>>>>>
>>>>>>>>> Yes, the ExtINT makes no sense for secondary cells, and it should
>>>>>>>>> also not be needed for primary ones. Let's dig deeper:
>>>>>>>>>
>>>>>>>>>     value = apic_read(APIC_LVT0) & APIC_LVT_MASKED;
>>>>>>>>>     if (!cpu && (pic_mode || !value || skip_ioapic_setup)) {
>>>>>>>>>         value = APIC_DM_EXTINT;
>>>>>>>>>         apic_printk(APIC_VERBOSE, "enabled ExtINT on CPU#%d\n", cpu);
>>>>>>>>>
>>>>>>>>> What are the values here, and which are different?
>>>>>>>>
>>>>>>>> As already mentioned above, only value differs:
>>>>>>>>
>>>>>>>>>> On Intel, apic_read(LVT0) & APIC_LVT_MASKED results in 65536, on
>>>>>>>>>> AMD it is 0. This is why we take a different path.
>>>>>>>>
>>>>>>>> cpu, pic_mode and skip_ioapic_setup are 0 on both machines.
>>>>>>>
>>>>>>> Ah, ok. Then you need to find the evil guy unmasking LVT0 before
>>>>>>> that. Can't be Jailhouse: we hand it over masked.
>>>>>>
>>>>>> Yes, I checked this. Actually we do. But...
>>>>>>
>>>>>> When the cell is created after jailhouse is enabled, apic_clear()
>>>>>> will be called when the SIPI is received. There, I added some
>>>>>> instrumentation. At that moment, LVT0 holds (and keeps) 0x10000.
>>>>>>
>>>>>> In addition to that, I instrumented the linux-loader. There, I read
>>>>>> back LVT0. Very early, before we hand over to Linux. No one else
>>>>>> touches LVT0 in the meantime. I would see any other guest access, as
>>>>>> interceptions are instrumented (both read and write).
>>>>>>
>>>>>> So in the linux-loader, the read back causes a vmexit, and I read
>>>>>> back 0x0. That's really strange, there is - afaict - no other access
>>>>>> in the meantime.
>>>>>>
>>>>>> I don't know what's going on there. I don't see any other
>>>>>> modifications of LVT registers in code paths other than apic_clear().
>>>>>
>>>>> Maybe you can lift the setup into KVM and check if you can reproduce
>>>>> there as well. That will allow you to track down the other access
>>>>> that does the enabling. It shouldn't be possible that the hardware
>>>>> does that on its own.
>>>>
>>>> Tried to run Jailhouse on QEMU on an AMD machine with nested KVM.
>>>>
>>>> I currently see no way to test this on qemu, as Jailhouse seems to be
>>>> pretty unstable. We horribly crash in many situations on kvm:
>>>>
>>>> - High chance of freezes when enabling jailhouse
>>>> - I lose devices if I don't reroute interrupts to CPU0 before I
>>>> create cells
>>>> - cell destroy doesn't work. We freeze and after a while:
>>>> "Ignoring NMI IPI to CPU 1"
>>>> - Starting causes exceptions inside jailhouse
>>>>
>>>> So Jailhouse definitely runs more stable on bare-metal than on
>>>> qemu/SVM. I need to find another way to debug this.
>>>
>>> OK...
>>>
>>> Next strategy: Frequent read-back and validation of the APIC state. That
>>> may help to narrow down the point where the bit flips. Make sure you
>>> read on the right CPU, though.
>>
>> Finally…
>>
>> I found the evil guy. It's inside apic_clear. The last call to the xapic:
>>
>> apic.c @ apic_clear
>> /* Finally, reset the TPR again and disable the APIC */
>> apic_ops.write(APIC_REG_TPR, 0);
>> apic_ops.write(APIC_REG_SVR, 0xff);
>>
>> Disabling the xAPIC via APIC_REG_SVR will reset LVT0 and others to zero.
>
> What?!?
Yep.
That's my instrumentation:
diff --git a/hypervisor/arch/x86/apic.c b/hypervisor/arch/x86/apic.c
index 7f51b062..d88ee237 100644
--- a/hypervisor/arch/x86/apic.c
+++ b/hypervisor/arch/x86/apic.c
@@ -340,7 +340,12 @@ void apic_clear(void)
 	/* Finally, reset the TPR again and disable the APIC */
 	apic_ops.write(APIC_REG_TPR, 0);
-	apic_ops.write(APIC_REG_SVR, 0xff);
+
+	printk("Before disabling: %x\n", apic_ops.read(APIC_REG_LVT0));
+	apic_ops.write(APIC_REG_SVR, 0xff);
+	printk("After disabling: %x\n", apic_ops.read(APIC_REG_LVT0));
+	apic_ops.write(APIC_REG_SVR, APIC_SVR_ENABLE_APIC | 0xff);
+	printk("After reenabling: %x\n", apic_ops.read(APIC_REG_LVT0));
 }
 static bool apic_valid_ipi_mode(u32 lo_val)
And here is jailhouse output:
Created cell "linux-x86-demo"
Page pool usage after cell creation: mem 280/979, remap 16519/131072
Cell "linux-x86-demo" can be loaded
CPU 3 received SIPI, vector 100
Started cell "linux-x86-demo"
Before disabling: 10000
After disabling: 0
After reenabling: 0
[...]
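
For reference, a minimal standalone model of how that LVT0 reading feeds
into the Linux check quoted earlier in the thread. This is only a sketch:
lvt0_value_to_write() and demo_thread_values() are made-up helpers, not
kernel functions. The two constants are the values from Linux's apicdef.h,
and the "masked" else-branch is not quoted in this thread, so treat it as
an assumption about the kernel source.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Values as defined in Linux's arch/x86/include/asm/apicdef.h */
#define APIC_LVT_MASKED (1u << 16)
#define APIC_DM_EXTINT  0x00700u

/*
 * Hypothetical helper (not a kernel function): models which value the
 * quoted check ends up writing to LVT0, given what the guest read back.
 * The else-branch is an assumption, it is not quoted in the thread.
 */
static uint32_t lvt0_value_to_write(uint32_t lvt0_read, int cpu,
                                    int pic_mode, int skip_ioapic_setup)
{
	uint32_t value = lvt0_read & APIC_LVT_MASKED;

	if (!cpu && (pic_mode || !value || skip_ioapic_setup))
		return APIC_DM_EXTINT;                   /* "enabled ExtINT" */
	return APIC_DM_EXTINT | APIC_LVT_MASKED;         /* "masked ExtINT" */
}

/* Runs the model for the two LVT0 readings reported in this thread. */
static void demo_thread_values(void)
{
	uint32_t masked = lvt0_value_to_write(0x10000, 0, 0, 0);
	uint32_t cleared = lvt0_value_to_write(0x0, 0, 0, 0);

	/* A cleared mask bit leads to the unmasked ExtINT write. */
	assert(cleared == APIC_DM_EXTINT);
	printf("LVT0 read 0x10000 -> write 0x%05x\n", (unsigned)masked);
	printf("LVT0 read 0x00000 -> write 0x%05x\n", (unsigned)cleared);
}
```

With LVT0 reading back as 0 on CPU 0, the model yields 0x700, which matches
the value in the FATAL message above.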
>
> "The ASE bit when set to 0 disables the local APIC temporarily. When the
> local APIC is disabled, SMI, NMI, INIT, Startup, Remote Read, and LINT
> interrupts may be accepted; pending interrupts in the ISR and IRR are
> held, but further fixed, lowest-priority, and ExtInt interrupts are not
> accepted. All LVT entry mask bits are set and cannot be cleared."
>
> If that is not true for your hardware, it does not conform to its own spec.
What can I say, it's not the first time that hardware doesn't conform to
specs.
>
>> Seems we must not disable the xAPIC but hand it over enabled, because
>> otherwise the guest will read erroneous initial values.
>>
>> Commenting out the last line solves the issue. We can now even boot on
>> multiple CPUs - everything seems to be fine so far.
>>
>> At least on our machine,
>> - disabling the xAPIC clears LVT* registers
>> - any write to LVT* will be ignored as long as SVR is disabled
>> - If SVR is re-enabled, LVT* will still hold 0, whatever was written
>> to it before it was disabled
>
> What a mess. The problem is we try to emulate the specified reset state
> of the APIC here. And this is SVR = 0xff, LVT = masked.
A quickfix on my side is to hand over the xAPIC enabled.
A proper fix requires being sensitive to register reads while the
APIC is disabled. Horrible...
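
A rough sketch of what "sensitive to reads" could look like, purely as an
illustration. Nothing here is existing Jailhouse code: struct vapic_shadow
and the vapic_* helpers are hypothetical names. It assumes the hardware
xAPIC stays enabled, that the guest-visible SVR is shadowed in software,
and that reads are intercepted at all (which, as noted earlier in the
thread, is currently not the case on AMD). Register numbers are MMIO
offset >> 4, matching "reg 35" for LVT0 in the error message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* xAPIC register numbers (MMIO offset >> 4) and relevant bits */
#define REG_SVR     0x0fu
#define REG_LVTT    0x32u        /* first LVT register handled here */
#define REG_LVTERR  0x37u        /* last LVT register handled here */
#define SVR_ENABLE  (1u << 8)
#define LVT_MASKED  (1u << 16)

/* Hypothetical per-CPU shadow state; not an existing Jailhouse structure. */
struct vapic_shadow {
	uint32_t guest_svr;      /* SVR as the guest believes it to be */
};

/* Specified reset state: APIC software-disabled, spurious vector 0xff. */
static void vapic_shadow_reset(struct vapic_shadow *s)
{
	assert(s != NULL);
	s->guest_svr = 0xff;
}

/* Remember guest writes to SVR so that reads can be filtered later. */
static void vapic_track_write(struct vapic_shadow *s, unsigned int reg,
                              uint32_t value)
{
	assert(s != NULL);
	if (reg == REG_SVR)
		s->guest_svr = value;
}

/*
 * Adjust a value read from hardware before it is returned to the guest:
 * while the guest-visible APIC is disabled, LVT entries read as masked.
 */
static uint32_t vapic_filter_read(const struct vapic_shadow *s,
                                  unsigned int reg, uint32_t hw_value)
{
	assert(s != NULL);
	if (reg == REG_SVR)
		return s->guest_svr;
	if (!(s->guest_svr & SVR_ENABLE) &&
	    reg >= REG_LVTT && reg <= REG_LVTERR)
		return hw_value | LVT_MASKED;
	return hw_value;
}
```

In this model, a guest reading LVT0 right after handover would see 0x10000
even if the hardware register holds 0, and once the guest sets the enable
bit in SVR, reads pass through unmodified.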
Ralf
>
> Jan
>