On 7/5/19 2:36 PM, Jan Kiszka wrote:
> On 05.07.19 14:34, Ralf Ramsauer wrote:
>>
>>
>> On 7/5/19 8:55 AM, Jan Kiszka wrote:
>>> On 04.07.19 22:56, Ralf Ramsauer wrote:
>>>> On 7/4/19 5:24 PM, Jan Kiszka wrote:
>>>>> On 04.07.19 17:18, Ralf Ramsauer wrote:
>>>>>>
>>>>>>
>>>>>> On 7/4/19 4:39 PM, Jan Kiszka wrote:
>>>>>>> On 04.07.19 15:43, Ralf Ramsauer wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> we have some trouble starting non-root Linux on an AMD box. I
>>>>>>>> already
>>>>>>>> tried to narrow things down, but it raised several questions.
>>>>>>>>
>>>>>>>>
>>>>>>>> The main problem is that non-root Linux tries to write to LVT0,
>>>>>>>> and
>>>>>>>> jailhouse crashes with:
>>>>>>>>
>>>>>>>>       FATAL: Setting invalid LVT delivery mode (reg 35, value 00000700)
>>>>>>>>
>>>>>>>>
>>>>>>>> Turns out, in comparison to Intel x86, we don't trap on APIC
>>>>>>>> reads, we
>>>>>>>> only intercept APIC write on AMD (cf. svm.c:338). I thought this
>>>>>>>> would
>>>>>>>> be the issue of this bug, as that's an obvious difference between
>>>>>>>> Intel
>>>>>>>> and AMD: on VMX, we do trap xAPIC reads and writes. However, VMX
>>>>>>>> works
>>>>>>>> slightly differently in this regard (side note: [1]).
>>>>>>>>
>>>>>>>> xAPIC reads on AMD systems don't trap the hypervisor, so I
>>>>>>>> intercepted
>>>>>>>> reads (by removing the present bit of the XAPIC_PAGE of the
>>>>>>>> guest), and
>>>>>>>> forwarded the traps to the apic dispatcher (adjusted VMEXIT_NPF).
>>>>>>>>
>>>>>>>> I can confirm that we now trap reads as well as writes. But the
>>>>>>>> non-root
>>>>>>>> Linux still crashes with the same error.
>>>>>>>>
>>>>>>>> Digging a bit deeper, I found out that xAPIC reads are directly
>>>>>>>> forwarded to the hardware, if they were intercepted. So this
>>>>>>>> explains
>>>>>>>> why the bug still remains. This raised another question regarding
>>>>>>>> xAPIC
>>>>>>>> handling on Intel:
>>>>>>>>
>>>>>>>>       On AMD, we don't intercept xAPIC reads. On Intel, we do,
>>>>>>>> as we
>>>>>>>>       follow the strategy mentioned in [1]… But why?
>>>>>>>
>>>>>>> It accelerates write dispatching at least. I never did the
>>>>>>> comparison
>>>>>>> if using a different access scheme would be beneficial because
>>>>>>> xAPIC is
>>>>>>> practically dead on Intel.
>>>>>>
>>>>>> Hmm... The change and benchmark should be pretty easy. Once a
>>>>>> bunch of
>>>>>> other issues is solved, I'll maybe have a look at this.
>>>>>>
>>>>>
>>>>> As I said: you will optimize a legacy code path, not practically
>>>>> relevant. If that will simplify the code, though, I might still be
>>>>> interested :).
>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>       Wouldn't it be more performant to just trap on xAPIC
>>>>>>>> writes on
>>>>>>>>       Intel? This could be done by switching from APIC_ACCESS
>>>>>>>> interception
>>>>>>>>       to simple write-only trap & emulate (page faults).
>>>>>>>>
>>>>>>>> However, back to the initial issue. Looks like the difference
>>>>>>>> between
>>>>>>>> Intel and AMD boot is as follows.
>>>>>>>>
>>>>>>>> AMD:
>>>>>>>> [    0.325578] Switched APIC routing to physical flat.
>>>>>>>> [    0.366464] enabled ExtINT on CPU#0
>>>>>>>>
>>>>>>>> Intel:
>>>>>>>> [    0.099486] Switched APIC routing to physical flat.
>>>>>>>> [    0.113000] masked ExtINT on CPU#0
>>>>>>>>
>>>>>>>>
>>>>>>>> This is why the above-mentioned Jailhouse crash occurs. I tried to
>>>>>>>> find
>>>>>>>> out why Linux takes this decision on AMD. Our victim is in
>>>>>>>> apic.c:1587.
>>>>>>>>
>>>>>>>> On Intel, apic_read(LVT0) & APIC_LVT_MASKED results in 65536, on
>>>>>>>> AMD it
>>>>>>>> is 0. This is why we take a different path.
>>>>>>>>
>>>>>>>> Now the question is simple -- why? :-)
>>>>>>>>
>>>>>>>> Are we just lacking ExtINT delivery mode in Jailhouse, or is
>>>>>>>> anything
>>>>>>>> else odd?
>>>>>>>
>>>>>>> Yes, the ExtINT makes no sense for secondary cells, and it should
>>>>>>> also
>>>>>>> not be needed for primary ones. Let's dig deeper:
>>>>>>>
>>>>>>> value = apic_read(APIC_LVT0) & APIC_LVT_MASKED;
>>>>>>> if (!cpu && (pic_mode || !value || skip_ioapic_setup)) {
>>>>>>>        value = APIC_DM_EXTINT;
>>>>>>>        apic_printk(APIC_VERBOSE, "enabled ExtINT on CPU#%d\n", cpu);
>>>>>>>
>>>>>>> What are the values here, and which are different?
>>>>>>
>>>>>> As already mentioned above, only value differs:
>>>>>>
>>>>>>>> On Intel, apic_read(LVT0) & APIC_LVT_MASKED results in 65536, on
>>>>>>>> AMD
>>>>>>>> it is 0. This is why we take a different path.
>>>>>>
>>>>>> cpu, pic_mode and skip_ioapic_setup are 0 on both machines.
>>>>>
>>>>> Ah, ok. Then you need to find the evil guy unmasking LVT0 before that.
>>>>> Can't be Jailhouse: we hand it over masked.
>>>>
>>>> Yes, I checked this. Actually we do. But...
>>>>
>>>> When the cell is created after jailhouse is enabled, apic_clear() will
>>>> be called when the SIPI is received. There, I added some
>>>> instrumentation. At that moment, LVT0 holds (and keeps) 0x10000.
>>>>
>>>> In addition to that, I instrumented the linux-loader. There, I read
>>>> back
>>>> LVT0. Very early, before we hand over to Linux. No one else touches
>>>> LVT0
>>>> in the meanwhile. I would see any other guest access as interceptions
>>>> are instrumented (both, read and write).
>>>>
>>>> So in the linux-loader, the read back causes a vmexit, and I read back
>>>> 0x0.  That's really strange, there is - afaict - no other access in the
>>>> meanwhile.
>>>>
>>>> I don't know what's going on there. I don't see any other modifications
>>>> of LVT registers in code paths other than apic_clear().
>>>
>>> Maybe you can lift the setup into KVM and check if you can reproduce
>>> there as well. That will allow you to track down the other access that does
>>> the enabling. It shouldn't be possible that the hardware does that on
>>> its own.
>>
>> Tried to run Jailhouse on QEMU on an AMD machine with nested KVM.
>>
>> I currently see no way to test this on qemu, as Jailhouse seems to be
>> pretty unstable. We horribly crash in many situations on kvm:
>>
>>   - High chance of freezes when enabling jailhouse
>>   - I lose devices if I don't reroute interrupts to CPU0 before I
>>     create cells
>>   - cell destroy doesn't work. We freeze and after a while: "Ignoring NMI
>>     IPI to CPU 1"
>>   - Starting causes exceptions inside jailhouse
>>
>> So Jailhouse definitely runs more stable on bare-metal than on qemu/SVM.
>> I need to find another way to debug this.
> 
> OK...
> 
> Next strategy: Frequent read-back and validation of the APIC state. That
> may help to narrow down the point where the bit flips. Make sure you
> read on the right CPU, though.

Finally…

I found the evil guy. It's inside apic_clear. The last call to the xapic:

apic.c @ apic_clear
        /* Finally, reset the TPR again and disable the APIC */
        apic_ops.write(APIC_REG_TPR, 0);
        apic_ops.write(APIC_REG_SVR, 0xff);

Disabling the xAPIC via APIC_REG_SVR will reset LVT0 and others to zero.
It seems we must not disable the xAPIC but hand it over enabled;
otherwise the guest will read erroneous initial values.

Commenting out the last line solves the issue. We can now even boot on
multiple CPUs - everything seems to be fine so far.

At least on our machine,
  - disabling the xAPIC clears LVT* registers
  - any write to LVT* will be ignored as long as SVR is disabled
  - If SVR is re-enabled, LVT* will still hold 0, whatever was written
    to it before it was disabled

  Ralf

> 
> Jan
> 
