On 25.09.20 16:53, Jan Kiszka wrote:
> On 25.09.20 16:44, Jan Kiszka wrote:
>> On 25.09.20 16:21, Jan Kiszka wrote:
>>> On 17.09.20 10:36, Oliver Schwartz wrote:
>>>>
>>>>
>>>>> On 17 Sep 2020, at 09:31, Jan Kiszka <[email protected]> wrote:
>>>>>
>>>>> On 17.09.20 09:16, Oliver Schwartz wrote:
>>>>>>> On 15 Sep 2020, at 11:00, Jan Kiszka <[email protected]
>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>>
>>>>>>> On 15.09.20 09:07, Oliver Schwartz wrote:
>>>>>>>> I’m currently trying out the arm64-zero-exits branch and got stuck.
>>>>>>>> System is a Xilinx ZU9EG on a custom board, similar to zcu102. I’ve
>>>>>>>> brought ATF up to date and patched it with Jans patch to enable SDEI.
>>>>>>>> If I don’t enable SDEI in ATF everything works as expected (with VM
>>>>>>>> exits for interrupts, of course). Jailhouse source is the tip of
>>>>>>>> branch arm64-zero-exits.
>>>>>>>> If I enable SDEI in ATF, jailhouse works most of the time, except for
>>>>>>>> when it doesn’t. Sometimes, ‘jailhouse enable’ results in:
>>>>>>>>> Initializing processors:
>>>>>>>>> CPU 1... OK
>>>>>>>>> CPU 0...
>>>>>>>>> /home/oliver/0.12-gitAUTOINC+98061469d0-r0/git/hypervisor/arch/arm64/setup.c:73:
>>>>>>>>> returning error -EIO
>>>>>>>
>>>>>>> Weird - that the SDEI event enable call.
>>>>>>>
>>>>>>>>> FAILED
>>>>>>>>> JAILHOUSE_ENABLE: Input/output error
>>>>>>>> I’ve seen this error only when I enable jailhouse through some init
>>>>>>>> script during the boot process, when the system is also busy
>>>>>>>> otherwise. When starting jailhouse on an idle system I haven’t seen
>>>>>>>> this.
>>>>>>>
>>>>>>> Possibly a regression of my recent refactoring which I didn't manage to
>>>>>>> test yet. Could you try if
>>>>>>>
>>>>>>> https://github.com/siemens/jailhouse/commits/e0ef829c85895dc6387d5ea11b08aa65a456255f
>>>>>>>
>>>>>>> was any better?
>>>>>>>
>>>>>>>> Sometimes it may hang later during ‘jailhouse enable’:
>>>>>>>>> Initializing processors:
>>>>>>>>> CPU 1... OK
>>>>>>>>> CPU 0... OK
>>>>>>>>> CPU 2... OK
>>>>>>>>> CPU 3... OK
>>>>>>>>> Initializing unit: irqchip
>>>>>>>>> Using SDEI-based management interrupt
>>>>>>>>> Initializing unit: ARM SMMU v3
>>>>>>>>> Initializing unit: PVU IOMMU
>>>>>>>>> Initializing unit: PCI
>>>>>>>>> Adding virtual PCI device 00:00.0 to cell "root"
>>>>>>>>> Page pool usage after late setup: mem 67/992, remap 5/131072
>>>>>>>>> Activating hypervisor
>>>>>>>>> [ 5.847540] The Jailhouse is opening.
>>>>>>>> Using a JTAG debugger I see that one or more cores are stuck in
>>>>>>>> hypervisor/arch/arm-common/psci.c, line 105.
>>>>>>>> It may also succeed in stopping one or more CPUs and then hang (again
>>>>>>>> with one or more cores stuck in psci.c, line 105):
>>>>>>>>> [ 5.810220] The Jailhouse is opening.
>>>>>>>>> [ 5.860054] CPU1: shutdown
>>>>>>>>> [ 5.862677] psci: CPU1 killed.
>>>>>> Now, with the first problem solved I’ve digged into the second one. It’s
>>>>>> actually a bit worse than in my initial description: If I just do
>>>>>> ‘jailhouse enable’ the system will always hang a few milliseconds after
>>>>>> the command completes - the only exception is when ‘jailhouse create’ is
>>>>>> executed immediately afterwards (which creates an inmate that uses 3 of
>>>>>> 4 CPU cores, leaving just one for Linux), which succeeds roughly on
>>>>>> every second try. I didn’t notice this initially because I usually start
>>>>>> jailhouse with a script that does ‘enable’ and ‘create’.
>>>>>> The reason for the hangs seems to be the psci emulation in Jailhouse, in
>>>>>> particular the CPU_SUSPEND calls. These are issued from my (Xilinx-)
>>>>>> kernel frequently if Linux has more than one core available. With SDEI
>>>>>> disabled the core can be woken up again by some interrupt. With SDEI
>>>>>> enabled, the core waits forever on the wfi intstruction. Because a
>>>>>> suspended core never wakes up again the whole system hangs at some point.
>>>>>> Any ideas why no interrupts are seen anymore in psci? My guess is that
>>>>>> it’s because the inmate (Linux) now has full control over the GIC, so it
>>>>>> may disable any interrupts before suspending a core, without Jailhouse
>>>>>> noticing. If this is the case, it may be necessary to re-enable the IRQs
>>>>>> before executing wfi. But I’m missing the big picture here - what
>>>>>> interrupt is the core waiting for in the first place? Any other thoughts?
>>>>>
>>>>> You likely found a bug in the SDEI feature of Jailhouse. The CPU_SUSPEND
>>>>> emulation assumes non-SDEI operation, i.e. interception of interrupts by
>>>>> the hypervisor, but that is not true in this mode.
>>>>>
>>>>> We need a way to wait for interrupts without actually receiving them when
>>>>> they arrive, but rather return to EL1 then. Maybe re-enabling
>>>>> interception, waiting, and then disabling it again before returning would
>>>>> do the trick. But then I also do not understand yet why
>>>>> https://github.com/bao-project/bao-hypervisor/blob/master/src/arch/armv8/psci.c
>>>>> gets away with wfi. Possibly, they run with interrupts on through the
>>>>> hypervisor, though that would not be straightforward either.
>>>>
>>>> The good news is that there’s an easy workaround, at least on my system:
>>>> disabling suspend calls before starting jailhouse
>>>> (echo 1 > /sys/devices//system/cpu/cpu<n>/cpuidle/state1/disable).
>>>>
>>>
>>> Seems the reason I was not seeing this so far is that my config [1] was
>>> lacking CONFIG_ARM_PSCI_CPUIDLE. Seeing it now as well, let's debug.
>>>
>>
>> My ideas seems to work (quick hack):
>>
>> diff --git a/hypervisor/arch/arm-common/psci.c
>> b/hypervisor/arch/arm-common/psci.c
>> index 6a9abf60..3bb3f6a8 100644
>> --- a/hypervisor/arch/arm-common/psci.c
>> +++ b/hypervisor/arch/arm-common/psci.c
>> @@ -101,6 +101,14 @@ long psci_dispatch(struct trap_context *ctx)
>>
>> case PSCI_0_2_FN_CPU_SUSPEND:
>> case PSCI_0_2_FN64_CPU_SUSPEND:
>> + if (sdei_available) {
>> + unsigned long hcr;
>> + arm_read_sysreg(HCR_EL2, hcr);
>> + arm_write_sysreg(HCR_EL2,
>> + hcr | HCR_IMO_BIT | HCR_FMO_BIT);
>> + asm volatile("wfi" : : : "memory");
>> + arm_read_sysreg(HCR_EL2, hcr);
^^^^^^
Argh...
>> + } else
>> if (!irqchip_has_pending_irqs()) {
>> asm volatile("wfi" : : : "memory");
>> irqchip_handle_irq();
>>
>> Now, if someone with more architectural knowledge than I could explain
>> why that's the case and if that will work on all platforms, with both
>> GICv2 and v3 (and maybe even v4), we could convert that into real patch.
>> Trying my luck on the CC list...
>>
>
> Nää, I was too quick: wfi works, i.e. the hypervisor is woken up on
> pending interrupts, but some more bits than simply clearing IMO/FMO in
> HCR are needed in order to forward that pending irq event to EL1 when
> returning to it.
>
With the above fixed to 'write', it now works fine here.
While trying to understand what goes wrong, I also read what happens in
the IMO/FMO-disabled case when an interrupt arrives while in EL2: It
does not count as wakeup event for wfi. Only if IMO/FMO are set (or
TGE), we are kicked out of wfi. And that's actually independent of the
GIC model.
What's once again unclear to me is if/where I need isb. Before the wfi?
Also after restoring HCR_EL2?
Thanks,
Jan
--
You received this message because you are subscribed to the Google Groups
"Jailhouse" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
To view this discussion on the web visit
https://groups.google.com/d/msgid/jailhouse-dev/1efb0fb2-8acf-0ebc-00bf-ce059c9ae833%40web.de.