On 29.09.21 11:48, Martin Kaistra wrote:
> On 29.09.21 at 11:12, Jan Kiszka wrote:
>> On 29.09.21 09:50, Martin Kaistra wrote:
>>> On 28.09.21 at 11:28, Jan Kiszka wrote:
>>>> On 27.09.21 10:30, Martin Kaistra wrote:
>>>>> On 24.09.21 at 17:46, Jan Kiszka wrote:
>>>>>>
>>>>>> If suspend_cpu() does not progress, the target CPU is not reacting
>>>>>> properly on the request to leave the guest and service the Jailhouse
>>>>>> commands. Could be that your interrupts are not handled properly. Run
>>>>>> "jailhouse config check" on your setup, maybe you are passing the
>>>>>> interrupt controller through.
>>>>>>
>>>>>> Or are you using SDEI-based management interrupts? Would require a
>>>>>> special TF-A version, so likely does not happen "by chance".
>>>>>>
>>>>>> Jan
>>>>>>
>>>>>
>>>>> Hi Jan,
>>>>>
>>>>> "jailhouse config check" finds no problems with the root cell and
>>>>> inmate configs.
>>>>> Also, SDEI is not active. gicv2_send_sgi() is being used.
>>>>>
>>>>
>>>> Then it would be good to continue debugging, now trying to understand
>>>> what the target CPU is doing.
>>>>
>>>> The CPU that requests the suspend sets suspend_cpu in the target data
>>>> structure, then sends an IPI to that CPU and waits for the other side to
>>>> confirm this via setting cpu_suspended. Check if the target CPU received
>>>> the IPI, left the guest mode or what else it does by instrumenting the
>>>> related code paths (check_events on arm64).
>>>>
>>>> Jan
>>>>
>>>
>>> In the runs without a freeze, I can see that after cpu0 calls
>>> arch_send_event() -> gicv2_send_sgi() from suspend_cpu(), cpu1 runs
>>> irqchip_handle_irq() -> arch_handle_sgi() -> check_events().
>>>
>>> However, in the non-working case, after entering suspend_cpu() on cpu0,
>>> no interrupts seem to land on cpu1: I get no debug prints from
>>> irqchip_handle_irq or check_events.
>>
>> But is arch_send_event() also called in the broken case?
>>
>> And in both cases cpu1 is inside the guest when the suspension request
>> is started?
>
> Yes, arch_send_event() is also called in the broken case. These are the
> logs with my added annotations:
>
> broken case:
>
>
>
> ....
>
> Activating hypervisor
>
> psci_dispatch: 0xc4000001
>
> [   18.583357] The Jailhouse is opening.
>
> gicv2_send_sgi: cpu 0
>
> irqchip_handle_irq (sgi, cpu 1)
>
> gicv2_send_sgi: cpu 0
>
> gicv2_send_sgi: cpu 1
>
> gicv2_send_sgi: cpu 0
>
> irqchip_handle_irq (sgi, cpu 1)
>
> irqchip_handle_irq (sgi, cpu 0)
>
> irqchip_handle_irq (sgi, cpu 1)
>
> gicv2_send_sgi: cpu 0
>
> irqchip_handle_irq (sgi, cpu 1)
>
> ....
>
> [   18.681300] CPU1: shutdown
>
> psci_dispatch: 0x84000002
>
> psci_dispatch: 0xc4000004
>
> [   18.688683] psci: CPU1 killed (polled 4 ms)
>
> [   18.693551] All CPUs removed!
>
> cell_suspend: Running on cpu #0
>
> About to suspend cpu #1
>
> suspend_cpu()
>
> arch_send_event
>
> gicv2_send_sgi: cpu 0
>
> suspend_cpu() loop
>
>
>
> =========================================================
>
> working case:
>
>
>
> ....
>
> Activating hypervisor
>
> gicv2_send_sgi: cpu 1
>
> irqchip_handle_irq (sgi, cpu 0)
>
> gicv2_send_sgi: cpu 1
>
> irqchip_handle_irq (sgi, cpu 0)
>
> [   17.908806] The Jailhouse is opening.
>
> gicv2_send_sgi: cpu 1
>
> gicv2_send_sgi: cpu 0
>
> irqchip_handle_irq (sgi, cpu 1)
>
> irqchip_handle_irq (sgi, cpu 0)
>
> gicv2_send_sgi: cpu 1
>
> gicv2_send_sgi: cpu 0
>
> irqchip_handle_irq (sgi, cpu 1)
>
> irqchip_handle_irq (sgi, cpu 0)
>
> ....
>
> psci_dispatch: 0x84000002
>
> [   18.008498] CPU1: shutdown
>
> psci_dispatch: 0xc4000004
>
> [   18.014133] psci: CPU1 killed (polled 4 ms)
>
> [   18.019385] All CPUs removed!
>
> cell_suspend: Running on cpu #0
>
> About to suspend cpu #1
>
> suspend_cpu()
>
> arch_send_event
>
> gicv2_send_sgi: cpu 0

I assume "cpu 0" means the sending CPU, not the target. Let's also dump
the value that gicv2_send_sgi writes to GICD_SGIR, to check if it's the
same in both cases.
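
To make the dumped values easier to compare, here is a minimal sketch of a
decoder for the GICv2 GICD_SGIR fields. The struct and function names
(sgir_fields, sgir_decode, sgir_format) are hypothetical and not part of
Jailhouse; only the bit layout is taken from the GICv2 architecture
specification. It uses the C library for formatting, so inside the
hypervisor the fields would have to be printed with whatever print helper
is available there instead.

```c
#include <stdint.h>
#include <stdio.h>

/* Fields of a value written to GICD_SGIR on a GICv2 (hypothetical type). */
struct sgir_fields {
	unsigned int filter;      /* bits [25:24]: TargetListFilter */
	unsigned int target_list; /* bits [23:16]: CPUTargetList, one bit per CPU interface */
	unsigned int nsatt;       /* bit 15: NSATT */
	unsigned int sgi_id;      /* bits [3:0]: SGIINTID */
};

/* Split a raw GICD_SGIR value into its fields. */
static struct sgir_fields sgir_decode(uint32_t val)
{
	struct sgir_fields f = {
		.filter      = (val >> 24) & 0x3,
		.target_list = (val >> 16) & 0xff,
		.nsatt       = (val >> 15) & 0x1,
		.sgi_id      = val & 0xf,
	};
	return f;
}

/* Format the decoded fields into buf; returns what snprintf returns. */
static int sgir_format(uint32_t val, char *buf, size_t len)
{
	struct sgir_fields f = sgir_decode(val);

	return snprintf(buf, len,
			"GICD_SGIR=0x%08x filter=%u targets=0x%02x nsatt=%u sgi=%u",
			(unsigned int)val, f.filter, f.target_list,
			f.nsatt, f.sgi_id);
}
```

With filter 0, the SGI goes to the CPU interfaces listed in targets, so a
request aimed at CPU 1 should show bit 1 set there (targets=0x02). Filter 1
means "all but the sender", filter 2 "sender only".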

Furthermore, it would be good to instrument vm-entry/exit to identify
whether CPU 1 is in guest or host mode.
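
One possible way to do that without printing on every entry and exit is a
per-CPU mode marker that the waiting CPU can read from the suspend_cpu()
loop. This is only a sketch: MAX_CPUS, the enum and the trace_*() helpers
are hypothetical names, it uses C11 atomics rather than anything from the
Jailhouse tree, and the right places to call the hooks from on arm64 still
have to be found in the entry/exit paths.

```c
#include <stdatomic.h>

#define MAX_CPUS 8 /* hypothetical upper bound, adjust to the platform */

enum cpu_mode { CPU_MODE_UNKNOWN = 0, CPU_MODE_GUEST, CPU_MODE_HOST };

/* One marker per CPU, written only by the owning CPU, read by any CPU. */
static _Atomic int cpu_mode[MAX_CPUS];

/* Call right after the CPU has left the guest. */
static void trace_vm_exit(unsigned int cpu)
{
	if (cpu < MAX_CPUS)
		atomic_store(&cpu_mode[cpu], CPU_MODE_HOST);
}

/* Call right before the CPU returns to the guest. */
static void trace_vm_entry(unsigned int cpu)
{
	if (cpu < MAX_CPUS)
		atomic_store(&cpu_mode[cpu], CPU_MODE_GUEST);
}

/* Readable name of the last recorded mode, e.g. for a debug print. */
static const char *cpu_mode_name(unsigned int cpu)
{
	if (cpu >= MAX_CPUS)
		return "invalid";

	switch (atomic_load(&cpu_mode[cpu])) {
	case CPU_MODE_GUEST:
		return "guest";
	case CPU_MODE_HOST:
		return "host";
	default:
		return "unknown";
	}
}
```

Printing cpu_mode_name(1) from the waiting loop on CPU 0 would then show
where CPU 1 was last seen, even if CPU 1 itself never prints again.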

Jan

>
> suspend_cpu() loop
>
> irqchip_handle_irq (sgi, cpu 1)
>
> check_events: running on cpu #1
>
> Created cell "inmate-demo"
>
> Page pool usage after cell creation: mem 62/992, remap 37/131072
>
> [   18.055831] Created Jailhouse cell "inmate-demo"
>
>>
>>>
>>> Maybe there is a HW problem? But why does it seem to work sometimes?
>>>
>>
>> I would suspect a HW problem only after truly excluding all software
>> issues.
>>
>> Jan
>>
>
