On 29.09.21 11:48, Martin Kaistra wrote:
> On 29.09.21 at 11:12, Jan Kiszka wrote:
>> On 29.09.21 09:50, Martin Kaistra wrote:
>>> On 28.09.21 at 11:28, Jan Kiszka wrote:
>>>> On 27.09.21 10:30, Martin Kaistra wrote:
>>>>> On 24.09.21 at 17:46, Jan Kiszka wrote:
>>>>>>
>>>>>> If suspend_cpu() does not progress, the target CPU is not reacting
>>>>>> properly to the request to leave the guest and service the Jailhouse
>>>>>> commands. Could be that your interrupts are not handled properly. Run
>>>>>> "jailhouse config check" on your setup, maybe you are passing the
>>>>>> interrupt controller through.
>>>>>>
>>>>>> Or are you using SDEI-based management interrupts? Would require a
>>>>>> special TF-A version, so likely does not happen "by chance".
>>>>>>
>>>>>> Jan
>>>>>>
>>>>>
>>>>> Hi Jan,
>>>>>
>>>>> "jailhouse config check" finds no problems with the root cell and
>>>>> inmate configs.
>>>>> Also, SDEI is not active. gicv2_send_sgi() is being used.
>>>>>
>>>>
>>>> Then it would be good to continue debugging, now trying to understand
>>>> what the target CPU is doing.
>>>>
>>>> The CPU that requests the suspend sets suspend_cpu in the target data
>>>> structure, then sends an IPI to that CPU and waits for the other side
>>>> to confirm this via setting cpu_suspended. Check if the target CPU
>>>> received the IPI, left the guest mode or what else it does by
>>>> instrumenting the related code paths (check_events on arm64).
>>>>
>>>> Jan
>>>>
>>>
>>> When there is no freeze, I can see that after cpu0 calls
>>> arch_send_event() -> gicv2_send_sgi() from suspend_cpu(), on cpu1 there
>>> is irqchip_handle_irq() -> arch_handle_sgi() -> check_events().
>>>
>>> However, in the non-working case, after going into suspend_cpu() on
>>> cpu0, there seem to be no interrupts landing on cpu1. I get no debug
>>> prints from irqchip_handle_irq or check_events.
>>
>> But is arch_send_event() also called in the broken case?
>>
>> And in both cases cpu1 is inside the guest when the suspension request
>> is started?
>
> Yes, arch_send_event() is also called in the broken case. These are the
> logs with my added annotations:
> > broken case:
> >
> > ....
> > Activating hypervisor
> > psci_dispatch: 0xc4000001
> > [ 18.583357] The Jailhouse is opening.
> > gicv2_send_sgi: cpu 0
> > irqchip_handle_irq (sgi, cpu 1)
> > gicv2_send_sgi: cpu 0
> > gicv2_send_sgi: cpu 1
> > gicv2_send_sgi: cpu 0
> > irqchip_handle_irq (sgi, cpu 1)
> > irqchip_handle_irq (sgi, cpu 0)
> > irqchip_handle_irq (sgi, cpu 1)
> > gicv2_send_sgi: cpu 0
> > irqchip_handle_irq (sgi, cpu 1)
> > ....
> > [ 18.681300] CPU1: shutdown
> > psci_dispatch: 0x84000002
> > psci_dispatch: 0xc4000004
> > [ 18.688683] psci: CPU1 killed (polled 4 ms)
> > [ 18.693551] All CPUs removed!
> > cell_suspend: Running on cpu #0
> > About to suspend cpu #1
> > suspend_cpu()
> > arch_send_event
> > gicv2_send_sgi: cpu 0
> > suspend_cpu() loop
> >
> > =========================================================
> > working case:
> >
> > ....
> > Activating hypervisor
> > gicv2_send_sgi: cpu 1
> > irqchip_handle_irq (sgi, cpu 0)
> > gicv2_send_sgi: cpu 1
> > irqchip_handle_irq (sgi, cpu 0)
> > [ 17.908806] The Jailhouse is opening.
> > gicv2_send_sgi: cpu 1
> > gicv2_send_sgi: cpu 0
> > irqchip_handle_irq (sgi, cpu 1)
> > irqchip_handle_irq (sgi, cpu 0)
> > gicv2_send_sgi: cpu 1
> > gicv2_send_sgi: cpu 0
> > irqchip_handle_irq (sgi, cpu 1)
> > irqchip_handle_irq (sgi, cpu 0)
> > ....
> > psci_dispatch: 0x84000002
> > [ 18.008498] CPU1: shutdown
> > psci_dispatch: 0xc4000004
> > [ 18.014133] psci: CPU1 killed (polled 4 ms)
> > [ 18.019385] All CPUs removed!
> > cell_suspend: Running on cpu #0
> > About to suspend cpu #1
> > suspend_cpu()
> > arch_send_event
> > gicv2_send_sgi: cpu 0

I assume "cpu 0" means the sending CPU, not the target. Let's also dump
the value that gicv2_send_sgi writes to GICD_SGIR, to check if it's the
same in both cases. Furthermore, it would be good to instrument
vm-entry/exit to identify whether CPU 1 is in guest or host mode.

Jan

> > suspend_cpu() loop
> > irqchip_handle_irq (sgi, cpu 1)
> > check_events: running on cpu #1
> > Created cell "inmate-demo"
> > Page pool usage after cell creation: mem 62/992, remap 37/131072
> > [ 18.055831] Created Jailhouse cell "inmate-demo"
>
>>
>>>
>>> Maybe there is a HW problem? But why does it seem to work sometimes..
>>>
>>
>> I would call for a HW problem only after truly excluding all software
>> issues.
>>
>> Jan
>>
>
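
For comparing the GICD_SGIR values between the broken and the working
case, a small decoder might make the dump easier to read. This is only a
sketch, not Jailhouse code: the names sgir_fields, decode_sgir and
print_sgir are hypothetical, and the bit layout is the one from the GICv2
architecture specification (TargetListFilter in bits [25:24],
CPUTargetList in bits [23:16], NSATT in bit 15, SGIINTID in bits [3:0]).

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper type: the decoded fields of a GICv2 GICD_SGIR write. */
struct sgir_fields {
	unsigned int target_list_filter; /* 0: use CPUTargetList, 1: all but self, 2: self only */
	unsigned int cpu_target_list;    /* bitmask of target CPU interfaces */
	unsigned int nsatt;              /* security attribute bit (bit 15) */
	unsigned int sgi_id;             /* SGI interrupt ID, 0..15 */
};

/* Split a raw GICD_SGIR value into its fields. */
static struct sgir_fields decode_sgir(uint32_t val)
{
	struct sgir_fields f;

	f.target_list_filter = (val >> 24) & 0x3;
	f.cpu_target_list = (val >> 16) & 0xff;
	f.nsatt = (val >> 15) & 0x1;
	f.sgi_id = val & 0xf;
	return f;
}

/* Print one decoded value; in the hypervisor this would be a debug print. */
static void print_sgir(uint32_t val)
{
	struct sgir_fields f = decode_sgir(val);

	printf("GICD_SGIR=0x%08x filter=%u targets=0x%02x nsatt=%u sgi=%u\n",
	       (unsigned int)val, f.target_list_filter, f.cpu_target_list,
	       f.nsatt, f.sgi_id);
}
```

If the target mask for the suspend request towards cpu 1 does not have
bit 1 set (cpu_target_list == 0x02 with filter 0), the SGI is not being
addressed to CPU interface 1, which would point at the sender rather
than the receiver.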
