On Wed, 2023-08-23 at 22:16 +0100, Richard Purdie via lists.openembedded.org wrote:
> On Tue, 2023-08-22 at 23:01 +0100, Richard Purdie via
> lists.openembedded.org wrote:
> > so the commands are stopping mid flow for unknown reasons or the ssh
> > connection fails. I can't tell if this coincides with an rcu stall or
> > not. Both logs do have rcu stalls in.
> >
> > After these failures the system does continue to otherwise work
> > normally and subsequent tests pass.
> >
> > I wonder if the slow emulation might be causing the networking to
> > glitch and break the ssh connection.
> >
> > I'm at a bit of a loss on where to go from here.
>
> I thought I'd update the thread with new information.
>
> I went back to the start with this and looked again at what is going
> on. Interestingly, I found one of the autobuilder workers would
> consistently fail the qemuppc-alt configuration for
> core-image-sato-sdk. I paused the worker and experimented.
>
> I saw two different failures (included below). One shows systemd-udevd
> timing out on its watchdog after 3 minutes and resetting, including
> taking out an ssh session running the cpio configure command. There was
> no RCU stall reported.
>
> The second failure shows systemd-logind as well as systemd-udevd with
> the 3 minute time out, the kernel complaining about missed IRQs, an RCU
> stall and lots of breakage following including cut ssh commands.
>
> I could not get the cpio build test to complete.
>
> Interestingly, I came back to the same image/worker later this evening
> and now it all works fine. The difference is earlier there was a world
> build running on the worker, which continued to wind down even after I
> paused the worker. By the evening, that background load was no longer
> present and the ppc image works in isolation. This tells us the issue
> is system load dependent and only occurs on loaded systems.
>
> I suspect I need to replicate the load and retry locally, see if I can
> reliably reproduce the hang. The watchdog won't be present on sysvinit
> systems which also show the issues but I'd guess there is still some
> other starvation/timeout occurring.

I've now seen the failure on the autobuilder:

* with linux-yocto 6.1.38
* with linux-yocto 6.1.46
* with qemu 8.0.4
* with qemu 8.0.3
* with qemu 8.0.0

I was a little suspicious of:

"hw/ppc: Fix clock update drift"
https://gitlab.com/qemu-project/qemu/-/commit/73d6ac24c81f1aeae554d469616c9181511e6523

but we've tested with and without that. qemu has just released 8.1.0 so
perhaps we should try that next.

I'm still struggling to pin down exactly which change caused the
problems to start...

Cheers,

Richard
