Hi Stefano,

Thank you for your reply!
On 06/02/2026 17:38, Stefano Garzarella wrote:
> On Fri, Feb 06, 2026 at 12:54:13PM +0100, Matthieu Baerts wrote:
>> Hi Stefan, Stefano, + VM, RCU, sched people,
>
> Hi Matt,
>
>>
>> First, I'm sorry to cc a few MLs, but I'm still trying to locate the
>> origin of the issue I'm seeing.
>>
>> Our CI for the MPTCP subsystem is now regularly hitting various stalls
>> before even starting the MPTCP test suite. These issues are visible on
>> top of the latest net and net-next trees, which were synced with
>> Linus' tree yesterday. All these issues have been seen on a "public CI"
>> using GitHub-hosted runners with KVM support, where the tested kernel is
>> launched in a nested (I suppose) VM. I can see the issue with or without
>
> Just to be sure I'm on the same page, the issue is in the most nested
> guest, right? (the last VM started)

That's correct. From what I see [1], each GitHub-hosted runner is a new
VM, and I'm launching QEMU from there.

[1] https://docs.github.com/en/actions/concepts/runners/github-hosted-runners

>> debug.config. According to the logs, it might have started around
>> v6.19-rc0, but I was unavailable for a few weeks and couldn't react
>> more quickly, sorry about that. Unfortunately, I cannot reproduce this
>> locally, and the CI doesn't currently have the ability to run bisections.
>>
>> The stalls happen before starting the MPTCP test suite. The init program
>> creates a VSOCK listening socket via socat [1], and different hangs are
>> then visible: RCU stalls followed by a soft lockup [2], only a soft
>> lockup [3], sometimes the soft lockup comes with a delay [4] [5], or
>> there are no RCU stalls or soft lockups detected after one minute, but
>> the VM is stalled [6]. In the last case, the VM is stopped after having
>> launched GDB to get more details about what was being executed.
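(For anyone in CC unfamiliar with the setup: the listener the init program
creates is conceptually equivalent to the small Python sketch below. This
is only an illustration, not the CI's actual command; the port number is a
placeholder, and the CI really uses socat, not Python.)

```python
# Hedged sketch of a VSOCK listening socket, similar in spirit to what
# the init program sets up via socat. Not the CI's actual configuration.
import socket

PORT = 1234  # placeholder port, not the one the CI uses

try:
    srv = socket.socket(socket.AF_VSOCK, socket.SOCK_STREAM)
    # VMADDR_CID_ANY accepts connections addressed to any CID in the guest
    srv.bind((socket.VMADDR_CID_ANY, PORT))
    srv.listen()
    status = "listening"
    srv.close()
except (AttributeError, OSError):
    # AF_VSOCK needs a vsock transport (e.g. virtio_vsock in the guest);
    # outside such a VM, socket creation typically fails with EAFNOSUPPORT.
    status = "vsock unavailable"

print(status)
```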
>>
>> It feels like the issue is not directly caused by the VSOCK listening
>> socket, but the stalls always happen after having started the socat
>> command [1] in the background.
>>
>> One last thing: I thought my issue was linked to another one seen on
>> the XFS side and reported by Shinichiro Kawasaki [7], but apparently
>> not. Indeed, Paul McKenney mentioned that Shinichiro's issue is
>> probably fixed by Thomas Gleixner's series called "sched/mmcid: Cure
>> mode transition woes" [8]. I applied these patches from Peter
>> Zijlstra's tree from tip/sched/urgent [9], and my issue is still
>> present.
>>
>> Any idea what could cause that, where to look, or what could help to
>> find the root cause?
>
> Mmm, nothing comes to mind on the vsock side :-(

That's OK, thank you for having checked! I hope someone else in CC can
help me find the root cause!

> I understand that bisection can't be done in the CI env, but can you
> confirm in some way that 6.18 is working right with the same userspace?

Yes, I can confirm that. We run the tests on both the dev ("export") and
fixes ("export-net") branches, but also on stable versions:

  https://ci-results.mptcp.dev/flakes.html

(The "critical issues" have their headers in red.) We don't see such
issues with v6.18 and older kernels.

> That could help to try to identify at least if there is anything in
> AF_VSOCK we merged recently that can trigger that.

Our dev branch is on top of net-next, so I guess I would have seen issues
directly related to AF_VSOCK earlier than after the net-next freeze in
January. Here, it looks like the first issues came during Linus' merge
window at the beginning of December, e.g. [2] is from the 4th of
December, on top of 'net', which was at commit 8f7aa3d3c732 ("Merge tag
'net-next-6.19' of
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next") from
Linus' tree.
[2] https://github.com/multipath-tcp/mptcp_net-next/actions/runs/19919313666/job/57104626001#step:7:5052

Cheers,
Matt
-- 
Sponsored by the NGI0 Core fund.

