Hi Stefano,

Thank you for your reply!

On 06/02/2026 17:38, Stefano Garzarella wrote:
> On Fri, Feb 06, 2026 at 12:54:13PM +0100, Matthieu Baerts wrote:
>> Hi Stefan, Stefano, + VM, RCU, sched people,
> 
> Hi Matt,
> 
>>
>> First, I'm sorry to cc a few MLs, but I'm still trying to locate the
>> origin of the issue I'm seeing.
>>
>> Our CI for the MPTCP subsystem is now regularly hitting various stalls
>> before even starting the MPTCP test suite. These issues are visible on
>> top of the latest net and net-next trees, which have been synced with
>> Linus' tree yesterday. All these issues have been seen on a "public CI"
>> using GitHub-hosted runners with KVM support, where the tested kernel is
>> launched in a nested (I suppose) VM. I can see the issue with or without
> 
> Just to be sure I'm on the same page, the issue is in the most nested
> guest, right? (the last VM started)

That's correct. From what I see [1], each GitHub-hosted runner is a new
VM, and I'm launching QEMU from there.

[1]
https://docs.github.com/en/actions/concepts/runners/github-hosted-runners
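(As an aside, when I debug this kind of nested setup, a quick sanity
check that the runner actually exposes KVM can rule out slow TCG
emulation as a factor. This is a hypothetical snippet for illustration,
not part of the CI:)

```shell
#!/bin/sh
# Hypothetical sanity check, not part of the MPTCP CI: confirm the
# GitHub-hosted runner exposes a usable /dev/kvm before launching the
# nested guest; without it, QEMU silently falls back to (much slower)
# TCG emulation.
if [ -r /dev/kvm ] && [ -w /dev/kvm ]; then
    echo "KVM available: nested guest can use hardware acceleration"
else
    echo "no usable /dev/kvm: QEMU would fall back to TCG emulation"
fi
```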

>> debug.config. According to the logs, it might have started around
>> v6.19-rc0, but I was unavailable for a few weeks, and I couldn't react
>> quicker, sorry for that. Unfortunately, I cannot reproduce this locally,
>> and the CI doesn't currently have the ability to execute bisections.
>>
>> The stalls happen before starting the MPTCP test suite. The init program
>> creates a VSOCK listening socket via socat [1], and different hangs are
>> then visible: RCU stalls followed by a soft lockup [2], only a soft
>> lockup [3], sometimes the soft lockup comes with a delay [4] [5], or
>> there are no RCU stalls or soft lockups detected after one minute, but the VM
>> is stalled [6]. In the last case, the VM is stopped after having
>> launched GDB to get more details about what was being executed.
>>
>> It feels like the issue is not directly caused by the VSOCK listening
>> socket, but the stalls always happen after having started the socat
>> command [1] in the background.
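(For reference, the listener socat sets up there is roughly equivalent
to this Python sketch. The port number is hypothetical and the actual
command is behind link [1]; this assumes Linux with vsock support:)

```python
import socket

# Roughly what `socat VSOCK-LISTEN:<port>,fork ...` sets up; the actual
# command and port used by the init script are behind link [1] above.
VMADDR_CID_ANY = 0xFFFFFFFF  # "any CID" wildcard from <linux/vm_sockets.h>
PORT = 10000                 # hypothetical port, for illustration only

def make_vsock_listener(port=PORT):
    """Create a VSOCK listening socket bound to any CID."""
    s = socket.socket(socket.AF_VSOCK, socket.SOCK_STREAM)
    s.bind((VMADDR_CID_ANY, port))
    s.listen()
    return s

if __name__ == "__main__":
    try:
        lst = make_vsock_listener()
        print("listening on vsock port", PORT)
        lst.close()
    except OSError as exc:
        # Expected when no vsock transport is loaded (e.g. bare container).
        print("vsock unavailable:", exc)
```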
>>
>> One last thing: I thought my issue was linked to another one seen on XFS
>> side and reported by Shinichiro Kawasaki [7], but apparently not.
>> Indeed, Paul McKenney mentioned Shinichiro's issue is probably fixed by
>> Thomas Gleixner's series called "sched/mmcid: Cure mode transition woes"
>> [8]. I applied these patches from Peter Zijlstra's tree from
>> tip/sched/urgent [9], and my issue is still present.
>>
>> Any idea what could cause that, where to look, or what could help to
>> find the root cause?
> 
> Mmm, nothing comes to mind at the vsock side :-(

That's OK, thank you for checking! I hope someone else in CC can help
me find the root cause!

> I understand that bisection can't be done in the CI env, but can you
> confirm in some way that 6.18 is working right with the same userspace?

Yes, I can confirm that. We run the tests on both the dev ("export") and
fixes ("export-net") branches, but also on stable versions:

  https://ci-results.mptcp.dev/flakes.html

(The "critical issues" are the ones with red headers.)

We don't see such issues with v6.18 and older kernels.

> That could help to try to identify at least if there is anything in
> AF_VSOCK we merged recently that can trigger that.

Our dev branch is on top of net-next, so I guess I would have seen
issues directly related to AF_VSOCK earlier than after the net-next
freeze in January. Here, it looks like the first issues appeared during
Linus' merge window at the beginning of December, e.g. [2] is from the
4th of December, on top of 'net', which was at commit 8f7aa3d3c732
("Merge tag 'net-next-6.19' of
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next") from
Linus' tree.

[2]
https://github.com/multipath-tcp/mptcp_net-next/actions/runs/19919313666/job/57104626001#step:7:5052

Cheers,
Matt
-- 
Sponsored by the NGI0 Core fund.

