Matthieu!

On Mon, Mar 09 2026 at 13:23, Matthieu Baerts wrote:
> On 09/03/2026 09:43, Thomas Gleixner wrote:
>> That should provide enough information to decode this mystery.
That was wishful thinking, but at least it narrows down the search space.

> Thank you for the debug patch and the clear instructions. I managed to
> reproduce the issue with the extra debug. The output is available here:
>
> https://github.com/user-attachments/files/25841808/issue-617-debug.txt.gz

Thank you for testing. So what I can see from the trace is:

[    2.101917] virtme-n-68    3d..1. 703536us : mmcid_user_add: t=00000000e4425b1d mm=00000000a22be644 users=3
[    2.102057] virtme-n-68    3d..1. 703537us : mmcid_getcid: mm=00000000a22be644 cid=00000002
[    2.102195] virtme-n-68    3d..2. 703548us : sched_switch: prev_comm=virtme-ng-init prev_pid=68 prev_prio=120 prev_state=D ==> next_comm=swapper/3 next_pid=0 next_prio=120

This one creates the third thread related to the mm and schedules out.
The new thread schedules in a moment later:

[    2.102828]   <idle>-0      2d..2. 703565us : sched_switch: prev_comm=swapper/2 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=virtme-ng-init next_pid=69 next_prio=120
[    2.103039]   <idle>-0      2d..2. 703567us : mmcid_cpu_update: cpu=2 mm=00000000a22be644 cid=00000002
[    2.104283]   <idle>-0      0d..2. 703642us : sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=virtme-ng-init next_pid=1 next_prio=120
[    2.104493]   <idle>-0      0d..2. 703643us : mmcid_cpu_update: cpu=0 mm=00000000a22be644 cid=00000000

virtme-n-1 owns CID 0, and after being scheduled in it creates the 4th
thread, which is still in the CID space (0..3):

[    2.104616] virtme-n-1     0d..1. 703690us : mmcid_user_add: t=0000000031a5ee91 mm=00000000a22be644 users=4

Unsurprisingly this assigns CID 3:

[    2.104757] virtme-n-1     0d..1. 703691us : mmcid_getcid: mm=00000000a22be644 cid=00000003

And the newly created task schedules in on CPU3:

[    2.104880]   <idle>-0      3d..2. 703708us : sched_switch: prev_comm=swapper/3 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=virtme-ng-init next_pid=70 next_prio=120
[    2.105091]   <idle>-0      3d..2. 703708us : mmcid_cpu_update: cpu=3 mm=00000000a22be644 cid=00000003

Now n-1 continues and creates the 5th thread:

[    2.105227] virtme-n-1     0d..1. 703730us : mmcid_user_add: t=00000000f2e4a8c8 mm=00000000a22be644 users=5

which makes it switch to per CPU ownership mode. Then it continues to go
through the tasks in mm_cid_do_fixup_tasks_to_cpus() and fixes up itself
to be in TRANSIT mode:

[    2.105368] virtme-n-1     0d..1. 703730us : mmcid_task_update: t=00000000c923c125 mm=00000000a22be644 cid=20000000
[    2.105509] virtme-n-1     0d..1. 703731us : mmcid_cpu_update: cpu=0 mm=00000000a22be644 cid=20000000

drops the CID of one task which is not on a CPU:

[    2.105632] virtme-n-1     0d..2. 703731us : mmcid_task_update: t=00000000478c5e8d mm=00000000a22be644 cid=80000000
[    2.105773] virtme-n-1     0d..2. 703731us : mmcid_putcid: mm=00000000a22be644 cid=00000001

and puts the third one correctly into TRANSIT mode:

[    2.105896] virtme-n-1     0d..2. 703731us : mmcid_task_update: t=0000000031a5ee91 mm=00000000a22be644 cid=20000003
[    2.106037] virtme-n-1     0d..2. 703731us : mmcid_cpu_update: cpu=3 mm=00000000a22be644 cid=20000003

[    2.106174] virtme-n-69    2d..2. 703736us : sched_switch: prev_comm=virtme-ng-init prev_pid=69 prev_prio=120 prev_state=S ==> next_comm=swapper/2 next_pid=0 next_prio=120

Here the one which owns CID 2 schedules out without notice, which is
just wrong as the fixup above should have already moved it over to
TRANSIT mode. Why didn't that happen?

The only circumstances under which mm_cid_do_fixup_tasks_to_cpus()
fails to do that are:

  1) task->mm != mm, or

  2) the task is no longer in the task list w/o going through do_exit()

How the heck is either one of them possible?

Just for the record: the picture Jiri decoded from the VM crash dump is
exactly the same. One task is not listed.

Confused,

        tglx

