This bug is awaiting verification that the linux-bluefield/6.8.0-1009.13
kernel in -proposed solves the problem. Please test the kernel and
update this bug with the results. If the problem is solved, change the
tag 'verification-needed-noble-linux-bluefield' to
'verification-done-noble-linux-bluefield'. If the problem still exists,
change the tag 'verification-needed-noble-linux-bluefield' to
'verification-failed-noble-linux-bluefield'.


If verification is not done by 5 working days from today, this fix will
be dropped from the source code, and this bug will be closed.


See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on
how to enable and use -proposed. Thank you!


** Tags added: kernel-spammed-noble-linux-bluefield-v2 verification-needed-noble-linux-bluefield

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2117123

Title:
  rcu: Eliminate deadlocks involving do_exit() and RCU tasks

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Noble:
  Fix Committed

Bug description:
  BugLink: https://bugs.launchpad.net/bugs/2117123

  [Impact]

  Tracing tools, such as eBPF fentry programs, can remain attached to tasks
  until very late in do_exit(). Because of this, synchronize_rcu_tasks() needs
  to wait for the dying task to finish and the tracer to be removed, even
  though the task is no longer on the task list. This is explained in:

  3f95aa81d265 ("rcu: Make TASKS_RCU handle tasks that are almost done
  exiting")

  > Once a task has passed exit_notify() in the do_exit() code path, it is no
  > longer on the task lists, and is therefore no longer visible to
  > rcu_tasks_kthread().

  SRCU was brought in to handle this issue: it waits for tasks that could still
  be in a critical section but are no longer on the RCU tasks list.
  Unfortunately, a class of deadlocks in do_exit() has existed for years and
  has been largely ignored, but it was recently reproduced by a syzkaller
  script:

  https://github.com/xupengfe/syzkaller_logs/blob/main/221115_105658_synchronize_rcu/repro.c

  Frederic Weisbecker provides the following analysis:

  1) TASK A calls unshare(CLONE_NEWPID), this creates a new PID namespace
     that every subsequent child of TASK A will belong to. But TASK A doesn't
     itself belong to that new PID namespace.

  2) TASK A forks() and creates TASK B (it is a new threadgroup so it is a
     thread group leader). TASK A stays attached to its PID namespace (let's
     say PID_NS1) and TASK B is the first task belonging to the new PID
     namespace created by unshare() (let's call it PID_NS2).

  3) Since TASK B is the first task attached to PID_NS2, it becomes the PID_NS2
     child reaper.

  4) TASK A forks() again and creates TASK C which gets attached to PID_NS2.
     Note how TASK C has TASK A as a parent (belonging to PID_NS1) but has
     TASK B (belonging to PID_NS2) as a pid_namespace child_reaper.

  5) TASK B exits and since it is the child reaper for PID_NS2, it has to
     kill all other tasks attached to PID_NS2, and wait for all of them to die
     before reaping itself (zap_pid_ns_processes()). Note it seems to make a
     misleading assumption here, trusting that all tasks in PID_NS2 either
     get reaped by a parent belonging to the same namespace or by TASK B.
     And it is confident that since it deactivated the SIGCHLD handler, all
     the remaining tasks ultimately autoreap. And it waits for that to happen.
     However TASK C escapes that rule because it will get reaped by its parent
     TASK A belonging to PID_NS1.

  6) TASK A calls synchronize_rcu_tasks() which leads to
     synchronize_srcu(&tasks_rcu_exit_srcu).

  7) TASK B is waiting for TASK C to get reaped (wrongly assuming it autoreaps).
     But TASK B is under a tasks_rcu_exit_srcu SRCU critical section
     (exit_notify() is between exit_tasks_rcu_start() and
     exit_tasks_rcu_finish()), blocking TASK A.

  8) TASK C exits and since TASK A is its parent, it waits for it to reap
     TASK C, but it can't because TASK A waits for TASK B that waits for TASK C.

  So there is a circular dependency:

  _ TASK A waits for TASK B to get out of tasks_rcu_exit_srcu SRCU critical
  section
  _ TASK B waits for TASK C to get reaped
  _ TASK C waits for TASK A to reap it.
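
  The circular dependency above can be checked mechanically by treating it as
  a wait-for graph. The following is only an illustrative sketch: the
  dictionary and the find_cycle() helper are hypothetical names made up for
  this example and are not part of the kernel or of the patchset.

```python
# Wait-for graph taken from the three bullets above: each task maps to the
# task it is blocked on. The dictionary form is an illustrative assumption.
WAITS_FOR = {
    "TASK A": "TASK B",  # A waits for B to leave the tasks_rcu_exit_srcu section
    "TASK B": "TASK C",  # B waits for C to get reaped
    "TASK C": "TASK A",  # C waits for A to reap it
}


def find_cycle(waits_for, start):
    """Follow wait-for edges from `start`; return the cycle as a list, or None."""
    path = []
    seen = set()
    node = start
    while node is not None and node not in seen:
        seen.add(node)
        path.append(node)
        node = waits_for.get(node)
    if node is None:
        return None  # the chain ended at a task that is not blocked
    # `node` was already visited: the cycle starts where it first appeared.
    return path[path.index(node):] + [node]


cycle = find_cycle(WAITS_FOR, "TASK A")
print(" -> ".join(cycle))  # TASK A -> TASK B -> TASK C -> TASK A
```

  Removing any one edge from the graph makes find_cycle() return None, which
  is the property the fix has to restore.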

  An example stack trace is:

  kernel: INFO: task rcu_tasks_trace:15 blocked for more than 121 seconds.
  kernel:       Not tainted 6.8.0-63-generic #66-Ubuntu
  kernel: task:rcu_tasks_trace state:D stack:0     pid:15    tgid:15    ppid:2      flags:0x00004000
  kernel: Call Trace:
  kernel:  <TASK>
  kernel:  __schedule+0x27c/0x6b0
  kernel:  schedule+0x33/0x110
  kernel:  schedule_timeout+0x157/0x170
  kernel:  wait_for_completion+0x88/0x150
  kernel:  __wait_rcu_gp+0x17e/0x190
  kernel:  synchronize_rcu+0x12d/0x140
  kernel:  ? __pfx_call_rcu_hurry+0x10/0x10
  kernel:  ? __pfx_wakeme_after_rcu+0x10/0x10
  kernel:  rcu_tasks_trace_postscan+0xe/0x20
  kernel:  rcu_tasks_wait_gp+0x119/0x310
  kernel:  ? _raw_spin_lock_irqsave+0xe/0x20
  kernel:  ? rcu_tasks_need_gpcb+0x1f7/0x350
  kernel:  ? __pfx_rcu_tasks_kthread+0x10/0x10
  kernel:  rcu_tasks_one_gp+0x122/0x150
  kernel:  rcu_tasks_kthread+0xa4/0xd0
  kernel:  kthread+0xef/0x120
  kernel:  ? __pfx_kthread+0x10/0x10
  kernel:  ret_from_fork+0x44/0x70
  kernel:  ? __pfx_kthread+0x10/0x10
  kernel:  ret_from_fork_asm+0x1b/0x30
  kernel:  </TASK>
  kernel: task:system-probe    state:D stack:0     pid:1989  tgid:1931  ppid:1926   flags:0x00000002
  kernel: Call Trace:
  kernel:  <TASK>
  kernel:  __schedule+0x27c/0x6b0
  kernel:  schedule+0x33/0x110
  kernel:  schedule_timeout+0x157/0x170
  kernel:  wait_for_completion+0x88/0x150
  kernel:  __wait_rcu_gp+0x17e/0x190
  kernel:  synchronize_rcu_tasks_generic+0x64/0xe0
  kernel:  ? __pfx_call_rcu_tasks_trace+0x10/0x10
  kernel:  ? __pfx_wakeme_after_rcu+0x10/0x10
  kernel:  synchronize_rcu_tasks_trace+0x15/0x20
  kernel:  perf_event_detach_bpf_prog+0x7d/0xe0
  kernel:  _free_event+0x20e/0x2a0
  kernel:  perf_event_release_kernel+0x281/0x2e0
  kernel:  perf_release+0x15/0x30
  kernel:  __fput+0xa0/0x2e0
  kernel:  __fput_sync+0x1c/0x30
  kernel:  __x64_sys_close+0x3e/0x90
  kernel:  x64_sys_call+0x1fec/0x25a0
  kernel:  do_syscall_64+0x7f/0x180
  kernel:  ? do_syscall_64+0x8c/0x180
  kernel:  ? filp_flush+0x57/0x90
  kernel:  ? syscall_exit_to_user_mode+0x86/0x260
  kernel:  ? do_syscall_64+0x8c/0x180
  kernel:  ? restore_fpregs_from_fpstate+0x3d/0xd0
  kernel:  ? switch_fpu_return+0x55/0xf0
  kernel:  ? filp_flush+0x57/0x90
  kernel:  ? syscall_exit_to_user_mode+0x86/0x260
  kernel:  ? do_syscall_64+0x8c/0x180
  kernel:  ? do_syscall_64+0x8c/0x180
  kernel:  ? filp_flush+0x57/0x90
  kernel:  ? syscall_exit_to_user_mode+0x86/0x260
  kernel:  ? do_syscall_64+0x8c/0x180
  kernel:  ? do_syscall_64+0x8c/0x180
  kernel:  ? do_syscall_64+0x8c/0x180
  kernel:  ? do_syscall_64+0x8c/0x180
  kernel:  ? irqentry_exit_to_user_mode+0x7b/0x260
  kernel:  ? irqentry_exit+0x43/0x50
  kernel:  entry_SYSCALL_64_after_hwframe+0x78/0x80

  [Fix]

  The entire patchset is listed below. 3 of the 7 commits have already been
  applied to ubuntu-noble because they were dependencies of another commit. We
  only need the 4 missing commits.

  This was mainlined in 6.9-rc1 by the following commits:

  commit 2eb52fa8900e642b3b5054c4bf9776089d2a935f
  Author: Paul E. McKenney <[email protected]>
  Date:   Mon Dec 4 09:33:29 2023 -0800
  Subject: rcu-tasks: Repair RCU Tasks Trace quiescence check
  Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2eb52fa8900e642b3b5054c4bf9776089d2a935f
  Applied: Yes. ubuntu-noble 7e16c7d2a1ee

  commit bfe93930ea1ea3c6c115a7d44af6e4fea609067e
  Author: Paul E. McKenney <[email protected]>
  Date:   Mon Feb 5 13:08:22 2024 -0800
  Subject: rcu-tasks: Add data to eliminate RCU-tasks/do_exit() deadlocks
  Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=bfe93930ea1ea3c6c115a7d44af6e4fea609067e
  Applied: Yes. ubuntu-noble b9014deb33e6

  commit 30ef09635b9ed3ebca4f677495332a2e444a5cda
  Author: Paul E. McKenney <[email protected]>
  Date:   Thu Feb 22 12:29:54 2024 -0800
  Subject: rcu-tasks: Initialize callback lists at rcu_init() time
  Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=30ef09635b9ed3ebca4f677495332a2e444a5cda
  Applied: No. Needed.

  commit 46faf9d8e1d52e4a91c382c6c72da6bd8e68297b
  Author: Paul E. McKenney <[email protected]>
  Date:   Mon Feb 5 13:10:19 2024 -0800
  Subject: rcu-tasks: Initialize data to eliminate RCU-tasks/do_exit() deadlocks
  Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=46faf9d8e1d52e4a91c382c6c72da6bd8e68297b
  Applied: Yes. ubuntu-noble c8da4b0160db

  commit 6b70399f9ef3809f6e308fd99dd78b072c1bd05c
  Author: Paul E. McKenney <[email protected]>
  Date:   Fri Feb 2 11:28:45 2024 -0800
  Subject: rcu-tasks: Maintain lists to eliminate RCU-tasks/do_exit() deadlocks
  Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6b70399f9ef3809f6e308fd99dd78b072c1bd05c
  Applied: No. Needed.

  commit 1612160b91272f5b1596f499584d6064bf5be794
  Author: Paul E. McKenney <[email protected]>
  Date:   Fri Feb 2 11:49:06 2024 -0800
  Subject: rcu-tasks: Eliminate deadlocks involving do_exit() and RCU tasks
  Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1612160b91272f5b1596f499584d6064bf5be794
  Applied: No. Needed.

  commit 0bb11a372fc8d7006b4d0f42a2882939747bdbff
  Author: Paul E. McKenney <[email protected]>
  Date:   Thu Feb 1 06:10:26 2024 -0800
  Subject: rcu-tasks: Maintain real-time response in rcu_tasks_postscan()
  Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0bb11a372fc8d7006b4d0f42a2882939747bdbff
  Applied: No. Needed.

  The 4 needed commits are all clean cherry picks.

  [Testcase]

  To reproduce the do_exit() deadlock using the syzkaller repro:

  $ sudo apt install build-essential
  $ wget https://raw.githubusercontent.com/xupengfe/syzkaller_logs/refs/heads/main/221115_105658_synchronize_rcu/repro.c
  $ gcc -o repro repro.c
  $ sudo ./repro
  $ journalctl -f -t kernel
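
  When the reproducer is left running for a while, it can be easier to scan
  saved kernel log output than to watch it live. The sketch below is one
  possible way to do that, assuming the hung-task warning has the same shape
  as the "blocked for more than" line in the example trace in this report.
  The find_hung_tasks() helper is a hypothetical name used for illustration.

```python
import re

# Pattern for the hung-task warning shown in the example trace in this report.
# Assumption: the task name contains no whitespace, which holds for the
# examples here but is not guaranteed in general.
HUNG_TASK_RE = re.compile(
    r"INFO: task (?P<comm>\S+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(log_text):
    """Return (comm, pid, seconds) for each hung-task warning in log_text."""
    return [
        (m.group("comm"), int(m.group("pid")), int(m.group("secs")))
        for m in HUNG_TASK_RE.finditer(log_text)
    ]


sample = (
    "kernel: INFO: task rcu_tasks_trace:15 blocked for more than 121 seconds.\n"
    "kernel:       Not tainted 6.8.0-63-generic #66-Ubuntu\n"
)
print(find_hung_tasks(sample))  # [('rcu_tasks_trace', 15, 121)]
```

  An empty list after the reproducer has run for longer than the hung-task
  timeout suggests, but does not prove, that the deadlock did not trigger.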

  Due to the high regression risk of this patchset, we should run rcutorture,
  the RCU test suite, on a patched kernel to ensure there are no deadlocks.

  To run rcutorture on the kernel build:

  Documentation:
  https://docs.kernel.org/RCU/torture.html

  1) Clone the kernel source code
  2) Save the following patch, which enables CONFIG_RCU_TORTURE_TEST, as
  0001-UBUNTU-Config-Enable-CONFIG_RCU_TORTURE_TEST.patch:
  https://launchpadlibrarian.net/805611005/0001-UBUNTU-Config-Enable-CONFIG_RCU_TORTURE_TEST.patch
  3) $ git am 0001-UBUNTU-Config-Enable-CONFIG_RCU_TORTURE_TEST.patch
  4) Build a new kernel with the patch applied, boot into it.
  5) $ modprobe rcutorture
  6) Follow dmesg.
  $ journalctl -f -t kernel
  kernel: rcu-torture: rcu_torture_read_exit: Start of episode
  kernel: rcu-torture: rcu_torture_read_exit: End of episode
  kernel: rcu_torture_fwd_prog_nr: 0 Duration 50060 cver 1081 gps 1490
  kernel: rcu_torture_fwd_prog_nr: Waiting for CBs: rcu_barrier+0x0/0x80() 0
  kernel: rcu-torture: rtc: 00000000c099ebf1 ver: 62341 tfle: 0 rta: 62342 rtaf: 0 rtf: 62331 rtmbe: 0 rtmbkf: 0/48597 rtbe: 0 rtbke: 0 rtbf: 0 rtb: 0 nt: 1396993 onoff: 0/0:0/0 -1,0:-1,0 0:0 (HZ=1000) barrier: 0/0:0 read-exits: 1792 nocb-toggles: 0:0
  kernel: rcu-torture: Reader Pipe:  2350715188 99444 0 0 0 0 0 0 0 0 0
  kernel: rcu-torture: Reader Batch:  2350551525 263107 0 0 0 0 0 0 0 0 0
  kernel: rcu-torture: Free-Block Circulation:  62341 62340 62339 62338 62336 62335 62334 62333 62332 62331 0

  Read the documentation and ensure you see "Success" and no "FAILURE" messages.
  Ensure all the values that should be 0 are indeed 0.
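
  The check described above can be partly automated. The following is a rough
  sketch, not a replacement for reading the documentation: the list of
  counters that must be zero (rtmbe, rtbe, rtbke) and the rule that "Reader
  Pipe" entries past the first two must be zero are assumptions based on the
  rcutorture documentation linked earlier and should be confirmed against it.
  The check_rcutorture() helper is a hypothetical name.

```python
import re

# Assumption: per the rcutorture documentation, these counters should stay at
# zero on a healthy kernel. Confirm the list against the documentation.
MUST_BE_ZERO = ("rtmbe", "rtbe", "rtbke")


def check_rcutorture(log_text):
    """Return a list of human-readable problems found in rcutorture output."""
    problems = []
    for line in log_text.splitlines():
        if "FAILURE" in line:
            problems.append("FAILURE reported: " + line.strip())
        for name in MUST_BE_ZERO:
            m = re.search(r"\b" + name + r": (\d+)", line)
            if m and int(m.group(1)) != 0:
                problems.append(f"{name} is {m.group(1)}, expected 0")
        m = re.search(r"Reader Pipe:\s+(.*)", line)
        if m:
            # Assumption: entries after the first two are expected to be zero.
            buckets = [int(x) for x in re.findall(r"\d+", m.group(1))]
            if any(buckets[2:]):
                problems.append("Reader Pipe has non-zero entries past the first two")
    return problems


sample = (
    "kernel: rcu-torture: rtc: 00000000c099ebf1 ver: 62341 tfle: 0 rta: 62342 "
    "rtaf: 0 rtf: 62331 rtmbe: 0 rtmbkf: 0/48597 rtbe: 0 rtbke: 0 rtbf: 0 rtb: 0\n"
    "kernel: rcu-torture: Reader Pipe:  2350715188 99444 0 0 0 0 0 0 0 0 0\n"
)
print(check_rcutorture(sample) or "no problems found")
```

  The sample lines are shortened from the output shown above. An empty result
  only means none of these specific checks fired.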

  Leave rcutorture running for several hours / days.

  There is a test kernel available in the following ppa:

  https://launchpad.net/~mruffell/+archive/ubuntu/sf411904-config

  If you install it, it should not deadlock on the reproducer anymore, and you
  can also load the rcutorture kernel module for regression testing.

  [Where problems could occur]

  We are changing what happens to tasks that are late in do_exit(): they are
  now added to a new list that keeps track of them while they could be in an
  RCU critical section.

  These are large changes to the RCU subsystem, and they affect nearly every
  other subsystem of the kernel, as RCU is used everywhere.

  If a regression were to occur, it would involve RCU grace periods getting
  stuck, leading to deadlocks and hung task timeouts with no real workarounds.

  We need to ensure we test this change with rcutorture for the whole time the
  kernel is in -proposed.

  [Other info]

  Upstream mailing list report:
  https://lore.kernel.org/lkml/[email protected]/T/#u

  Paul E. McKenney's architecture document:
  https://docs.google.com/document/d/1hJxgiZ5TMZ4YJkdJPLAkRvq7sYQ-A7svgA8no6i-v8k/edit?usp=sharing

  syzkaller scripts, C reproducer, dmesg logs:
  https://github.com/xupengfe/syzkaller_logs/tree/main/221115_105658_synchronize_rcu

  Upstream mailing list submission:
  https://lore.kernel.org/lkml/[email protected]/T/#u

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2117123/+subscriptions


-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
