This bug is missing log files that will aid in diagnosing the problem.
While running an Ubuntu kernel (not a mainline or third-party kernel),
please enter the following command in a terminal window:

apport-collect 1821259

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable
to run this command, please add a comment stating that fact and change
the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the
Ubuntu Kernel Team.

** Changed in: linux (Ubuntu)
       Status: New => Incomplete

** Tags added: xenial

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1821259

Title:
  Hard lockup in 2 CPUs due to deadlock in cpu_stoppers

Status in linux package in Ubuntu:
  Incomplete

Bug description:
  [Impact]

   * This problem hard-locks 2 CPUs in a deadlock, which in turn
     soft-locks other CPUs; the system becomes unusable.

   * This is relatively rare and difficult to hit because it is a
     corner case in scheduling/load balancing that needs the right
     timing with the CPU stopper code, and it needs an SMP plus
     _NUMA_ system. (It can, however, be hit with the synthetic
     test case attached to this LP bug.)

   * Since SMP plus NUMA usually means _servers_, it is worth
     preventing this bug and the hard lockups and reboots it causes.

   * The fix resolves the potential deadlock by moving one of the
     calls needed for the deadlock out from under the lock.
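
As an illustration of the last point, here is a minimal Python sketch
(a toy model, not the kernel code) of issuing a wakeup only after the
lock has been dropped. The `Stopper` class and the `queue_work_locked`
and `queue_work_deferred` helpers are hypothetical names invented for
this example, and it is an assumption of the sketch that the wakeup is
the call that must not run under the lock; the real patches touch
kernel/stop_machine.c and may differ in detail.

```python
import threading


class Stopper:
    """Toy stand-in for a per-CPU stopper: a lock plus a work list."""

    def __init__(self):
        self.lock = threading.Lock()
        self.works = []
        self.wakeups = 0

    def wake(self):
        # Stand-in for waking the stopper thread (hypothetical). It refuses
        # to run while the lock is held, so a misplaced call fails loudly
        # in this model instead of risking a hang.
        if self.lock.locked():
            raise RuntimeError("wakeup issued under the stopper lock")
        self.wakeups += 1


def queue_work_locked(stopper, work):
    """Problematic shape: the wakeup happens inside the locked region."""
    with stopper.lock:
        stopper.works.append(work)
        stopper.wake()  # raises RuntimeError in this model


def queue_work_deferred(stopper, work):
    """Fixed shape: record the need to wake, then wake after unlocking."""
    with stopper.lock:
        stopper.works.append(work)
        need_wake = True
    if need_wake:
        stopper.wake()
```

In this model, `queue_work_deferred(Stopper(), "w")` queues the work and
performs one wakeup, while `queue_work_locked` raises `RuntimeError`
because the wakeup is attempted with the lock still held.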

  [Test Case]

   * There's a synthetic test case to reproduce this problem
     (although without the stack traces - just a system hang)
     attached to this LP bug.

   * It uses kprobes/mdelay/cpu stopper calls to force the code
     to execute and force the timing/locking condition to occur.

   * $ sudo insmod kmod-stopper.ko

     Some dmesg logging occurs, and the system either hangs or does not.
     See the examples in the comments.
     
  [Regression Potential] 

   * These are patches to the CPU stop_machine.c code, and they
     change how it works a bit; however, there are no further
     upstream fixes for these patches, and they are still at the
     top of the 'git log --oneline -- kernel/stop_machine.c' output.

   * These patches have been verified with the synthetic test case
     and 'stress-ng --class scheduler --sequential 0' (no regressions)
     on a guest with 2 CPUs and on one physical system with 24 CPUs.

  [Other Info]
   
   * The patches are required on Xenial and later.
   * There are 4 patches for Xenial, and 2 patches pending for Bionic.
   * All patches are applied from Cosmic onwards.

  [Original Description]

  These 2 hard lockups appeared suddenly in the logs, and many
  soft lockups occurred after them as fallout.

      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.477086] NMI watchdog: Watchdog detected hard LOCKUP on cpu 10
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.483800] Modules linked in: <...>
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484066] CPU: 10 PID: 58 Comm: migration/10 Not tainted 4.4.0-116-generic #140~14.04.1-Ubuntu
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484068] Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 02/17/2017
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484070] task: ffff883ff2a76200 ti: ffff883ff2110000 task.ti: ffff883ff2110000
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484071] RIP: 0010:[<ffffffff810c8cb0>]  [<ffffffff810c8cb0>] native_queued_spin_lock_slowpath+0x160/0x170
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484079] RSP: 0000:ffff883ff2113c58  EFLAGS: 00000002
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484080] RAX: 0000000000000101 RBX: 0000000000000086 RCX: 0000000000000001
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484081] RDX: 0000000000000101 RSI: 0000000000000001 RDI: ffff881fff991ba8
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484083] RBP: ffff883ff2113c58 R08: 0000000000000101 R09: ffff883ff082e200
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484084] R10: 0000000000002e04 R11: 0000000000002e04 R12: ffff881fff997c60
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484085] R13: ffff881fff991ba8 R14: 0000000000000000 R15: ffff881fff997300
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484087] FS:  0000000000000000(0000) GS:ffff883fff000000(0000) knlGS:0000000000000000
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484088] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484090] CR2: 00007f7caaa23020 CR3: 0000001f46740000 CR4: 0000000000160670
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484091] Stack:
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484092]  ffff883ff2113c68 ffffffff811870eb ffff883ff2113c80 ffffffff81819907
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484094]  ffff881fff991ba0 ffff883ff2113cb0 ffffffff8111c600 ffff881fff997300
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484096]  ffff881fff997c90 ffff881ff03dd400 0000000000000000 ffff883ff2113cc0
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484098] Call Trace:
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484105]  [<ffffffff811870eb>] queued_spin_lock_slowpath+0xb/0xf
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484109]  [<ffffffff81819907>] _raw_spin_lock_irqsave+0x37/0x40
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484113]  [<ffffffff8111c600>] cpu_stop_queue_work+0x30/0x80
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484116]  [<ffffffff8111ccd0>] stop_one_cpu_nowait+0x30/0x40
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484119]  [<ffffffff810bbb5b>] load_balance+0x71b/0x940
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484122]  [<ffffffff810bbff5>] pick_next_task_fair+0x275/0x4b0
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484126]  [<ffffffff81816166>] __schedule+0x6c6/0x7f0
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484132]  [<ffffffff810a2560>] ? sort_range+0x30/0x30
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484134]  [<ffffffff818162c5>] schedule+0x35/0x80
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484136]  [<ffffffff810a262d>] smpboot_thread_fn+0xcd/0x180
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484139]  [<ffffffff8109f138>] kthread+0xd8/0xf0
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484141]  [<ffffffff8109f060>] ? kthread_park+0x60/0x60
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484143]  [<ffffffff81819ff5>] ret_from_fork+0x55/0x80
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484144]  [<ffffffff8109f060>] ? kthread_park+0x60/0x60

      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.644471] NMI watchdog: Watchdog detected hard LOCKUP on cpu 6
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651086] Modules linked in: <...>
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651342] CPU: 6 PID: 204932 Comm: ceph-osd Not tainted 4.4.0-116-generic #140~14.04.1-Ubuntu
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651344] Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 02/17/2017
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651345] task: ffff881ff03dd400 ti: ffff883cda77c000 task.ti: ffff883cda77c000
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651347] RIP: 0010:[<ffffffff810aacb6>]  [<ffffffff810aacb6>] try_to_wake_up+0x86/0x3f0
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651353] RSP: 0000:ffff883cda77fa78  EFLAGS: 00000002
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651354] RAX: 0000000000000001 RBX: ffff883ff2a76200 RCX: 0000000000000000
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651355] RDX: 0000000000000001 RSI: 0000000000000003 RDI: ffff883ff2a768d4
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651356] RBP: ffff883cda77fab8 R08: 000000000000000a R09: ffff881ff03dd400
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651357] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000017300
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651359] R13: ffff883ff2a768d4 R14: 0000000000000046 R15: 0000000000000000
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651360] FS:  00007ff8ecbc9700(0000) GS:ffff881fff980000(0000) knlGS:0000000000000000
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651362] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651363] CR2: 0000000014583550 CR3: 0000003d4ac96000 CR4: 0000000000160670
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651364] Stack:
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651365]  0000000000000202 ffff883cda77fa98 0000000000000003 0000000000000006
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651368]  000000000000000a ffff883cda77fb70 ffff883fff011ba0 ffff881fff991ba0
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651370]  ffff883cda77fac8 ffffffff810ab035 ffff883cda77fbc8 ffffffff8111cc22
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651372] Call Trace:
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651375]  [<ffffffff810ab035>] wake_up_process+0x15/0x20
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651379]  [<ffffffff8111cc22>] stop_two_cpus+0x1b2/0x230
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651382]  [<ffffffff8111c650>] ? cpu_stop_queue_work+0x80/0x80
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651384]  [<ffffffff810b5d15>] ? dequeue_entity+0x455/0x8a0
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651386]  [<ffffffff8111c650>] ? cpu_stop_queue_work+0x80/0x80
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651388]  [<ffffffff810aaa70>] ? __migrate_swap_task.part.83+0x80/0x80
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651390]  [<ffffffff810ab18e>] migrate_swap+0xae/0x130
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651392]  [<ffffffff810b4e44>] task_numa_migrate+0x504/0x930
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651394]  [<ffffffff810b52e9>] numa_migrate_preferred+0x79/0x80
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651396]  [<ffffffff810b9373>] task_numa_fault+0x923/0xcd0
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651400]  [<ffffffff8175e407>] ? tcp_recvmsg+0x6b7/0xbd0
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651404]  [<ffffffff811da9be>] ? mpol_misplaced+0x14e/0x190
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651408]  [<ffffffff811b7836>] handle_pte_fault+0x5a6/0x1440
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651411]  [<ffffffff816f6693>] ? sock_recvmsg+0x43/0x50
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651413]  [<ffffffff811b9540>] handle_mm_fault+0x250/0x540
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651417]  [<ffffffff81069e1a>] __do_page_fault+0x19a/0x430
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651419]  [<ffffffff8106a0d2>] do_page_fault+0x22/0x30
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651423]  [<ffffffff8181c5a8>] page_fault+0x28/0x30

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1821259/+subscriptions

