** Changed in: ubuntu-power-systems
       Status: Fix Committed => Fix Released

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1768898

Title:
  smp_call_function_single/many core hangs with stop4 alone

Status in The Ubuntu-power-systems project:
  Fix Released
Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Bionic:
  Fix Released

Bug description:
  == SRU Justification ==
  IBM reports that this bug occurs with stop4 alone and results in soft
  lockups/RCU stalls. This is a kernel synchronization issue leading to
  a deadlock.

  This bug was introduced by commit 7bc54b652f13 in v4.8-rc1.  This
  regression is fixed by mainline commit c0f7f5b6c6910.

  == Fix ==
  c0f7f5b6c6910 ("cpufreq: powernv: Fix hardlockup due to synchronous
  smp_call in timer interrupt")

  == Regression Potential ==
  Low. Fixes current regression.  Cc'd to upstream stable, so it has had
  additional upstream review.

  == Test Case ==
  A test kernel was built with this patch and tested by the original bug
  reporter. The bug reporter states the test kernel resolved the bug.


  Recently we discovered that this bug occurs with stop4 alone, which
  results in soft lockups/RCU stalls.

  ```
  root@ltc-boston125:~# [15523.619395] systemd[1]: systemd-journald.service: 
Processes still around after final SIGKILL. Entering failed mode.
  [15523.619508] systemd[1]: systemd-journald.service: Failed with result 
'timeout'.
  [15523.619769] systemd[1]: Failed to start Journal Service.
  [15523.620618] systemd[1]: systemd-journald.service: Service has no hold-off 
time, scheduling restart.
  [15523.620774] systemd[1]: systemd-journald.service: Scheduled restart job, 
restart counter is at 21.
  [15523.621462] systemd[1]: Stopped Journal Service.
  [15523.621635] systemd[1]: systemd-journald.service: Found left-over process 
1561 (systemd-journal) in control group while starting unit. Ignoring.
  [15523.621756] systemd[1]: This usually indicates unclean termination of a 
previous run, or service implementation deficiencies.
  [15523.621888] systemd[1]: systemd-journald.service: Found left-over process 
69060 (systemd-journal) in control group while starting unit. Ignoring.
  [15523.622029] systemd[1]: This usually indica
  [15541.629904] INFO: rcu_sched self-detected stall on CPU
  [15541.629958]        60-....: (2 GPs behind) idle=146/140000000000002/0 
softirq=300022/300022 fqs=999069
  [15541.630046]         (t=2415546 jiffies g=184827 c=184826 q=57111)
  [15541.630101] NMI backtrace for cpu 60
  [15541.630135] CPU: 60 PID: 4810 Comm: tlbie_test Tainted: G             L   
4.15.0-15-generic #16-Ubuntu
  [15541.630207] Call Trace:
  [15541.630232] [c000201a1da96b00] [c000000000ceb35c] dump_stack+0xb0/0xf4 
(unreliable)
  [15541.630298] [c000201a1da96b40] [c000000000cf4d48] 
nmi_cpu_backtrace+0x1f8/0x200
  [15541.630363] [c000201a1da96bd0] [c000000000cf4ee8] 
nmi_trigger_cpumask_backtrace+0x198/0x1f0
  [15541.630429] [c000201a1da96c60] [c00000000002f2d8] 
arch_trigger_cpumask_backtrace+0x28/0x40
  [15541.630495] [c000201a1da96c80] [c0000000001a913c] 
rcu_dump_cpu_stacks+0xf4/0x158
  [15541.630560] [c000201a1da96cd0] [c0000000001a81e8] 
rcu_check_callbacks+0x8e8/0xb40
  [15541.630625] [c000201a1da96e00] [c0000000001b64a8] 
update_process_times+0x48/0x90
  [15541.630689] [c000201a1da96e30] [c0000000001ce1f4] 
tick_sched_handle.isra.5+0x34/0xd0
  [15541.630753] [c000201a1da96e60] [c0000000001ce2f0] 
tick_sched_timer+0x60/0xe0
  [15541.630818] [c000201a1da96ea0] [c0000000001b7054] 
__hrtimer_run_queues+0x144/0x370
  [15541.630883] [c000201a1da96f20] [c0000000001b7fac] 
hrtimer_interrupt+0xfc/0x350
  [15541.630948] [c000201a1da96ff0] [c0000000000248f0] 
__timer_interrupt+0x90/0x260
  [15541.631013] [c000201a1da97040] [c000000000024d08] timer_interrupt+0x98/0xe0
  [15541.631069] [c000201a1da97070] [c000000000009014] 
decrementer_common+0x114/0x120
  [15541.631135] --- interrupt: 901 at smp_call_function_single+0x134/0x180
  [15541.631135]     LR = smp_call_function_single+0x110/0x180
  [15541.631230] [c000201a1da973d0] [c0000000001d55e0] 
smp_call_function_any+0x180/0x250
  [15541.631294] [c000201a1da97430] [c000000000acd3e8] 
gpstate_timer_handler+0x1e8/0x580
  [15541.631359] [c000201a1da974e0] [c0000000001b46b0] call_timer_fn+0x50/0x1c0
  [15541.631433] [c000201a1da97560] [c0000000001b4958] expire_timers+0x138/0x1f0
  [15541.631488] [c000201a1da975d0] [c0000000001b4bf8] 
run_timer_softirq+0x1e8/0x270
  [15541.631553] [c000201a1da97670] [c000000000d0d6c8] __do_softirq+0x158/0x3e4
  [15541.631608] [c000201a1da97750] [c000000000114be8] irq_exit+0xe8/0x120
  [15541.631663] [c000201a1da97770] [c000000000024d0c] timer_interrupt+0x9c/0xe0
  [15541.631718] [c000201a1da977a0] [c000000000009014] 
decrementer_common+0x114/0x120
  [15541.631784] --- interrupt: 901 at smp_call_function_many+0x330/0x450
  [15541.631784]     LR = smp_call_function_many+0x324/0x450
  [15541.631879] [c000201a1da97b00] [c000000000075f18] pmdp_invalidate+0x98/0xe0
  [15541.631935] [c000201a1da97b30] [c0000000003a1120] 
change_huge_pmd+0xe0/0x270
  [15541.632000] [c000201a1da97ba0] [c000000000349278] 
change_protection_range+0xb88/0xe40
  [15541.632065] [c000201a1da97cf0] [c0000000003496c0] 
mprotect_fixup+0x140/0x340
  [15541.632129] [c000201a1da97db0] [c000000000349a74] SyS_mprotect+0x1b4/0x350
  [15541.632185] [c000201a1da97e30] [c00000000000b184] system_call+0x58/0x6c
  [15579.001651] watchdog: BUG: soft lockup - CPU#52 stuck for 23s! [grep:69263]
  [15579.001738] Modules linked in: vhost_net vhost tap xt_CHECKSUM 
iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 
nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT 
nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter 
ip6_tables iptable_filter devlink input_leds joydev mac_hid idt_89hpesx ofpart 
cmdlinepart powernv_flash ipmi_powernv ipmi_devintf opal_prd mtd 
ipmi_msghandler ibmpowernv at24 uio_pdrv_genirq uio vmx_crypto kvm_hv kvm 
sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp 
libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_compress 
raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor 
raid6_pq libcrc32c raid1 raid0 multipath linear ses enclosure ast
  [15579.002363]  i2c_algo_bit hid_generic ttm drm_kms_helper mpt3sas 
syscopyarea sysfillrect usbhid sysimgblt fb_sys_fops hid raid_class 
crct10dif_vpmsum crc32c_vpmsum drm i40e aacraid scsi_transport_sas
  [15579.002524] CPU: 52 PID: 69263 Comm: grep Tainted: G             L   
4.15.0-15-generic #16-Ubuntu
  [15579.002598] NIP:  c0000000001d5368 LR: c0000000001d5340 CTR: 
c000000000acc7f0
  [15579.002664] REGS: c000003e84eff7e0 TRAP: 0901   Tainted: G             L   
 (4.15.0-15-generic)
  [15579.002735] MSR:  9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 48044222 
 XER: 00000000
  [15579.002810] CFAR: c01721ed8
  [15579.002810] GPR08: c000000001721ed8 0000000000000001 c009e006592e0960 
0000000000000000
  [15579.002810] GPR12: c000000000acc7f0 c00000000faa3c00
  [15579.003084] NIP [c0000000001d5368] smp_call_function_single+0x138/0x180
  [15579.003139] LR [c0000000001d5340] smp_call_function_single+0x110/0x180
  [15579.003191] Call Trace:
  [15579.003217] [c000003e84effa60] [c0000000001d5340] 
smp_call_function_single+0x110/0x180 (unreliable)
  [15579.003298] [c000003e84effad0] [c0000000001d55e0] 
smp_call_function_any+0x180/0x250
  [15579.003381] [c000003e84effb30] [c000000000acc840] 
powernv_cpufreq_get+0x50/0x70
  [15579.003447] [c000003e84effb60] [c000000000ac2b8c] __cpufreq_get+0x5c/0x140
  [15579.003503] [c000003e84effba0] [c000000000ac2d18] cpufreq_get+0xa8/0xb0
  [15579.003560] [c000003e84effbe0] [c00000000009da50] 
pnv_get_proc_freq+0x20/0x50
  [15579.003625] [c000003e84effc00] [c0000000000283bc] show_cpuinfo+0x11c/0x400
  [15579.003680] [c000003e84effca0] [c00000000040c738] seq_read+0x138/0x610
  [15579.003737] [c000003e84effd40] [c00000000047fa38] proc_reg_read+0x88/0xd0
  [15579.003794] [c000003e84effd70] [c0000000003d293c] __vfs_read+0x3c/0x70
  [15579.003849] [c000003e84effd90] [c0000000003d2a2c] vfs_read+0xbc/0x1b0
  [15579.003905] [c000003e84effde0] [c0000000003d3028] SyS_read+0x68/0x110
  [15579.003962] [c000003e84effe30] [c00000000000b184] system_call+0x58/0x6c
  [15579.004016] Instruction dump:
  [15579.004051] 7fe4fb78 4bfffd4d 813f0018 71290001 4182002c 48000014 60000000 
60000000
  [15579.004121] 60000000 60420000 7c210b78 7c421378 <813f0018> 71290001 
4082fff0 7c2004ac
  [15604.648202] INFO: rcu_sched self-detected stall on CPU
  [15604.648260]        60-....: (2 GPs behind) idle=146/140000000000002/0 
softirq=300022/300022 fqs=1005652
  [15604.648332]         (t=2431300 jiffies g=184827 c=184826 q=57308)
  [15604.648385] NMI backtrace for cpu 60
  [15604.648419] CPU: 60 PID: 4810 Comm: tlbie_test Tainted: G             L   
4.15.0-15-generic #16-Ubuntu
  [15604.648491] Call Trace:
  [15604.648515] [c000201a1da96b00] [c000000000ceb35c] dump_stack+0xb0/0xf4 
(unreliable)
  [15604.648581] [c000201a1da96b40] [c000000000cf4d48] 
nmi_cpu_backtrace+0x1f8/0x200
  [15604.648647] [c000201a1da96bd0] [c000000000cf4ee8] 
nmi_trigger_cpumask_backtrace+0x198/0x1f0
  [15604.648728] [c000201a1da96c60] [c00000000002f2d8] 
arch_trigger_cpumask_backtrace+0x28/0x40
  [15604.648793] [c000201a1da96c80] [c0000000001a913c] 
rcu_dump_cpu_stacks+0xf4/0x158
  [15604.648858] [c000201a1da96cd0] [c0000000001a81e8] 
rcu_check_callbacks+0x8e8/0xb40
  [15604.648924] [c000201a1da96e00] [c0000000001b64a8] 
update_process_times+0x48/0x90
  [15604.648988] [c000201a1da96e30] [c0000000001ce1f4] 
tick_sched_handle.isra.5+0x34/0xd0
  [15604.649052] [c000201a1da96e60] [c0000000001ce2f0] 
tick_sched_timer+0x60/0xe0
  [15604.649118] [c000201a1da96ea0] [c0000000001b7054] 
__hrtimer_run_queues+0x144/0x370
  [15604.649183] [c000201a1da96f20] [c0000000001b7fac] 
hrtimer_interrupt+0xfc/0x350
  [15604.649248] [c000201a1da96ff0] [c0000000000248f0] 
__timer_interrupt+0x90/0x260
  [15604.649313] [c000201a1da97040] [c000000000024d08] timer_interrupt+0x98/0xe0
  [15604.649369] [c000201a1da97070] [c000000000009014] 
decrementer_common+0x114/0x120
  [15604.649435] --- interrupt: 901 at smp_call_function_single+0x138/0x180
  [15604.649435]     LR = smp_call_function_single+0x110/0x180
  [15604.649530] [c000201a1da973d0] [c0000000001d55e0] 
smp_call_function_any+0x180/0x250
  [15604.649595] [c000201a1da97430] [c000000000acd3e8] 
gpstate_timer_handler+0x1e8/0x580
  [15604.649660] [c000201a1da974e0] [c0000000001b46b0] call_timer_fn+0x50/0x1c0
  [15604.649715] [c000201a1da97560] [c0000000001b4958] expire_timers+0x138/0x1f0
  [15604.649770] [c000201a1da975d0] [c0000000001b4bf8] 
run_timer_softirq+0x1e8/0x270
  [15604.649835] [c000201a1da97670] [c000000000d0d6c8] __do_softirq+0x158/0x3e4
  [15604.649891] [c000201a1da97750] [c000000000114be8] irq_exit+0xe8/0x120
  [15604.649946] [c000201a1da97770] [c000000000024d0c] timer_interrupt+0x9c/0xe0
  [15604.650002] [c000201a1da977a0] [c000000000009014] 
decrementer_common+0x114/0x120
  [15604.650084] --- interrupt: 901 at smp_call_function_many+0x330/0x450
  [15604.650084]     LR = smp_call_function_many+0x324/0x450
  [15604.650179] [c000201a1da97b00] [c000000000075f18] pmdp_invalidate+0x98/0xe0
  [15604.650235] [c000201a1da97b30] [c0000000003a1120] 
change_huge_pmd+0xe0/0x270
  [15604.650301] [c000201a1da97ba0] [c000000000349278] 
change_protection_range+0xb88/0xe40
  [15604.650366] [c000201a1da97cf0] [c0000000003496c0] 
mprotect_fixup+0x140/0x340
  [15604.650430] [c000201a1da97db0] [c000000000349a74] SyS_mprotect+0x1b4/0x350
  [15604.650486] [c000201a1da97e30] [c00000000000b184] system_call+0x58/0x6c
  [15667.666494] INFO: rcu_sched self-detected stall on CPU
  [15667.666550]        60-....: (2 GPs behind) idle=146/140000000000002/0 
softirq=300022/300022 fqs=1012258
  [15667.666622]         (t=2447054 jiffies g=184827 c=184826 q=57457)
  [15667.666675] NMI backtrace for cpu 60
  [15667.666709] CPU: 60 PID: 4810 Comm: tlbie_test Tainted: G             L   
4.15.0-15-generic #16-Ubuntu
  [15667.666781] Call Trace:
  [15667.666805] [c000201a1da96b00] [c000000000ceb35c] dump_stack+0xb0/0xf4 
(unreliable)
  [15667.666871] [c000201a1da96b40] [c000000000cf4d48] 
nmi_cpu_backtrace+0x1f8/0x200
  [15667.666937] [c000201a1da96bd0] [c000000000cf4ee8] 
nmi_trigger_cpumask_backtrace+0x198/0x1f0
  [15667.667002] [c000201a1da96c60] [c00000000002f2d8] 
arch_trigger_cpumask_backtrace+0x28/0x40
  [15667.667086] [c000201a1da96c80] [c0000000001a913c] 
rcu_dump_cpu_stacks+0xf4/0x158
  [15667.667151] [c000201a1da96cd0] [c0000000001a81e8] 
rcu_check_callbacks+0x8e8/0xb40
  [15667.667216] [c000201a1da96e00] [c0000000001b64a8] 
update_process_times+0x48/0x90
  [15667.667280] [c000201a1da96e30] [c0000000001ce1f4] 
tick_sched_handle.isra.5+0x34/0xd0
  [15667.667344] [c000201a1da96e60] [c0000000001ce2f0] 
tick_sched_timer+0x60/0xe0
  [15667.667409] [c000201a1da96ea0] [c0000000001b7054] 
__hrtimer_run_queues+0x144/0x370
  [15667.667474] [c000201a1da96f20] [c0000000001b7fac] 
hrtimer_interrupt+0xfc/0x350
  [15667.667539] [c000201a1da96ff0] [c0000000000248f0] 
__timer_interrupt+0x90/0x260
  [15667.667604] [c000201a1da97040] [c000000000024d08] timer_interrupt+0x98/0xe0
  [15667.667660] [c000201a1da97070] [c000000000009014] 
decrementer_common+0x114/0x120
  [15667.667727] --- interrupt: 901 at smp_call_function_single+0x130/0x180
  [15667.667727]     LR = smp_call_function_single+0x110/0x180
  [15667.667821] [c000201a1da973d0] [c0000000001d55e0] 
smp_call_function_any+0x180/0x250
  [15667.667886] [c000201a1da97430] [c000000000acd3e8] 
gpstate_timer_handler+0x1e8/0x580
  [15667.667951] [c000201a1da974e0] [c0000000001b46b0] call_timer_fn+0x50/0x1c0
  [15667.668006] [c000201a1da97560] [c0000000001b4958] expire_timers+0x138/0x1f0
  [15667.668061] [c000201a1da975d0] [c0000000001b4bf8] 
run_timer_softirq+0x1e8/0x270
  [15667.668126] [c000201a1da97670] [c000000000d0d6c8] __do_softirq+0x158/0x3e4
  [15667.668181] [c000201a1da97750] [c000000000114be8] irq_exit+0xe8/0x120
  [15667.668236] [c000201a1da97770] [c000000000024d0c] timer_interrupt+0x9c/0xe0
  [15667.668292] [c000201a1da977a0] [c000000000009014] 
decrementer_common+0x114/0x120
  [15667.668358] --- interrupt: 901 at smp_call_function_many+0x330/0x450
  [15667.668358]     LR = smp_call_function_many+0x324/0x450
  [15667.668469] [c000201a1da97b00] [c000000000075f18] pmdp_invalidate+0x98/0xe0
  [15667.668524] [c000201a1da97b30] [c0000000003a1120] 
change_huge_pmd+0xe0/0x270
  [15667.668589] [c000201a1da97ba0] [c000000000349278] 
change_protection_range+0xb88/0xe40
  [15667.668654] [c000201a1da97cf0] [c0000000003496c0] 
mprotect_fixup+0x140/0x340
  [15667.668719] [c000201a1da97db0] [c000000000349a74] SyS_mprotect+0x1b4/0x350
  [15667.668775] [c000201a1da97e30] [c00000000000b184] system_call+0x58/0x6c

  ```
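
  The mailer hard-wrapped the trace above, so most frames have their
  address on one line and their symbol on the next. A minimal Python
  sketch for recovering the call chain from such wrapped text follows.
  The helper name call_chain and the regular expression are illustrative
  assumptions, not part of any kernel or Launchpad tooling.

```python
import re

# Matches "symbol+0xoffset/0xsize" tokens as printed in the trace above.
SYMBOL_RE = re.compile(r"([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+")


def call_chain(wrapped_text):
    """Return the symbol names of a hard-wrapped trace, in printed order."""
    # Symbol tokens contain no spaces, so joining the wrapped lines with a
    # space is enough to make every token contiguous again.
    joined = " ".join(line.strip() for line in wrapped_text.splitlines())
    return [match.group(1) for match in SYMBOL_RE.finditer(joined)]


# Three frames copied from the CPU 52 soft-lockup trace, wrapping included.
SAMPLE = """\
[15579.003217] [c000003e84effa60] [c0000000001d5340]
smp_call_function_single+0x110/0x180 (unreliable)
[15579.003298] [c000003e84effad0] [c0000000001d55e0]
smp_call_function_any+0x180/0x250
[15579.003381] [c000003e84effb30] [c000000000acc840]
powernv_cpufreq_get+0x50/0x70
"""

if __name__ == "__main__":
    # Frames are printed innermost first, so read "<-" as "called from".
    print(" <- ".join(call_chain(SAMPLE)))
```

  Note that the same pattern also matches the "--- interrupt: 901 at"
  and "LR =" lines, so on a full trace those symbols appear in the
  result alongside the ordinary stack frames.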

  Per feedback from Vaidy, this currently does not appear to be a
  firmware problem. It seems to be a kernel synchronization issue
  leading to a deadlock.
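
  One general way synchronous cross-CPU calls can deadlock is a wait
  cycle: each caller spins until its target has run the requested
  function, and the target is itself spinning on a call of its own. The
  report does not state that this is the exact mechanism here, so the
  Python sketch below only models that general pattern. The function
  name find_wait_cycle and the CPU pairing are hypothetical; the log
  shows CPUs 60 and 52 in smp_call_function_single but not which CPU
  each call targeted.

```python
def find_wait_cycle(waits_on):
    """Return the CPUs forming a wait cycle, or None if there is none.

    waits_on maps a CPU number to the single CPU it is synchronously
    waiting for; CPUs that are not waiting are absent from the mapping.
    """
    for start in waits_on:
        path = []
        cpu = start
        # Follow the chain of waiters until it ends or revisits a CPU.
        while cpu in waits_on and cpu not in path:
            path.append(cpu)
            cpu = waits_on[cpu]
        if cpu in path:
            return path[path.index(cpu):]
    return None


# HYPOTHETICAL targets: the log shows CPUs 60 and 52 spinning in
# smp_call_function_single, but not which CPU each call was aimed at.
HYPOTHETICAL_WAITS = {60: 52, 52: 60}

if __name__ == "__main__":
    print(find_wait_cycle(HYPOTHETICAL_WAITS))
```

  Removing any one edge from the cycle lets the remaining waiters
  finish, which is why avoiding the synchronous call in the timer path
  is enough to break this kind of dependency.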

  -------
  Fix identified by Shilpa as per Nick Piggin's recommendation.  Kernel
  fix is currently being tested.

  -------
  Fix upstream in 4.17-rc3

  
  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v4.17-rc3&id=c0f7f5b6c69107ca92909512533e70258ee19188
  cpufreq: powernv: Fix hardlockup due to synchronous smp_call in timer
  interrupt

  Posted to stable as well.

  Mirroring to Launchpad for Canonical to pull in commit.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-power-systems/+bug/1768898/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
