------- Comment From sarah.han...@us.ibm.com 2016-10-28 11:23 EDT-------
What's the next step for this bug?

------- Comment From s...@us.ibm.com 2016-10-28 11:28 EDT-------
Thank you.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1632471

Title:
  Briggs: ISST-LTE:UbuntuKVM: system hangs, console shows all of these
  messages  raw_spin_lock_irqsave, rcu_sched, rcu_bh detected stalls on
  CPUs #166

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Yakkety:
  Fix Released

Bug description:
  == Comment: #0 - Chanh H. Nguyen <chngu...@us.ibm.com> - 2016-08-15 11:56:22 
==
  We have the SuperMicro system "Briggs" that have 2 Qlogic cards on it  (16Gb 
and 8Gb) connect to the V7000 storage. 
  We install Ubuntu 16.04.1 kernel -34 on it and run test BASE/IO over weekend 
on the bare metal mode . System seems to hang now, it responses to ping only, 
nothing else. 
  Open the console, it doesn't give out the login promt except all of these 
errors print out .......
  [323294.686996] INFO: rcu_bh detected stalls on CPUs/tasks:
  [323294.687105]               3-...: (1 GPs behind) 
idle=813/140000000000000/0 softirq=9880043/9935887 fqs=48656494
  [323294.687187]               71-...: (1 GPs behind) 
idle=a99/140000000000000/0 softirq=9875483/9935684 fqs=48656494
  [323294.687268]               153-...: (1 GPs behind) 
idle=321/140000000000000/0 softirq=3948104/3976836 fqs=48656494
  [323294.687346]               (detected by 75, t=48672447 jiffies, g=79, 
c=78, q=706)
  [323294.687427] Task dump for CPU 3:
  [323294.687462] gmond           R  running task        0 16394      1 
0x00040004
  [323294.687542] Call Trace:
  [323294.687568] [c000003fb3bdb700] [c000003fb3bdb740] 0xc000003fb3bdb740 
(unreliable)
  [323294.687648] [c000003fb3bdb8d0] [c000003fec8de900] 0xc000003fec8de900
  [323294.687716] [c000003fb3bdb920] [0000000000000020] 0x20
  [323294.687773] [c000003fb3bdb960] [c000003fb3bdb9b0] 0xc000003fb3bdb9b0
  [323294.687842] [c000003fb3bdb9b0] [c0000000000f9d14] 
task_sched_runtime+0xc4/0x250
  [323294.687922] [c000003fb3bdba20] [c0000000001003d8] 
thread_group_cputime+0xc8/0x1c0
  [323294.688002] [c000003fb3bdba80] [c000000000100720] 
thread_group_cputime_adjusted+0x40/0x80
  [323294.688082] [c000003fb3bdbae0] [c000000000385428] do_task_stat+0x2b8/0xaf0
  [323294.688152] [c000003fb3bdbc60] [c00000000037c710] 
proc_single_show+0x80/0x100
  [323294.688232] [c000003fb3bdbcb0] [c000000000317808] seq_read+0xe8/0x530
  [323294.688302] [c000003fb3bdbd50] [c0000000002e170c] __vfs_read+0x6c/0xe0
  [323294.688412] [c000003fb3bdbd90] [c0000000002e2464] vfs_read+0xa4/0x1c0
  [323294.688481] [c000003fb3bdbde0] [c0000000002e356c] SyS_read+0x6c/0x110
  [323294.688550] [c000003fb3bdbe30] [c000000000009204] system_call+0x38/0xb4
  [323294.688618] Task dump for CPU 71:
  [323294.688653] fs_inod         R  running task        0 151173  78575 
0x00040004
  [323294.688735] Call Trace:
  [323294.688760] [c000003fbcd17890] [c000005cc0adee00] 0xc000005cc0adee00 
(unreliable)
  [323294.688840] Task dump for CPU 153:
  [323294.688876] fs_inod         D 00003fffafd1efb8     0 151172  79296 
0x00040004
  [323294.688956] Call Trace:
  [323294.688981] [c000005a398675e0] [00000000004352ca] 0x4352ca (unreliable)
  [323294.689049] [c000005a398677b0] [c000003feee5e900] 0xc000003feee5e900
  [323294.689118] [c000005a39867800] [c000000000ae8604] __schedule+0x314/0x990
  [323294.689187] [c000005a398678d0] [c000000000ae8cc8] schedule+0x48/0xc0
  [323294.689257] [c000005a39867900] [c000000000aecc08] 
rwsem_down_write_failed+0x2b8/0x410
  [323294.689337] [c000005a39867990] [c000000000aebdf8] down_write+0x78/0xa0
  [323294.689407] [c000005a398679c0] [c0000000002797bc] 
unlink_file_vma+0x4c/0x90
  [323294.689475] [c000005a39867a00] [c000000000270110] 
free_pgtables+0x100/0x1c0
  [323294.689543] [c000005a39867a50] [c00000000027d4e0] exit_mmap+0x100/0x1a0
  [323294.689612] [c000005a39867b10] [c0000000000b5098] mmput+0xa8/0x1b0
  [323294.689681] [c000005a39867b40] [c0000000002ec740] 
flush_old_exec+0x680/0x840
  [323294.689792] [c000005a39867bb0] [c00000000036412c] 
load_elf_binary+0x43c/0xf90
  [323294.689872] [c000005a39867c90] [c0000000002ecbb8] 
search_binary_handler+0xe8/0x2a0
  [323294.689951] [c000005a39867d20] [c0000000002edad0] 
do_execveat_common.isra.14+0x600/0x850
  [323294.690031] [c000005a39867df0] [c0000000002ee13c] SyS_execve+0x5c/0x80
  [323294.690098] [c000005a39867e30] [c000000000009204] system_call+0x38/0xb4
  [323294.926994] INFO: rcu_sched self-detected stall on CPU
  [323294.927058]               113-...: (48720536 ticks this GP) 
idle=df7/140000000000001/0 softirq=3730603/3730614 fqs=48343855
  [323294.927138]                (t=48734941 jiffies g=6962468 c=6962467 
q=43612779)
  [323294.927210] Task dump for CPU 71:
  [323294.927241] fs_inod         R  running task        0 151173  78575 
0x00040004
  [323294.927312] Call Trace:
  [323294.927333] [c000003fbcd17890] [c000005cc0adee00] 0xc000005cc0adee00 
(unreliable)
  [323294.927405] Task dump for CPU 113:
  [323294.927436] fs_inod         R  running task        0 46071 148792 
0x00042004
  [323294.927507] Call Trace:
  [323294.927528] [c000003fcb0134b0] [c0000000000fbcc0] 
sched_show_task+0xe0/0x180 (unreliable)
  [323294.927599] [c000003fcb013520] [c00000000013ec34] 
rcu_dump_cpu_stacks+0xe4/0x150
  [323294.927669] [c000003fcb013570] [c000000000144364] 
rcu_check_callbacks+0x6b4/0x9b0
  [323294.927739] [c000003fcb0136a0] [c00000000014c1c8] 
update_process_times+0x58/0xa0
  [323294.927809] [c000003fcb0136d0] [c0000000001638d8] 
tick_sched_handle.isra.6+0x48/0xe0
  [323294.927880] [c000003fcb013710] [c0000000001639d4] 
tick_sched_timer+0x64/0xd0
  [323294.927940] [c000003fcb013750] [c00000000014cc94] 
__hrtimer_run_queues+0x124/0x450
  [323294.928011] [c000003fcb0137e0] [c00000000014dcbc] 
hrtimer_interrupt+0xec/0x2c0
  [323294.928082] [c000003fcb0138a0] [c00000000001f5bc] 
__timer_interrupt+0x8c/0x290
  [323294.928152] [c000003fcb0138f0] [c00000000001f970] 
timer_interrupt+0xa0/0xe0
  [323294.928253] [c000003fcb013920] [c000000000002714] 
decrementer_common+0x114/0x180
  [323294.928324] --- interrupt: 901 at _raw_spin_lock_irqsave+0xa4/0x130
  [323294.928324]     LR = _raw_spin_lock_irqsave+0x9c/0x130
  [323294.928413] [c000003fcb013c10] [c000007942f936b0] 0xc000007942f936b0 
(unreliable)
  [323294.928484] [c000003fcb013c50] [c000000000124dbc] rwsem_wake+0xcc/0x110
  [323294.928545] [c000003fcb013c90] [c000000000121e58] up_write+0x78/0x90
  [323294.928604] [c000003fcb013cc0] [c0000000000b6f4c] 
copy_process.isra.5+0x152c/0x1870
  [323294.928675] [c000003fcb013da0] [c0000000000b7468] _do_fork+0xa8/0x4b0
  [323294.928736] [c000003fcb013e30] [c000000000009460] ppc_clone+0x8/0xc
  [323294.928799] Task dump for CPU 153:
  [323294.928829] fs_inod         D 00003fffafd1efb8     0 151172  79296 
0x00040004
  [323294.928901] Call Trace:
  [323294.928922] [c000005a398675e0] [00000000004352ca] 0x4352ca (unreliable)
  [323294.928982] [c000005a398677b0] [c000003feee5e900] 0xc000003feee5e900
  [323294.929043] [c000005a39867800] [c000000000ae8604] __schedule+0x314/0x990
  [323294.929103] [c000005a398678d0] [c000000000ae8cc8] schedule+0x48/0xc0
  [323294.929165] [c000005a39867900] [c000000000aecc08] 
rwsem_down_write_failed+0x2b8/0x410
  [323294.929236] [c000005a39867990] [c000000000aebdf8] down_write+0x78/0xa0
  [323294.929298] [c000005a398679c0] [c0000000002797bc] 
unlink_file_vma+0x4c/0x90
  [323294.929358] [c000005a39867a00] [c000000000270110] 
free_pgtables+0x100/0x1c0
  [323294.929419] [c000005a39867a50] [c00000000027d4e0] exit_mmap+0x100/0x1a0
  [323294.929478] [c000005a39867b10] [c0000000000b5098] mmput+0xa8/0x1b0
  [323294.929540] [c000005a39867b40] [c0000000002ec740] 
flush_old_exec+0x680/0x840
  [323294.929638] [c000005a39867bb0] [c00000000036412c] 
load_elf_binary+0x43c/0xf90
  [323294.929709] [c000005a39867c90] [c0000000002ecbb8] 
search_binary_handler+0xe8/0x2a0
  [323294.929779] [c000005a39867d20] [c0000000002edad0] 
do_execveat_common.isra.14+0x600/0x850
  [323294.929848] [c000005a39867df0] [c0000000002ee13c] SyS_execve+0x5c/0x80
  [323294.931065] [c000005a39867e30] [c000000000009204] 
system_call+0x38/0xb4[323294.934998] INFO: rcu_sched detected stalls on 
CPUs/tasks:
  [323294.935005]               71-...: (4 GPs behind) 
idle=a99/140000000000000/0 softirq=9935677/9935684 fqs=48343856

  [323294.953222]
  [323294.954627]               113-...: (48720536 ticks this GP) 
idle=df7/140000000000001/0 softirq=3730603/3730614 fqs=48343856
  [323294.964332]               153-...: (2 GPs behind) 
idle=321/140000000000000/0 softirq=3976834/3976836 fqs=48343857
  [323294.974024]               (detected by 17, t=48734953 jiffies, g=6962468, 
c=6962467, q=43612782)
  [323294.980967] Task dump for CPU 71:
  [323294.985119] fs_inod         R  running task        0 151173  78575 
0x00040004
  [323294.992058] Call Trace:
  [323294.994823] [c000003fbcd17890] [c000005cc0adee00] 0xc000005cc0adee00 
(unreliable)
  [323295.001767] Task dump for CPU 113:
  [323295.005916] fs_inod         R  running task        0 46071 148792 
0x00042004
  [323295.012851] Call Trace:
  [323295.015627] [c000003fcb013a10] [0000000010160000] 0x10160000 (unreliable)
  [323295.022572] Task dump for CPU 153:
  [323295.025342] fs_inod         D 00003fffafd1efb8     0 151172  79296 
0x00040004
  [323295.032259] Call Trace:
  [323295.035044] [c000005a398675e0] [00000000004352ca] 0x4352ca (unreliable)
  [323295.041977] [c000005a398677b0] [c000003feee5e900] 0xc000003feee5e900
  [323295.048921] [c000005a39867800] [c000000000ae8604] __schedule+0x314/0x990
  [323295.055839] [c000005a398678d0] [c000000000ae8cc8] schedule+0x48/0xc0
  [323295.061404] [c000005a39867900] [c000000000aecc08] 
rwsem_down_write_failed+0x2b8/0x410
  [323295.069715] [c000005a39867990] [c000000000aebdf8] down_write+0x78/0xa0
  [323295.076640] [c000005a398679c0] [c0000000002797bc] 
unlink_file_vma+0x4c/0x90
  [323295.083584] [c000005a39867a00] [c000000000270110] 
free_pgtables+0x100/0x1c0
  [323295.090520] [c000005a39867a50] [c00000000027d4e0] exit_mmap+0x100/0x1a0
  [323295.097434] [c000005a39867b10] [c0000000000b5098] mmput+0xa8/0x1b0
  [323295.103006] [c000005a39867b40] [c0000000002ec740] 
flush_old_exec+0x680/0x840
  [323295.109929] [c000005a39867bb0] [c00000000036412c] 
load_elf_binary+0x43c/0xf90
  [323295.116877] [c000005a39867c90] [c0000000002ecbb8] 
search_binary_handler+0xe8/0x2a0
  [323295.125178] [c000005a39867d20] [c0000000002edad0] 
do_execveat_common.isra.14+0x600/0x850
  [323295.133501] [c000005a39867df0] [c0000000002ee13c] SyS_execve+0x5c/0x80
  [323295.139059] [c000005a39867e30] [c000000000009204] system_call+0x38/0xb4
  [323357.706980] INFO: rcu_bh detected stalls on CPUs/tasks:

  == Comment: #16 - Fei Fei Xu <feifei...@cn.ibm.com> - 2016-08-24 01:01:19 ==
  Some findings from console log:
  1) rcu_bh detected stalls were detected on CPU 3, CPU71, CPU113 and CPU 153. 

  [323294.686996] INFO: rcu_bh detected stalls on CPUs/tasks:
  [323294.687105]               3-...: (1 GPs behind) 
idle=813/140000000000000/0 softirq=9880043/9935887 fqs=48656494
  [323294.687187]               71-...: (1 GPs behind) 
idle=a99/140000000000000/0 softirq=9875483/9935684 fqs=48656494
  [323294.687268]               153-...: (1 GPs behind) 
idle=321/140000000000000/0 softirq=3948104/3976836 fqs=48656494

  [345225.881576]         113-...: (54201466 ticks this GP)
  idle=df7/140000000000001/0 softirq=3730603/3730614 fqs=53769037

  The softirq number behind the "/" stays constant across repeated
  stall-warning messages(9935887, 9935684, 3976836, 3730614). This
  happen when the stalled CPU is spinning with interrupts are disabled.
  Which from the log, CPUs might be spinning in a spin lock.

  From the task dump:

  [321971.602704] Task dump for CPU 153:
  [321971.605481] fs_inod         D 00003fffafd1efb8     0 151172  79296 
0x00040004

  task "fs_inod" (should be a task in BASE/IO test case) running on 153
  is waiting for something.  This might be the root task causing the
  locks on CPUs.

  Notice a warning :
  nfs: server 10.33.11.31 not responding, timed out

  Seems the BASE/IO tests run the cases mounting through NFS. fs_inod
  might be waiting for someting read/write from/to a non-responding NFS
  server. And rcu stall happens.

  So can we make sure the networking to the NFS server is stable, and
  then check if the issue can be reproduced?

  == Comment: #20 - Paul Mackerras <p...@au1.ibm.com> - 2016-08-31 00:30:36 ==
  Seeing some CPUs stuck in rwsem_wake() and others stuck in schedule() from 
down_write() makes me wonder if this could be the same issue as that described 
in https://lkml.org/lkml/2016/8/30/114 (which includes a proposed fix).

  == Comment: #25 - Balbir Singh <balb...@au1.ibm.com> - 2016-09-13 20:48:20 ==
  (In reply to comment #23)

  A pull request was sent out yesterday for 4.8
  (https://lkml.org/lkml/2016/9/13/589). Once that is done, we can
  request any backport. The backport is straight forward, I presume a
  mirroring of this bug with a request  will do the needful?

  == Comment: #31 - Leonardo Augusto Guimaraes Garcia <lagar...@br.ibm.com> - 
2016-10-11 10:56:36 ==
  What's the next step here? Should we mirror this bug so that Canonical gets 
the fix?

  == Comment: #33 - Scott E. Garfinkle <s...@us.ibm.com> - 2016-10-11 12:14:24 
==
  Yes, this needs to get mirrored to Canonical 16.04 "kernel" component. I made 
an effort to set the correct comments to be mirrored or not.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1632471/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp

Reply via email to