------- Comment From boqun.f...@cn.ibm.com 2017-02-06 03:07 EDT-------
(In reply to comment #71)
> Hi,
>
> Ran runltp on 17.04, and ended up with continuous call traces on console.
>
> ---uname output---
> Linux ltc-garri1 4.9.0-15-generic #16-Ubuntu SMP Fri Jan 20 15:28:49 UTC
> 2017 ppc64le ppc64le ppc64le GNU/Linux
>
> ---Call Trace---
> [332850.158228] Call Trace:
> [332850.158253] [c000001fcec4fb50] [c00000000001c3a0] __switch_to+0x2e0/0x4c0
> [332850.159375] [c000001fcec4fbb0] [c000000000b193a0] __schedule+0x300/0x990
> [332850.159872] [c000001fcec4fc90] [c000000000b19e88] _cond_resched+0x78/0xa0
> [332850.160475] [c000001fcec4fcc0] [d00000001c810278]
> torture_rwlock_write_lock_irq+0x30/0x60 [locktorture]

This one is weird, because torture_rwlock_write_lock_irq will acquire a
lock with irq disabled, so it will not get any change to reschedule..

And the asm code for
/lib/modules/4.9.0-15-generic/kernel/kernel/locking/locktorture.ko:

0000000000000248 <torture_rwlock_write_lock_irq>:
248:       f8 ff 4c e8     ld      r2,-8(r12)
248: R_PPC64_ENTRY      *ABS*
24c:       14 62 42 7c     add     r2,r2,r12
250:       a6 02 08 7c     mflr    r0
254:       01 00 00 48     bl      254 <torture_rwlock_write_lock_irq+0xc>
254: R_PPC64_REL24      _mcount
258:       a6 02 08 7c     mflr    r0
25c:       f8 ff e1 fb     std     r31,-8(r1)
260:       10 00 01 f8     std     r0,16(r1)
264:       d1 ff 21 f8     stdu    r1,-48(r1)
268:       00 00 22 3d     addis   r9,r2,0
268: R_PPC64_TOC16_HA   .toc
26c:       00 00 e9 eb     ld      r31,0(r9)
26c: R_PPC64_TOC16_LO_DS        .toc
270:       30 00 7f 38     addi    r3,r31,48
274:       01 00 00 48     bl      274 <torture_rwlock_write_lock_irq+0x2c>
274: R_PPC64_REL24      _raw_write_lock_irqsave
278:       00 00 00 60     nop
27c:       18 00 3f e9     ld      r9,24(r31)
280:       40 00 69 f8     std     r3,64(r9)
284:       00 00 60 38     li      r3,0
288:       30 00 21 38     addi    r1,r1,48

so torture_rwlock_write_lock_irq+0x30/0x60 is

278:       00 00 00 60     nop

right after the return of call to _raw_write_lock_irqsave, which should
be fixed up by the instruction that recover the toc, IIUC. How that end
up being a call to _cond_resched() is still unknown.

Besides, Pooja, I cannot reproduce this in a PowerKVM guest with 16
VCPUs. What's your setup for the machine that hit the problem? And how
many times you need to run the testcase in order to hit one?

Regards,
Boqun

> [332850.161157] [c000001fcec4fcf0] [d00000001c810988]
> lock_torture_writer+0xc0/0x270 [locktorture]
> [332850.170046] [c000001fcec4fd80] [c0000000000f53c0] kthread+0x110/0x130
> [332850.173108] [c000001fcec4fe30] [c00000000000c0e8]
> ret_from_kernel_thread+0x5c/0x74
> [332850.174169] Task dump for CPU 10:
> [332850.174202] lock_torture_wr R  running task        0 37019      2
> 0x00000804
> [332850.174734] Call Trace:
> [332850.175218] [c000001fcf3ab980] [c0000000000d0008] irq_exit+0xe8/0x120
> (unreliable)
> [332850.175813] [c000001fcf3ab9a0] [c000000000024e10]
> timer_interrupt+0xa0/0xe0
> [332850.176438] --- interrupt: d00000001c6008c0 at
> finish_task_switch+0xb0/0x340
> [332850.176438]     LR = stutter_wait+0x8c/0x130 [torture]
> [332850.177950] [c000001fcf3ab9d0] [c000000000009514]
> decrementer_common+0x114/0x180 (unreliable)
> [332850.178517] --- interrupt: 901 at torture_shutdown_init+0x2a4/0x770
> [torture]
> [332850.178517]     LR = stutter_wait+0x8c/0x130 [torture]
> [332850.179557] [c000001fcf3abcc0] [d00000001c810278]
> torture_rwlock_write_lock_irq+0x30/0x60 [locktorture] (unreliable)
> [332850.180675] [c000001fcf3abcf0] [d00000001c810988]
> lock_torture_writer+0xc0/0x270 [locktorture]
> [332850.181646] [c000001fcf3abd80] [c0000000000f53c0] kthread+0x110/0x130
> [332850.194048] [c000001fcf3abe30] [c00000000000c0e8]
> ret_from_kernel_thread+0x5c/0x74
> [332850.196658] Task dump for CPU 12:
> [332850.198840] lock_torture_wr R  running task        0 36921      2
> 0x00000804
> [332850.201377] Call Trace:
> [332850.201402] [c000001fc3747890] [c00000000016127c]
> hrtimer_interrupt+0x11c/0x2a0 (unreliable)
> [332850.206296] [c000001fc3747a60] [c00000000001c3a0] __switch_to+0x2e0/0x4c0
> [332850.208647] [c000001fc3747ac0] [c000000000b193a0] __schedule+0x300/0x990
> [332850.211007] [c000001fc3747b00] [c00000000015bc78]
> lock_timer_base+0xa8/0x100
> [332850.214641] [c000001fc3747b60] [c00000000015bde8]
> try_to_del_timer_sync+0x68/0xa0
> [332850.215236] [c000001fc3747ba0] [c00000000015be90]
> del_timer_sync+0x70/0x90
> [332850.215817] [c000001fc3747bd0] [c000000000b1de14]
> schedule_timeout+0x234/0x470
> [332850.216369] [c000001fc3747cc0] [d00000001c810278]
> torture_rwlock_write_lock_irq+0x30/0x60 [locktorture]
> [332850.217397] [c000001fc3747cf0] [d00000001c810988]
> lock_torture_writer+0xc0/0x270 [locktorture]
> [332850.218433] [c000001fc3747d80] [c0000000000f53c0] kthread+0x110/0x130
> [332850.219013] [c000001fc3747e30] [c00000000000c0e8]
> ret_from_kernel_thread+0x5c/0x74
>
>
> Thanks,
> Pooja

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1483343

Title:
  NMI watchdog: BUG: soft lockup errors when we execute lock_torture_wr
  tests

Status in linux package in Ubuntu:
  Incomplete
Status in linux-lts-xenial package in Ubuntu:
  Incomplete
Status in linux source package in Vivid:
  Incomplete

Bug description:
  ---Problem Description---
  NMI watchdog: BUG: soft lockup errors when we execute lock_torture_wr tests
    
  ---uname output---
  Linux alp15 3.19.0-18-generic #18~14.04.1-Ubuntu SMP Wed May 20 09:40:36 UTC 
2015 ppc64le ppc64le ppc64le GNU/Linux
   
  Machine Type = P8 
    
  ---Steps to Reproduce---
  Install a P8 Power VM LPAR with Ubuntu 14.04.2 ISO.
  Then install the Ubuntu 14.04.3 kernel on the same and reboot.
  Then compile and build the LTP latest test suites on the same.

  root@alp15:~# tar -xvf ltp-full-20150420.tar.bz2
  root@alp15:~# cd ltp-full-20150420/
  root@alp15:~/ltp-full-20150420# ls
  aclocal.m4      configure     execltp.in  install-sh  Makefile          
README                runltplite.sh    testcases    utils
  autom4te.cache  configure.ac  IDcheck.sh  lib         Makefile.release  
README.kernel_config  runtest          testscripts  ver_linux
  config.guess    COPYING       include     ltpmenu     missing           
runalltests.sh        scenario_groups  TODO         VERSION
  config.sub      doc           INSTALL     m4          pan               
runltp                scripts          tools
  root@alp15:~/ltp-full-20150420# ./configure
  root@alp15:~/ltp-full-20150420# make
  root@alp15:~/ltp-full-20150420# make install

  root@alp15:/opt/ltp/testcases/bin# ./lock_torture.sh
  lock_torture 1 TINFO : estimate time 6.00 min
  lock_torture 1 TINFO : spin_lock: running 60 sec...

  Message from syslogd@alp15 at Thu Jun 18 01:23:32 2015 ...
  alp15 vmunix: [  308.034386] NMI watchdog: BUG: soft lockup - CPU#10 stuck 
for 21s! [lock_torture_wr:2337]

  Message from syslogd@alp15 at Thu Jun 18 01:23:32 2015 ...
  alp15 vmunix: [  308.034389] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 
22s! [lock_torture_wr:2331]

  Message from syslogd@alp15 at Thu Jun 18 01:23:32 2015 ...
  alp15 vmunix: [  308.034394] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 
22s! [lock_torture_wr:2339]

  Message from syslogd@alp15 at Thu Jun 18 01:23:32 2015 ...
  alp15 vmunix: [  308.034396] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 
22s! [lock_torture_wr:2346]

  Message from syslogd@alp15 at Thu Jun 18 01:23:32 2015 ...
  alp15 vmunix: [  308.034398] NMI watchdog: BUG: soft lockup - CPU#7 stuck for 
21s! [lock_torture_wr:2334]

  Message from syslogd@alp15 at Thu Jun 18 01:23:32 2015 ...
  alp15 vmunix: [  308.034410] NMI watchdog: BUG: soft lockup - CPU#11 stuck 
for 22s! [lock_torture_wr:2321]

  Message from syslogd@alp15 at Thu Jun 18 01:23:32 2015 ...
  alp15 vmunix: [  308.034412] NMI watchdog: BUG: soft lockup - CPU#9 stuck for 
22s! [lock_torture_wr:2333]

  Message from syslogd@alp15 at Thu Jun 18 01:23:32 2015 ...
  alp15 vmunix: [  308.038386] NMI watchdog: BUG: soft lockup - CPU#14 stuck 
for 22s! [lock_torture_wr:2327]

   
  Stack trace output:
   root@alp15:~# dmesg | more
  [ 1717.146881] lock_torture_wr R  running task
  [ 1717.146881]
  [ 1717.146885]     0  2555      2 0x00000804
  [ 1717.146887] Call Trace:
  [ 1717.146894] [c000000c7551b820] [c000000c7551b860] 0xc000000c7551b860 
(unreliable)
  [ 1717.146899] [c000000c7551b860] [c0000000000b4fb0] __do_softirq+0x220/0x3b0
  [ 1717.146904] [c000000c7551b960] [c0000000000b5478] irq_exit+0x98/0x100
  [ 1717.146909] [c000000c7551b980] [c00000000001fa54] timer_interrupt+0xa4/0xe0
  [ 1717.146913] [c000000c7551b9b0] [c000000000002758] 
decrementer_common+0x158/0x180
  [ 1717.146922] --- interrupt: 901 at _raw_write_lock+0x68/0xc0
  [ 1717.146922]     LR = torture_rwlock_write_lock+0x28/0x40 [locktorture]
  [ 1717.146927] [c000000c7551bca0] [c000000c7551bcd0] 0xc000000c7551bcd0 
(unreliable)
  [ 1717.146934] [c000000c7551bcd0] [d00000000d4810b8] 
torture_rwlock_write_lock+0x28/0x40 [locktorture]
  [ 1717.146939] [c000000c7551bcf0] [d00000000d480578] 
lock_torture_writer+0x98/0x210 [locktorture]
  [ 1717.146944] [c000000c7551bd80] [c0000000000da4d4] kthread+0x114/0x140
  [ 1717.146948] [c000000c7551be30] [c00000000000956c] 
ret_from_kernel_thread+0x5c/0x70
  [ 1717.146951] Task dump for CPU 10:
  [ 1717.146953] lock_torture_wr R  running task        0  2537      2 
0x00000804
  [ 1717.146957] Call Trace:
  [ 1717.146961] [c000000c7557b820] [c000000c7557b860] 0xc000000c7557b860 
(unreliable)
  [ 1717.146966] [c000000c7557b860] [c0000000000b4fb0] __do_softirq+0x220/0x3b0
  [ 1717.146970] [c000000c7557b960] [c0000000000b5478] irq_exit+0x98/0x100
  [ 1717.146975] [c000000c7557b980] [c00000000001fa54] timer_interrupt+0xa4/0xe0
  [ 1717.146979] [c000000c7557b9b0] [c000000000002758] 
decrementer_common+0x158/0x180
  [ 1717.146988] --- interrupt: 901 at _raw_write_lock+0x68/0xc0
  [ 1717.146988]     LR = torture_rwlock_write_lock+0x28/0x40 [locktorture]
  [ 1717.146993] [c000000c7557bca0] [c000000c7557bcd0] 0xc000000c7557bcd0 
(unreliable)
  [ 1717.147000] [c000000c7557bcd0] [d00000000d4810b8] 
torture_rwlock_write_lock+0x28/0x40 [locktorture]
  [ 1717.147006] [c000000c7557bcf0] [d00000000d480578] 
lock_torture_writer+0x98/0x210 [locktorture]
  [ 1717.147013] [c000000c7557bd80] [c0000000000da4d4] kthread+0x114/0x140
  [ 1717.147017] [c000000c7557be30] [c00000000000956c] 
ret_from_kernel_thread+0x5c/0x70
  [ 1717.147020] Task dump for CPU 17:
  [ 1717.147021] Task dump for CPU 2:
  [ 1717.147028] lock_torture_wr R
  [ 1717.147028] lock_torture_wr R  running task
  [ 1717.147033]   running task        0  2547      2 0x00000804
  [ 1717.147042]     0  2533      2 0x00000804
  [ 1717.147044] Call Trace:
  [ 1717.147045] Call Trace:
  [ 1717.147053] [c000000c732a3820] [c000000c7f688448] 0xc000000c7f688448
  [ 1717.147056] [c000000c7555f820] [c000000c7fa48448] 0xc000000c7fa48448
  [ 1717.147059]  (unreliable)
  [ 1717.147063]  (unreliable)
  [ 1717.147063]
  [ 1717.147067]
  [ 1717.147072] Task dump for CPU 18:
  [ 1717.147073] Task dump for CPU 7:
  [ 1717.147077] lock_torture_wr R  running task
  [ 1717.147082] lock_torture_wr R    0  2555      2 0x00000804
  [ 1717.147088]   running task
  [ 1717.147088] Call Trace:
  [ 1717.147096] [c000000c7551b820] [c000000c7551b860] 0xc000000c7551b860
  [ 1717.147096]     0  2559      2 0x00000804
  [ 1717.147102] Call Trace:
  [ 1717.147105]  (unreliable)

  It is possible that we are missing this commit that fixes a deadlock
  during these tests:

  
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit?id=f548d99ef4f5ec8f7080e88ad07c44d16d058ddc

  will check the Ubuntu source shortly as see if this is the case and we
  can suggest building a kernel to see if it helps.

  The apt-get source linux-image- on the test system didn't pull down
  the sources but the kernel being used is close to the one used for
  vivid (3.19.0-25.26) so I pulled down the git source tree for it with
  git clone git://kernel.ubuntu.com/ubuntu/ubuntu-vivid.git and the
  resulting source shows that the patch for the commit mentioned is not
  applied.

  As I basically understand it, the problem that was fixed is that while
  torture_rwlock_read_lock_irq() acquires a read lock on the lock
  called:

  torture_rwlock

  anything that calls the counterpart torture_rwlock_read_unlock_irq()
  to relinquish the read lock instead ends doing a
  write_unlock_irqrestore() on the torture_rwlock() in essence leaving
  the read lock. So when the locktorture module calls something like
  torture_rwlock_write_lock() as we see in the bug description, it will
  block indefinitely as there is at least one lock reader.

  I'll go ahead and mirror this since I pretty confident this is the
  issue (also should affect Vivid).

  We'll have to figure out how to get the sources for the LTS kernel to
  build a test kernel as well.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1483343/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp

Reply via email to