------- Comment From cha...@us.ibm.com 2017-04-24 17:35 EDT-------
Actually, I see examples of taking read locks such as:

2627     /* obtain read lock on map */
2628     IREAD_LOCK(ipimap, RDWRLOCK_IMAP);
2629
2630     /* read the iag */
2631     if ((rc = diIAGRead(imap, iagno, &mp))) {
2632         IREAD_UNLOCK(ipimap);

So maybe a

/* acquire inode map lock */
IWRITE_LOCK(ipimap, RDWRLOCK_IMAP);

followed by

/* release the inode map lock */
IWRITE_UNLOCK(ipimap);

is warranted to keep the diUnmount()/diMount() somewhat atomic?

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1685899

Title:
  [Ubuntu 17.04] - JFS related call traces and system enters xmon when
  rebooted after installation

Status in linux package in Ubuntu:
  New

Bug description:
  Issue:
  ------------
  JFS related call traces and system enters xmon when rebooted after 
installation 

  Steps to reproduce:
  ---------------------------------
  1 - Install Ubuntu 17.04 the system with 
   - prepboot
  - /root [JFS filesystem]
  - swap space

  2 -After installation when rebooted it gives out call traces like as
  below:

  [    3.895246] Unable to handle kernel paging request for data at address 
0x00000000
  [    3.895278] Faulting instruction address: 0xd000000004c5df1c
  [    3.895284] Oops: Kernel access of bad area, sig: 11 [#1]
  [    3.895287] SMP NR_CPUS=2048 [    3.895288] NUMA 
  [    3.895290] pSeries
  [    3.895292] Modules linked in: ip_tables x_tables autofs4 jfs ibmvscsi 
crc32c_vpmsum
  [    3.895301] CPU: 30 PID: 923 Comm: ureadahead Not tainted 4.9.0-15-generic 
#16-Ubuntu
  [    3.895304] task: c000000381d3c800 task.stack: c000000381fd0000
  [    3.895307] NIP: d000000004c5df1c LR: d000000004c5deb0 CTR: 
c0000000001279d0
  [    3.895310] REGS: c000000381fd3500 TRAP: 0300   Not tainted  
(4.9.0-15-generic)
  [    3.895313] MSR: 800000000280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>[    
3.895322]   CR: 48008804  XER: 00000001
  [    3.895324] CFAR: c000000000008a60 DAR: 0000000000000000 DSISR: 40000000 
SOFTE: 1 
  GPR00: d000000004c5deb0 c000000381fd3780 d000000004c78c28 c0000003802f40f0 
  GPR04: d000000004c6f6f0 d000000004c72b58 0000000000000563 d000000004c78c28 
  GPR08: 0000000000000000 0000000000180e97 0000000000000000 d000000004c6a608 
  GPR12: c0000000001279d0 c00000000fb90e00 0000000000000000 0000000000000000 
  GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 
  GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 
  GPR24: 0000000000000000 0000000000001000 0000000000000000 d000000004c72b38 
  GPR28: 0000000000180e97 f000000000e1d5c0 c0000003812af240 c0000003802f40b0 
  NIP [d000000004c5df1c] __get_metapage+0x204/0x6f0 [jfs]
  [    3.895372] LR [d000000004c5deb0] __get_metapage+0x198/0x6f0 [jfs]
  [    3.895374] Call Trace:
  [    3.895378] [c000000381fd3780] [d000000004c5de6c] 
__get_metapage+0x154/0x6f0 [jfs] (unreliable)
  [    3.895384] [c000000381fd3870] [d000000004c4c368] diRead+0x130/0x260 [jfs]
  [    3.895388] [c000000381fd3920] [d000000004c424f4] jfs_iget+0x6c/0x1e0 [jfs]
  [    3.895393] [c000000381fd3950] [d000000004c43adc] jfs_lookup+0xe4/0x100 
[jfs]
  [    3.895398] [c000000381fd3a80] [c00000000032a120] lookup_slow+0xe0/0x240
  [    3.895402] [c000000381fd3b00] [c00000000032e8a8] 
walk_component+0x2d8/0x3f0
  [    3.895406] [c000000381fd3b70] [c00000000032eb94] 
link_path_walk+0x1d4/0x600
  [    3.895409] [c000000381fd3c00] [c000000000330c1c] path_openat+0xbc/0x480
  [    3.895413] [c000000381fd3c80] [c0000000003328ac] do_filp_open+0xec/0x160
  [    3.895417] [c000000381fd3db0] [c00000000031863c] do_sys_open+0x1cc/0x380
  [    3.895421] [c000000381fd3e30] [c00000000000bd84] system_call+0x38/0xe0
  [    3.895424] Instruction dump:
  [    3.895426] 7909f00e 7fc9f214 39200001 f93f0028 fbdf0030 e93d0000 71280800 
41820460 
  [    3.895433] ebdd0030 41920034 e91d0008 e93f0038 <e8e80000> 811e0000 
80e70090 39080001 
  [    3.895441] ---[ end trace c2aa9ba09ea05eac ]---
  [    3.895443] 
  [    4.088560] systemd-journald[925]: Received request to flush runtime 
journal from PID 1
  [    4.362062] crypto_register_alg 'aes' = 0
  [    4.362112] crypto_register_alg 'cbc(aes)' = 0
  [    4.362150] crypto_register_alg 'ctr(aes)' = 0
  [    4.362191] crypto_register_alg 'xts(aes)' = 0
  [    4.366949] pseries_rng: Registering IBM pSeries RNG driver

  When I first connected to the LPAR, it was unresponsive so I restarted
  it from the HMC and surprisingly it came up to the login prompt and I
  logged into the shell. I proceeded to install the matching linux-
  image-4.10.0-15-generic-dbgsym_4.10.0-15.17_ppc64el.ddeb. However, the
  installation of the matching dbgsym wasn't as helpful as I wanted it
  to be. objdump, crash tool, or addr2line wouldn't give me the source
  line correspond to the NIP address.

  I then restarted the LPAR with xmon enabled and it would drop to xmon
  immediately after attempting to remount / and at the same location as
  before at __get_metapage+0x204/0x6f0 [jfs] and again with a
  dereference of 0x0 as the cause of the data exception

  0xd00000000649df54 <__get_metapage+508>:        ld      r8,8(r29)
  0xd00000000649df58 <__get_metapage+512>:        ld      r9,56(r31)
  0xd00000000649df5c <__get_metapage+516>:        ld      r7,0(r8) <-- r8 = 0x0

  Just to see if this was consistent, I rebooted the system another 6
  times and it fell to xmon in the same location most of the time but
  one time it failed on release_metapage() instead.

  [    3.507823] Unable to handle kernel paging request for data at address 
0x00000000
  [    3.507835] Faulting instruction address: 0xd00000000667db5c
  cpu 0x16: Vector: 300 (Data Access) at [c000000639be3520]
      pc: d00000000667db5c: release_metapage+0x174/0x370 [jfs]
      lr: d00000000667da90: release_metapage+0xa8/0x370 [jfs]
      sp: c000000639be37a0
     msr: 800000000280b033
     dar: 0
   dsisr: 40000000
    current = 0xc000000639918c00
    paca    = 0xc00000000fb8c600   softe: 0        irq_happened: 0x01
      pid   = 946, comm = keyboard-setup.
  Linux version 4.10.0-15-generic (buildd@bos01-ppc64el-025) (gcc version 6.3.0 
20170221 (Ubuntu 6.3.0-8ubuntu1) ) #17-Ubuntu SMP
   Fri Mar 24 17:50:37 UTC 2017 (Ubuntu 4.10.0-15.17-generic 4.10.5)
  enter ? for help
  [c000000639be3820] d00000000666c3b4 diRead+0x17c/0x260 [jfs]
  [c000000639be38d0] d0000000066624f4 jfs_iget+0x6c/0x1e0 [jfs]
  [c000000639be3900] d000000006663adc jfs_lookup+0xe4/0x100 [jfs]
  [c000000639be3a30] c00000000034fd60 lookup_slow+0xe0/0x240
  [c000000639be3ab0] c0000000003543d8 walk_component+0x268/0x3f0
  [c000000639be3b30] c000000000354724 link_path_walk+0x1c4/0x5e0
  [c000000639be3bc0] c000000000354c98 path_lookupat+0xa8/0x1a0
  [c000000639be3c00] c000000000357390 filename_lookup+0xa0/0x1a0
  [c000000639be3d30] c000000000347f68 vfs_fstatat+0x88/0x120
  [c000000639be3d90] c0000000003485b8 SyS_newstat+0x38/0x60
  [c000000639be3e30] c00000000000b184 system_call+0x38/0xe0
  --- Exception: c01 (System Call) at 00003fff81e80ab0
  SP (3fffd930d6c0) is in userspace

  During initrd process, the  log replay of the filesystem was clean:

  Begin: Mounting root file system ... Begin: Running /scripts/local-top ... 
done.                                               
  Begin: Running /scripts/local-premount ... done.                              
                                                 
  Begin: Will now check root file system ... fsck from util-linux 2.29
  [/sbin/fsck.jfs (1) -- /dev/sda2] fsck.jfs -a /dev/sda2 
  fsck.jfs version 1.1.15, 04-Mar-2011
  processing started: 4/13/2017 0:32:47
  The current device is:  /dev/sda2
  Block size in bytes:  4096
  Filesystem size in blocks:  4882812
  **Phase 0 - Replay Journal Log
  Filesystem is clean.
  done.
  done.

  and on my seventh attempt it ended up booting all the way to the login
  prompt again where I have left it.

  I built a newer kernel based on 4.10.0-19.21 sources with
  CONFIG_JFS_DEBUG enabled and passed jfs.jfsloglevel=4 from the kernel
  command line and it took a few reboots but it hit the crash and
  extracted the traces. I need to review the output to see if there is
  anything there.

  Just before it faults, it called chkSuper() as part of mount (or
  remount) of the / volume.

  [    3.978993] __get_metapage: returning = 0xc000000386731ba8 data = 
0xc0000000fcfbb000
  [    3.978994] release_metapage: mp = 0xc000000386731ba8, flag = 0x1
  [    3.991206] jfs_lookup: name = blkid.conf
  [    3.991208] __get_metapage: ino = 32, lblock = 0x10020, abs=1
  [    3.991209] __get_metapage: returning = 0xc0000003852a8868 data = 
0xc00000037d880000
  [    3.991211] release_metapage: mp = 0xc0000003852a8868, flag = 0x1
  [    3.991238] jfs_lookup: name = mount.jfs
  [    3.991239] __get_metapage: ino = 4114, lblock = 0x3c013b, abs=1
  [    3.991240] __get_metapage: returning = 0xc0000003852a8868 data = 
0xc0000000fcfbb000
  [    3.991241] release_metapage: mp = 0xc0000003852a8868, flag = 0x1
  [    4.003520] superblock: flag:0x10200900 state:0x00000000 size:0x251ada0
  [    4.003521] Unable to handle kernel paging request for data at address 
0x00000000
  [    4.003527] __get_metapage: ino = 16, lblock = 0x0, abs=0
  [    4.003529] __get_metapage: ino = 16, lblock = 0x100002, abs=1
  [    4.003533] Faulting instruction address: 0xd0000000062effc4

  Since a successful boot wraps the kernel log buffer at the point it
  would have logged a successful chkSuper(), I don't have a baseline to
  compare but it is odd that there is no

  __get_metapage: returning = 0xc000000387a19330 data = 0xc0000000fcbc4000
  after the first
   __get_metapage: ino = 16, lblock = 0x0, abs=0

  and as mentioned now it seems with the debug code included, it takes
  maybe five or six boots between failures but it seems consistent. Here
  is another crash from a different boot:

  [    4.000793] jfs_lookup: name = mount.jfs
  [    4.000795] __get_metapage: ino = 4114, lblock = 0x3c013b, abs=1
  [    4.000796] __get_metapage: returning = 0xc000000639db5208 data = 
0xc0000000fdf8b000
  [    4.000798] release_metapage: mp = 0xc000000639db5208, flag = 0x1
  [    4.010217] superblock: flag:0x10200900 state:0x00000000 size:0x251ada0
  [    4.010224] Unable to handle kernel paging request for data at address 
0x00000000
  [    4.010227] __get_metapage: ino = 16, lblock = 0x0, abs=0
  [    4.010240] __get_metapage: ino = 16, lblock = 0x100002, abs=1
  [    4.010242] Faulting instruction address: 0xd000000006d6ffc4

  Although JFS originated from IBM, we haven't had a Linux JFS developer
  in years so mirroring to Canonical for their awareness of this problem
  and to see if they happen to have anyone knowledgeable in JFS
  internals while we keep trying to isolate this further.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1685899/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp

Reply via email to