Hello,

I have two nodes that hang on ‘mmshutdown’; more precisely, the command 
‘/sbin/rmmod mmfs26’ hangs. The kernel messages are appended below. Does 
this look familiar to anybody? Is it a known bug? I can avoid the issue by 
reducing pagepool from 128G to 64G.

Running ‘systemctl stop gpfs’ shows the same issue. systemd forcefully 
terminates the service after a timeout, but ‘rmmod’ stays stuck.

Two functions, cxiReleaseAndForgetPages and put_page, seem to be involved: the 
first is part of GPFS (module mmfslinux), the second is a kernel function.
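
To compare traces across nodes, the frames can be pulled out of the 
hard-wrapped log with a few lines of Python. This is only a minimal sketch: 
the names FRAME and frames are my own, the sample lines are copied from the 
log below, and frames without a [module] suffix are assumed to belong to the 
core kernel.

```python
import re

# Matches a stack frame such as "cxiReleaseAndForgetPages+0xb2/0x1c0 [mmfslinux]".
# The trailing "[module]" is optional; frames without it are assumed to be
# core kernel functions.
FRAME = re.compile(
    r"([A-Za-z_][A-Za-z0-9_]*)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?"
)


def frames(log_text):
    """Return (symbol, module) pairs in the order they appear in log_text."""
    # Collapse all whitespace first, so that a frame split across two lines
    # by the mail client is matched as one piece.
    joined = " ".join(log_text.split())
    return [(sym, mod or "kernel") for sym, mod in FRAME.findall(joined)]


# Three frames copied from the first call trace, including the line wrapping.
sample = """\
Jul 11 14:12:41 node-1.x.y kernel:  [<ffffffff81192275>] put_page+0x45/0x50
Jul 11 14:12:41 node-1.x.y kernel:  [<ffffffffc08e3562>] 
cxiReleaseAndForgetPages+0xb2/0x1c0 [mmfslinux]
Jul 11 14:12:41 node-1.x.y kernel:  [<ffffffffc0c70c12>] 
kxFreeAllSharedMemory+0xe2/0x160 [mmfs26]
"""

if __name__ == "__main__":
    for symbol, module in frames(sample):
        print(f"{module:10s} {symbol}")
```

Symbols that contain dots (compiler-generated suffixes) are not handled by 
this pattern.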

The servers have 256G of memory and 72 (virtual) cores each.
I run 5.0.1-1 on RHEL 7.4 with kernel 3.10.0-693.17.1.el7.x86_64.

I can try to switch back to 5.0.0.

Thank you & kind regards,

Heiner



Jul 11 14:12:04 node-1.x.y mmremote[1641]: Unloading module mmfs26
Jul 11 14:12:04 node-1.x.y mmsysmon[2440]: [E] Event raised: The Spectrum Scale 
service process not running on this node. Normal operation cannot be done
Jul 11 14:12:04 node-1.x.y mmsysmon[2440]: [I] Event raised: The Spectrum Scale 
service process is running
Jul 11 14:12:04 node-1.x.y mmsysmon[2440]: [E] Event raised: The node is not 
able to form a quorum with the other available nodes.
Jul 11 14:12:38 node-1.x.y sshd[2826]: Connection closed by xxx port 52814 
[preauth]

Jul 11 14:12:41 node-1.x.y kernel: NMI watchdog: BUG: soft lockup - CPU#28 
stuck for 23s! [rmmod:2695]

Jul 11 14:12:41 node-1.x.y kernel: Modules linked in: mmfs26(OE-) mmfslinux(OE) 
tracedev(OE) tcp_diag inet_diag rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) 
ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_fpga_tools(OE) 
mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) mlx4_ib(OE) ib_core(OE) vfat 
fat ext4 sb_edac edac_core intel_powerclamp coretemp intel_rapl iosf_mbi 
mbcache jbd2 kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw 
gf128mul glue_helper ablk_helper cryptd iTCO_wdt iTCO_vendor_support ipmi_ssif 
pcc_cpufreq hpilo ipmi_si sg hpwdt pcspkr i2c_i801 lpc_ich ipmi_devintf wmi 
ioatdma shpchp ipmi_msghandler acpi_power_meter binfmt_misc nfsd auth_rpcgss 
nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod crc_t10dif 
crct10dif_generic mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect
Jul 11 14:12:41 node-1.x.y kernel:  sysimgblt fb_sys_fops ttm ixgbe 
mlx4_core(OE) crct10dif_pclmul mdio mlx_compat(OE) crct10dif_common drm ptp 
crc32c_intel devlink hpsa pps_core i2c_core scsi_transport_sas dca dm_mirror 
dm_region_hash dm_log dm_mod [last unloaded: tracedev]
Jul 11 14:12:41 node-1.x.y kernel: CPU: 28 PID: 2695 Comm: rmmod Tainted: G     
   W  OEL ------------   3.10.0-693.17.1.el7.x86_64 #1
Jul 11 14:12:41 node-1.x.y kernel: Hardware name: HP ProLiant DL380 
Gen9/ProLiant DL380 Gen9, BIOS P89 01/22/2018
Jul 11 14:12:41 node-1.x.y kernel: task: ffff8808c4814f10 ti: ffff881619778000 
task.ti: ffff881619778000
Jul 11 14:12:41 node-1.x.y kernel: RIP: 0010:[<ffffffff816a2970>]  
[<ffffffff816a2970>] put_compound_page+0xc3/0x174
Jul 11 14:12:41 node-1.x.y kernel: RSP: 0018:ffff88161977bd50  EFLAGS: 00000246
Jul 11 14:12:41 node-1.x.y kernel: RAX: 0000000000000283 RBX: 00000000fae3d201 
RCX: 0000000000000284
Jul 11 14:12:41 node-1.x.y kernel: RDX: 0000000000000283 RSI: 0000000000000246 
RDI: ffffea003d478000
Jul 11 14:12:41 node-1.x.y kernel: RBP: ffff88161977bd68 R08: ffff881ffae3d1e0 
R09: 0000000180800059
Jul 11 14:12:41 node-1.x.y kernel: R10: 00000000fae3d201 R11: ffffea007feb8f40 
R12: 00000000fae3d201
Jul 11 14:12:41 node-1.x.y kernel: R13: ffff88161977bd40 R14: 0000000000000000 
R15: ffff88161977bd40
Jul 11 14:12:41 node-1.x.y kernel: FS:  00007f81a1db0740(0000) 
GS:ffff883ffee80000(0000) knlGS:0000000000000000
Jul 11 14:12:41 node-1.x.y kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 
0000000080050033
Jul 11 14:12:41 node-1.x.y kernel: CR2: 00007fa96e38f980 CR3: 0000000c36b2c000 
CR4: 00000000001607e0
Jul 11 14:12:41 node-1.x.y kernel: DR0: 0000000000000000 DR1: 0000000000000000 
DR2: 0000000000000000
Jul 11 14:12:41 node-1.x.y kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 
DR7: 0000000000000400

Jul 11 14:12:41 node-1.x.y kernel: Call Trace:
Jul 11 14:12:41 node-1.x.y kernel:  [<ffffffff81192275>] put_page+0x45/0x50
Jul 11 14:12:41 node-1.x.y kernel:  [<ffffffffc08e3562>] 
cxiReleaseAndForgetPages+0xb2/0x1c0 [mmfslinux]
Jul 11 14:12:41 node-1.x.y kernel:  [<ffffffffc08e3ae5>] 
cxiDeallocPageList+0x45/0x110 [mmfslinux]
Jul 11 14:12:41 node-1.x.y kernel:  [<ffffffff811e0b02>] ? 
kmem_cache_free+0x1e2/0x200
Jul 11 14:12:41 node-1.x.y kernel:  [<ffffffffc08e3cda>] 
cxiFreeSharedMemory+0x12a/0x130 [mmfslinux]
Jul 11 14:12:41 node-1.x.y kernel:  [<ffffffffc0c70c12>] 
kxFreeAllSharedMemory+0xe2/0x160 [mmfs26]
Jul 11 14:12:41 node-1.x.y kernel:  [<ffffffffc0c5bd15>] mmfs+0xc85/0xca0 
[mmfs26]
Jul 11 14:12:41 node-1.x.y kernel:  [<ffffffffc08c8f16>] gpfs_clean+0x26/0x30 
[mmfslinux]
Jul 11 14:12:41 node-1.x.y kernel:  [<ffffffffc0da5565>] 
cleanup_module+0x25/0x30 [mmfs26]
Jul 11 14:12:41 node-1.x.y kernel:  [<ffffffff8110044b>] 
SyS_delete_module+0x19b/0x300
Jul 11 14:12:41 node-1.x.y kernel:  [<ffffffff816b89fd>] 
system_call_fastpath+0x16/0x1b
Jul 11 14:12:41 node-1.x.y kernel: Code: d1 00 00 00 4c 89 e7 e8 3a ff ff ff e9 
c4 00 00 00 4c 39 e3 74 c1 41 8b 54 24 1c 85 d2 74 b8 8d 4a 01 89 d0 f0 41 0f 
b1 4c 24 1c <39> c2 74 04 89 c2 eb e8 e8 f3 f0 ae ff 49 89 c5 f0 41 0f ba 2c

Jul 11 14:13:23 node-1.x.y systemd[1]: gpfs.service stopping timed out. 
Terminating.

Jul 11 14:13:27 node-1.x.y kernel: NMI watchdog: BUG: soft lockup - CPU#28 
stuck for 21s! [rmmod:2695]

Jul 11 14:13:27 node-1.x.y kernel: Modules linked in: mmfs26(OE-) mmfslinux(OE) 
tracedev(OE) tcp_diag inet_diag rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) 
ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_fpga_tools(OE) 
mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) mlx4_ib(OE) ib_core(OE) vfat 
fat ext4 sb_edac edac_core intel_powerclamp coretemp intel_rapl iosf_mbi 
mbcache jbd2 kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw 
gf128mul glue_helper ablk_helper cryptd iTCO_wdt iTCO_vendor_support ipmi_ssif 
pcc_cpufreq hpilo ipmi_si sg hpwdt pcspkr i2c_i801 lpc_ich ipmi_devintf wmi 
ioatdma shpchp ipmi_msghandler
Jul 11 14:13:27 node-1.x.y kernel: INFO: rcu_sched detected stalls on 
CPUs/tasks:
Jul 11 14:13:27 node-1.x.y kernel:  {
Jul 11 14:13:27 node-1.x.y kernel:  28
Jul 11 14:13:27 node-1.x.y kernel: }
Jul 11 14:13:27 node-1.x.y kernel: (detected by 17, t=60002 jiffies, g=267734, 
c=267733, q=36089)
Jul 11 14:13:27 node-1.x.y kernel: Task dump for CPU 28:
Jul 11 14:13:27 node-1.x.y kernel: rmmod           R
Jul 11 14:13:27 node-1.x.y kernel:   running task
Jul 11 14:13:27 node-1.x.y kernel:     0  2695   2642 0x00000008
Jul 11 14:13:27 node-1.x.y kernel: Call Trace:
Jul 11 14:13:27 node-1.x.y kernel:
Jul 11 14:13:27 node-1.x.y kernel:  [<ffffffff811dea1c>] ? 
__free_slab+0xdc/0x200
Jul 11 14:13:27 node-1.x.y kernel:
Jul 11 14:13:27 node-1.x.y kernel:  [<ffffffff816a28ad>] ? 
__put_compound_page+0x22/0x22
Jul 11 14:13:27 node-1.x.y kernel:
Jul 11 14:13:27 node-1.x.y kernel:  [<ffffffff81192275>] ? put_page+0x45/0x50
Jul 11 14:13:27 node-1.x.y kernel:
Jul 11 14:13:27 node-1.x.y kernel:  [<ffffffffc08e3562>] ? 
cxiReleaseAndForgetPages+0xb2/0x1c0 [mmfslinux]
Jul 11 14:13:27 node-1.x.y kernel:
Jul 11 14:13:27 node-1.x.y kernel:  [<ffffffffc08e3ae5>] ? 
cxiDeallocPageList+0x45/0x110 [mmfslinux]
Jul 11 14:13:27 node-1.x.y kernel:
Jul 11 14:13:27 node-1.x.y kernel:  [<ffffffffc08e3cda>] ? 
cxiFreeSharedMemory+0x12a/0x130 [mmfslinux]
Jul 11 14:13:27 node-1.x.y kernel:
Jul 11 14:13:27 node-1.x.y kernel:  [<ffffffffc0c70c12>] ? 
kxFreeAllSharedMemory+0xe2/0x160 [mmfs26]
Jul 11 14:13:27 node-1.x.y kernel:
Jul 11 14:13:27 node-1.x.y kernel:  [<ffffffffc0c5bd15>] ? mmfs+0xc85/0xca0 
[mmfs26]
Jul 11 14:13:27 node-1.x.y kernel:
Jul 11 14:13:27 node-1.x.y kernel:  [<ffffffffc08c8f16>] ? gpfs_clean+0x26/0x30 
[mmfslinux]
Jul 11 14:13:27 node-1.x.y kernel:
Jul 11 14:13:27 node-1.x.y kernel:  [<ffffffffc0da5565>] ? 
cleanup_module+0x25/0x30 [mmfs26]
Jul 11 14:13:27 node-1.x.y kernel:
Jul 11 14:13:27 node-1.x.y kernel:  [<ffffffff8110044b>] ? 
SyS_delete_module+0x19b/0x300
Jul 11 14:13:27 node-1.x.y kernel:
Jul 11 14:13:27 node-1.x.y kernel:  [<ffffffff816b89fd>] ? 
system_call_fastpath+0x16/0x1b
Jul 11 14:13:27 node-1.x.y kernel:  acpi_power_meter
Jul 11 14:13:27 node-1.x.y kernel:  binfmt_misc nfsd auth_rpcgss nfs_acl lockd 
grace sunrpc ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic 
mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt 
fb_sys_fops ttm ixgbe mlx4_core(OE) crct10dif_pclmul mdio mlx_compat(OE) 
crct10dif_common drm ptp crc32c_intel devlink hpsa pps_core i2c_core 
scsi_transport_sas dca dm_mirror dm_region_hash dm_log dm_mod [last unloaded: 
tracedev]
Jul 11 14:13:27 node-1.x.y kernel: CPU: 28 PID: 2695 Comm: rmmod Tainted: G     
   W  OEL ------------   3.10.0-693.17.1.el7.x86_64 #1
Jul 11 14:13:27 node-1.x.y kernel: Hardware name: HP ProLiant DL380 
Gen9/ProLiant DL380 Gen9, BIOS P89 01/22/2018
Jul 11 14:13:27 node-1.x.y kernel: task: ffff8808c4814f10 ti: ffff881619778000 
task.ti: ffff881619778000
Jul 11 14:13:27 node-1.x.y kernel: RIP: 0010:[<ffffffff816a28ad>]  
[<ffffffff816a28ad>] __put_compound_page+0x22/0x22
Jul 11 14:13:27 node-1.x.y kernel: RSP: 0018:ffff88161977bd70  EFLAGS: 00000282
Jul 11 14:13:27 node-1.x.y kernel: RAX: 002fffff00008010 RBX: 0000000000000135 
RCX: 00000000000001c1
Jul 11 14:13:27 node-1.x.y kernel: RDX: ffff8814adbbf000 RSI: 0000000000000246 
RDI: ffffea00650e7040
Jul 11 14:13:27 node-1.x.y kernel: RBP: ffff88161977bd78 R08: ffff881ffae3df60 
R09: 0000000180800052
Jul 11 14:13:27 node-1.x.y kernel: R10: 00000000fae3db01 R11: ffffea007feb8f40 
R12: ffff881ffae3df60
Jul 11 14:13:27 node-1.x.y kernel: R13: 0000000180800052 R14: 00000000fae3db01 
R15: ffffea007feb8f40
Jul 11 14:13:27 node-1.x.y kernel: FS:  00007f81a1db0740(0000) 
GS:ffff883ffee80000(0000) knlGS:0000000000000000
Jul 11 14:13:27 node-1.x.y kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 
0000000080050033
Jul 11 14:13:27 node-1.x.y kernel: CR2: 00007fa96e38f980 CR3: 0000000c36b2c000 
CR4: 00000000001607e0
Jul 11 14:13:27 node-1.x.y kernel: DR0: 0000000000000000 DR1: 0000000000000000 
DR2: 0000000000000000
Jul 11 14:13:27 node-1.x.y kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 
DR7: 0000000000000400
Jul 11 14:13:27 node-1.x.y kernel: Call Trace:
Jul 11 14:13:27 node-1.x.y kernel:  [<ffffffff81192275>] ? put_page+0x45/0x50
Jul 11 14:13:27 node-1.x.y kernel:  [<ffffffffc08e3562>] 
cxiReleaseAndForgetPages+0xb2/0x1c0 [mmfslinux]
Jul 11 14:13:27 node-1.x.y kernel:  [<ffffffffc08e3ae5>] 
cxiDeallocPageList+0x45/0x110 [mmfslinux]
Jul 11 14:13:27 node-1.x.y kernel:  [<ffffffffc08e3cda>] 
cxiFreeSharedMemory+0x12a/0x130 [mmfslinux]
Jul 11 14:13:27 node-1.x.y kernel:  [<ffffffffc0c70c12>] 
kxFreeAllSharedMemory+0xe2/0x160 [mmfs26]
Jul 11 14:13:27 node-1.x.y kernel:  [<ffffffffc0c5bd15>] mmfs+0xc85/0xca0 
[mmfs26]
Jul 11 14:13:27 node-1.x.y kernel:  [<ffffffffc08c8f16>] gpfs_clean+0x26/0x30 
[mmfslinux]
Jul 11 14:13:27 node-1.x.y kernel:  [<ffffffffc0da5565>] 
cleanup_module+0x25/0x30 [mmfs26]
Jul 11 14:13:27 node-1.x.y kernel:  [<ffffffff8110044b>] 
SyS_delete_module+0x19b/0x300
Jul 11 14:13:27 node-1.x.y kernel:  [<ffffffff816b89fd>] 
system_call_fastpath+0x16/0x1b
Jul 11 14:13:27 node-1.x.y kernel: Code: c0 0f 95 c0 0f b6 c0 5d c3 0f 1f 44 00 
00 55 48 89 e5 53 48 8b 07 48 89 fb a8 20 74 05 e8 0c f8 ae ff 48 89 df ff 53 
60 5b 5d c3 <0f> 1f 44 00 00 55 48 89 e5 41 55 41 54 53 48 8b 07 48 89 fb f6

--
Paul Scherrer Institut
Science IT
Heiner Billich
WHGA 106
CH 5232  Villigen PSI
056 310 36 02
https://www.psi.ch



