Hello all,

I'm hoping someone can help me with an issue I've been seeing periodically since updating to Lustre 2.15.x.

Back in July we remade our scratch storage volume with 2.15 after running 2.12 for a long time. As part of the upgrade we reinstalled our oss and mds nodes with rhel 8 so we'd finally be able to take advantage of project quotas.

For background, our nodes are connected via omnipath and we have a mix of el7 clients, el8 clients, and el9 clients. We've been piecemeal updating our clients to el9 over the past few months with the goal of being 99% el9 clients by mid January.

Since upgrading we've seen recurring connectivity issues arise in the cluster from time to time which seem to be very strongly correlated with a client crashing. The fabric itself seems fine. There's no evidence of error or any packet loss, so I cannot confidently blame it.

On an oss that's having trouble communicating we see the following messages for various osts:

kernel: LNetError: 31282:0:(o2iblnd_cb.c:3358:kiblnd_check_txs_locked()) Timed out tx: active_txs(WSQ:100), 19 seconds
kernel: LNetError: 31282:0:(o2iblnd_cb.c:3358:kiblnd_check_txs_locked()) Skipped 6 previous similar messages
kernel: LNetError: 31282:0:(o2iblnd_cb.c:3428:kiblnd_check_conns()) Timed out RDMA with 172.16.100.19@o2ib (4): c: 31, oc: 0, rc: 31
kernel: LNetError: 31282:0:(o2iblnd_cb.c:3428:kiblnd_check_conns()) Skipped 6 previous similar messages
kernel: LustreError: 109351:0:(ldlm_lib.c:3543:target_bulk_io()) @@@ network error on bulk WRITE req@00000000f848e208 x1751146351599040/t0(0) o4->[email protected]@o2ib:19/0 lens 488/448 e 0 to 0 dl 1671634949 ref 1 fl Interpret:/2/0 rc 0/0 job:'873032'
kernel: Lustre: work-OST0007: Bulk IO write error with 10ccde33-01ef-47cf-873d-da9a1b6bb1ea (at 172.16.100.19@o2ib), client will retry: rc = -110
kernel: Lustre: Skipped 5 previous similar messages
kernel: LustreError: 109351:0:(ldlm_lib.c:3543:target_bulk_io()) @@@ network error on bulk WRITE req@000000008176189a x1751153771751616/t0(0) o4->[email protected]@o2ib:94/0 lens 488/448 e 0 to 0 dl 1671635024 ref 1 fl Interpret:/0/0 rc 0/0 job:'873641'
Lustre: work-OST000d: Bulk IO write error with 63e88958-d0b1-4c7f-8413-da96b181cd92 (at 172.16.100.25@o2ib), client will retry: rc = -110

An already connected client will see messages along the lines of:

kernel: Lustre: 3736:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1671634931/real 1671634931] req@ffff94a0ed490000 x1751139078233344/t0(0) o9->[email protected]@o2ib:28/4 lens 224/224 e 0 to 1

A stack trace on one of the clients of an ls against the volume that hangs:

[<ffffffffc1576d05>] cl_sync_io_wait+0x1c5/0x480 [obdclass]
[<ffffffffc1572943>] cl_lock_request+0x1d3/0x210 [obdclass]
[<ffffffffc17db1d9>] cl_glimpse_lock+0x329/0x380 [lustre]
[<ffffffffc17db5a5>] cl_glimpse_size0+0x255/0x280 [lustre]
[<ffffffffc1793cdc>] ll_getattr_dentry+0x50c/0x9c0 [lustre]
[<ffffffffc17941ae>] ll_getattr+0x1e/0x20 [lustre]
[<ffffffff99a53d49>] vfs_getattr+0x49/0x80
[<ffffffff99a53e55>] vfs_fstatat+0x75/0xc0
[<ffffffff99a54261>] SYSC_newlstat+0x31/0x60
[<ffffffff99a546ce>] SyS_newlstat+0xe/0x10
[<ffffffff99f99f92>] system_call_fastpath+0x25/0x2a
[<ffffffffffffffff>] 0xffffffffffffffff

If we try to reboot a client to clear the issue it won't be able to mount the filesystem; the mount process hangs in D until it times out.

A stack trace from the mount:

[<0>] llog_process_or_fork+0x2de/0x570 [obdclass]
[<0>] llog_process+0x10/0x20 [obdclass]
[<0>] class_config_parse_llog+0x1eb/0x3e0 [obdclass]
[<0>] mgc_process_cfg_log+0x659/0xc90 [mgc]
[<0>] mgc_process_log+0x667/0x800 [mgc]
[<0>] mgc_process_config+0x42b/0x6e0 [mgc]
[<0>] obd_process_config.constprop.0+0x76/0x1a0 [obdclass]
[<0>] lustre_process_log+0x562/0x8f0 [obdclass]
[<0>] ll_fill_super+0x6ec/0x1020 [lustre]
[<0>] lustre_fill_super+0xe4/0x470 [lustre]
[<0>] mount_nodev+0x41/0x90
[<0>] legacy_get_tree+0x24/0x40
[<0>] vfs_get_tree+0x22/0xb0
[<0>] do_new_mount+0x176/0x310
[<0>] __x64_sys_mount+0x103/0x140
[<0>] do_syscall_64+0x38/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae

When this has happened in the past we've had to shut down the entire lustre backend, and if we're lucky clients would start recovering once everything was back up again. The last time this happened we actually had to shut down the entire cluster.

In this particular case rebooting the osses and mds/mgs has stopped the errors on the server side, but el9 and el8 clients are now seeing:

kernel: LNetError: 149017:0:(o2iblnd_cb.c:966:kiblnd_post_tx_locked()) Error -22 posting transmit to 172.16.100.250@o2ib
kernel: LNetError: 149017:0:(o2iblnd_cb.c:966:kiblnd_post_tx_locked()) Skipped 31 previous similar messages

and

LustreError: 3585788:0:(file.c:5096:ll_inode_revalidate_fini()) work: revalidate FID [0x200000007:0x1:0x0] error: rc = -4

el7 clients have recovered seemingly without issue.

Wondering if anyone might have any suggestions on where to look as to why this is breaking, or pointers as to how to recover from this situation without rebooting if possible.

_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
