Have you had a close look at the logs from your subnet manager? Assuming you run OpenSM on a server, this is opensm.log.
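In case it helps, here is a rough sketch of what I would look for first (this assumes OpenSM runs as a normal service and logs to the default /var/log/opensm.log; adjust the path if yours differs). The idea is to see whether the SM is logging traps, errors, or repeated sweeps around the times the clients lose their connections, and whether the port error counters on the fabric are climbing:

    # on the subnet manager node: traps, errors, and sweep activity
    grep -Ei 'error|trap|sweep' /var/log/opensm.log | tail -n 100

    # from any node with infiniband-diags installed:
    # port error counters and link state/width/speed across the fabric
    ibqueryerrors
    iblinkinfo

If the symbol error / link downed counters on the ports facing the ConnectX-3 clients are climbing, that would point at a fabric problem rather than at Lustre itself.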
On Fri, 21 Jun 2024 at 16:35, Kurt Strosahl via lustre-discuss <[email protected]> wrote:

> Good Morning,
>
> We've been experiencing a fairly nasty issue with our clients following
> our move to Alma 9. It seems to occur randomly (a few days to over a
> week): the clients with ConnectX-3 cards start getting LNet network
> errors and seeing hangs move across random OSTs spread over our OSS
> systems, as well as issues talking to the MGS. This can then trigger
> crash cycles on the OSS systems themselves (again in the LNet layer).
> The only answer we have found so far is to power down all the impacted
> clients and let the impacted OSS systems reboot.
>
> Here is a snippet of the error as we see it on the client:
>
> [Jun21 08:16] Lustre: lustre19-OST0020-osc-ffff934c22a29800: Connection restored to 172.17.0.97@o2ib (at 172.17.0.97@o2ib)
> [ +0.000006] Lustre: Skipped 2 previous similar messages
> [ +3.079695] Lustre: lustre19-MDT0000-mdc-ffff934c22a29800: Connection restored to 172.17.0.37@o2ib (at 172.17.0.37@o2ib)
> [ +0.223480] LustreError: 4478:0:(events.c:211:client_bulk_callback()) event type 2, status -5, desc 00000000784c6e4f
> [ +0.000007] LustreError: 4478:0:(events.c:211:client_bulk_callback()) Skipped 3 previous similar messages
> [ +22.955501] Lustre: 3935794:0:(client.c:2289:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1718972176/real 1718972176] req@000000008c377199 x1801581392820160/t0(0) o13->[email protected]@o2ib:7/4 lens 224/368 e 0 to 1 dl 1718972183 ref 2 fl Rpc:eXQr/0/ffffffff rc 0/-1 job:'lfs.7953'
> [ +0.000006] Lustre: 3935794:0:(client.c:2289:ptlrpc_expire_one_request()) Skipped 21 previous similar messages
> [ +20.333921] Lustre: lustre19-OST000a-osc-ffff934c22a29800: Connection restored to 172.17.0.39@o2ib (at 172.17.0.39@o2ib)
> [Jun21 08:17] LustreError: 166-1: MGC172.17.0.36@o2ib: Connection to MGS (at 172.17.0.37@o2ib) was lost; in progress operations using this service will fail
> [ +0.000302] Lustre: lustre19-OST0046-osc-ffff934c22a29800: Connection to lustre19-OST0046 (at 172.17.0.103@o2ib) was lost; in progress operations using this service will wait for recovery to complete
> [ +0.000005] Lustre: Skipped 6 previous similar messages
> [ +6.144196] Lustre: MGC172.17.0.36@o2ib: Connection restored to 172.17.0.37@o2ib (at 172.17.0.37@o2ib)
> [ +0.000006] Lustre: Skipped 1 previous similar message
>
> We have a mix of client hardware, but the systems are uniform in their
> kernels and Lustre client versions.
>
> Here are the software versions:
>
> kernel-modules-core-5.14.0-362.24.1.el9_3.x86_64
> kernel-core-5.14.0-362.24.1.el9_3.x86_64
> kernel-modules-5.14.0-362.24.1.el9_3.x86_64
> kernel-5.14.0-362.24.1.el9_3.x86_64
> texlive-l3kernel-20200406-26.el9_2.noarch
> kernel-modules-core-5.14.0-362.24.2.el9_3.x86_64
> kernel-core-5.14.0-362.24.2.el9_3.x86_64
> kernel-modules-5.14.0-362.24.2.el9_3.x86_64
> kernel-tools-libs-5.14.0-362.24.2.el9_3.x86_64
> kernel-tools-5.14.0-362.24.2.el9_3.x86_64
> kernel-5.14.0-362.24.2.el9_3.x86_64
> kernel-headers-5.14.0-362.24.2.el9_3.x86_64
>
> and Lustre:
>
> kmod-lustre-client-2.15.4-1.el9.jlab.x86_64
> lustre-client-2.15.4-1.el9.jlab.x86_64
>
> Our OSS systems are running el7, are running MOFED for their InfiniBand
> stack, and have ConnectX-3 cards:
>
> kernel-tools-libs-3.10.0-1160.76.1.el7.x86_64
> kernel-tools-3.10.0-1160.76.1.el7.x86_64
> kernel-headers-3.10.0-1160.76.1.el7.x86_64
> kernel-abi-whitelists-3.10.0-1160.76.1.el7.noarch
> kernel-devel-3.10.0-1160.76.1.el7.x86_64
> kernel-3.10.0-1160.76.1.el7.x86_64
>
> and Lustre version:
>
> lustre-2.12.9-1.el7.x86_64
> kmod-lustre-osd-zfs-2.12.9-1.el7.x86_64
> lustre-osd-zfs-mount-2.12.9-1.el7.x86_64
> lustre-resource-agents-2.12.9-1.el7.x86_64
> kmod-lustre-2.12.9-1.el7.x86_64
>
> w/r,
>
> Kurt J. Strosahl (he/him)
> System Administrator: Lustre, HPC
> Scientific Computing Group, Thomas Jefferson National Accelerator Facility
_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
