Good Morning,
Is there anything in particular I should be looking for?
I'm seeing messages like this:
Jun 21 08:11:57 125855 [725C7640] 0x02 -> do_sweep: Entering heavy sweep with
flags: force_heavy_sweep 1, coming out of standby 0, subnet initialization
error 0, sm port change 0
Jun 21 08:11:57 137557 [8E3FD640] 0x02 -> osm_spst_rcv_process: Switch
0xfc6a1c03006047c0 MF0;qcd24s-ndr-leaf-2-sw:MQM9700/U1 port 85 (1/22/1/1)
changed state from ACTIVE to DOWN
Jun 21 08:11:57 140078 [883F1640] 0x02 -> osm_pi_rcv_process: Switch
0xfc6a1c03006047c0 MF0;qcd24s-ndr-leaf-2-sw:MQM9700/U1 port 85(1/22/1/1)
changed state from ACTIVE to DOWN
Jun 21 08:11:57 152147 [725C7640] 0x02 -> log_notice: Reporting Generic Notice
type:3 num:65 (GID out of service) from LID:15 GID:fe80::b83f:d203:e8:2320
Jun 21 08:11:57 152235 [725C7640] 0x02 -> drop_mgr_remove_port: Removed port
with GUID:0xb83fd20300e82320 LID range [452, 452] of node:MT4129 ConnectX7
Mellanox Technologies
Jun 21 08:11:57 156557 [725C7640] 0x02 -> updn_lid_matrices: disabling UPDN
algorithm, no root nodes were found
Jun 21 08:11:57 156571 [725C7640] 0x01 -> ucast_mgr_route: ar_updn: cannot
build lid matrices.
Jun 21 08:11:57 159797 [725C7640] 0x02 -> osm_ucast_mgr_process: minhop tables
configured on all switches
Jun 21 08:11:57 181996 [6FDC2640] 0x01 -> log_rcv_cb_error: ERR 3111: Received
MAD with error status = 0x1C
SubnGetResp(SLtoVLMappingTable), attr_mod 0x30000, TID
0x11c66ae, dest_guid 0x0000000000000000
Initial path: 0,1,32,16 Return path: 0,22,23,19
Jun 21 08:11:57 182104 [6FDC2640] 0x01 -> log_rcv_cb_error: ERR 3111: Received
MAD with error status = 0x1C
SubnGetResp(SLtoVLMappingTable), attr_mod 0x30000, TID
0x11c66a2, dest_guid 0x0000000000000000
Initial path: 0,1,31,3 Return path: 0,22,20,19
Jun 21 08:11:57 182135 [6FDC2640] 0x01 -> log_rcv_cb_error: ERR 3111: Received
MAD with error status = 0x1C
SubnGetResp(SLtoVLMappingTable), attr_mod 0x30000, TID
0x11c6695, dest_guid 0x0000000000000000
Initial path: 0,1,13 Return path: 0,22,7
Jun 21 08:11:57 310544 [725C7640] 0x02 -> SUBNET UP
Jun 21 08:12:03 569958 [89BF4640] 0x01 -> log_trap_info: Received Generic
Notice type:1 num:128 (Link state change) Producer:2 (Switch) from LID:247
TID:0x000018f300000080
Jun 21 08:12:03 570010 [89BF4640] 0x02 -> SM class trap 128: Directed Path Dump
of 4 hop path: Path = 0,1,26,19,37
Jun 21 08:12:03 570019 [89BF4640] 0x02 -> log_notice: Reporting Generic Notice
type:1 num:128 (Link state change) from LID:247 GID:fe80::fc6a:1c03:60:47c0
Jun 21 08:12:03 570054 [725C7640] 0x02 -> do_sweep:
w/r,
Kurt
________________________________
From: John Hearns <[email protected]>
Sent: Saturday, June 22, 2024 2:54 AM
To: Kurt Strosahl <[email protected]>
Cc: [email protected] <[email protected]>;
[email protected] <[email protected]>
Subject: [EXTERNAL] Re: [lustre-discuss] lnet instability over infiniband when
running el9 + ConnectX-3 hardware
Have you had a close look at the logs from your subnet manager?
Assuming you run OpenSM on a server, this is opensm.log.
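As a sketch of what that log scan might look like: error-level OpenSM entries carry the "0x01 ->" severity marker, and link flaps log "changed state from ACTIVE to DOWN" (both conventions inferred from the snippet earlier in the thread; the /var/log/opensm.log path is an assumption, adjust as needed).

```shell
# Error-level opensm entries use "0x01 ->"; link flaps say
# "changed state from ACTIVE to DOWN". One pattern catches both.
PATTERN='0x01 ->|changed state from ACTIVE to DOWN'

# Against a live log (path is an assumption):
#   grep -E "$PATTERN" /var/log/opensm.log | tail -n 50

# Demo on two lines lifted from the snippet above; only the
# 0x01 error line survives the filter.
printf '%s\n' \
  'Jun 21 08:11:57 310544 [725C7640] 0x02 -> SUBNET UP' \
  'Jun 21 08:11:57 181996 [6FDC2640] 0x01 -> log_rcv_cb_error: ERR 3111' \
  | grep -E "$PATTERN"
```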
On Fri, 21 Jun 2024 at 16:35, Kurt Strosahl via lustre-discuss
<[email protected]<mailto:[email protected]>> wrote:
Good Morning,
We've been experiencing a fairly nasty issue with our clients following our
move to Alma 9. It occurs seemingly at random (anywhere from a few days to over
a week apart): clients with ConnectX-3 cards start getting LNet network errors
and see hangs that move across random OSTs spread over our OSS systems, as well
as trouble talking to the MGS. This can then trigger crash cycles on the OSS
systems themselves (again in the LNet layer). The only fix we have found so far
is to power down all the impacted clients and let the impacted OSS systems
reboot.
Here is a snippet of the error as we see it on the client:
[Jun21 08:16] Lustre: lustre19-OST0020-osc-ffff934c22a29800: Connection restored
to 172.17.0.97@o2ib (at 172.17.0.97@o2ib)
[ +0.000006] Lustre: Skipped 2 previous similar messages
[ +3.079695] Lustre: lustre19-MDT0000-mdc-ffff934c22a29800: Connection
restored to 172.17.0.37@o2ib (at 172.17.0.37@o2ib)
[ +0.223480] LustreError: 4478:0:(events.c:211:client_bulk_callback()) event
type 2, status -5, desc 00000000784c6e4f
[ +0.000007] LustreError: 4478:0:(events.c:211:client_bulk_callback()) Skipped
3 previous similar messages
[ +22.955501] Lustre: 3935794:0:(client.c:2289:ptlrpc_expire_one_request()) @@@
Request sent has failed due to network error: [sent 1718972176/real 1718972176]
req@000000008c377199 x1801581392820160/t0(0)
o13->[email protected]@o2ib:7/4 lens 224/368 e
0 to 1 dl 1718972183 ref 2 fl Rpc:eXQr/0/ffffffff rc 0/-1 job:'lfs.7953'
[ +0.000006] Lustre: 3935794:0:(client.c:2289:ptlrpc_expire_one_request())
Skipped 21 previous similar messages
[ +20.333921] Lustre: lustre19-OST000a-osc-ffff934c22a29800: Connection
restored to 172.17.0.39@o2ib (at 172.17.0.39@o2ib)
[Jun21 08:17] LustreError: 166-1: MGC172.17.0.36@o2ib: Connection to MGS (at
172.17.0.37@o2ib) was lost; in progress operations using this service will fail
[ +0.000302] Lustre: lustre19-OST0046-osc-ffff934c22a29800: Connection to
lustre19-OST0046 (at 172.17.0.103@o2ib) was lost; in progress operations using
this service will wait for recovery to complete
[ +0.000005] Lustre: Skipped 6 previous similar messages
[ +6.144196] Lustre: MGC172.17.0.36@o2ib: Connection restored to
172.17.0.37@o2ib (at 172.17.0.37@o2ib)
[ +0.000006] Lustre: Skipped 1 previous similar message
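When this hits, a quick client-side sanity check is whether LNet can still reach the server NIDs at all. A minimal sketch using lctl ping (the NIDs are taken from the log above; lctl ships with the Lustre client packages, and on a box without Lustre every ping simply fails and the NID is reported unreachable):

```shell
# Ping each server NID over LNet and report reachability.
check_nids() {
    for nid in "$@"; do
        if lctl ping "$nid" >/dev/null 2>&1; then
            echo "$nid reachable"
        else
            echo "$nid UNREACHABLE"
        fi
    done
}

# MGS and a couple of OSS NIDs from the client log above.
check_nids 172.17.0.36@o2ib 172.17.0.37@o2ib 172.17.0.97@o2ib
```

On 2.15 clients, `lnetctl net show` is also worth a look to confirm the o2ib NI itself reports healthy.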
We have a mix of client hardware, but all systems run the same kernel and
Lustre client.
Here are the software versions:
kernel-modules-core-5.14.0-362.24.1.el9_3.x86_64
kernel-core-5.14.0-362.24.1.el9_3.x86_64
kernel-modules-5.14.0-362.24.1.el9_3.x86_64
kernel-5.14.0-362.24.1.el9_3.x86_64
texlive-l3kernel-20200406-26.el9_2.noarch
kernel-modules-core-5.14.0-362.24.2.el9_3.x86_64
kernel-core-5.14.0-362.24.2.el9_3.x86_64
kernel-modules-5.14.0-362.24.2.el9_3.x86_64
kernel-tools-libs-5.14.0-362.24.2.el9_3.x86_64
kernel-tools-5.14.0-362.24.2.el9_3.x86_64
kernel-5.14.0-362.24.2.el9_3.x86_64
kernel-headers-5.14.0-362.24.2.el9_3.x86_64
and lustre:
kmod-lustre-client-2.15.4-1.el9.jlab.x86_64
lustre-client-2.15.4-1.el9.jlab.x86_64
Our OSS systems run EL7 with MOFED for their InfiniBand stack, and have
ConnectX-3 cards:
kernel-tools-libs-3.10.0-1160.76.1.el7.x86_64
kernel-tools-3.10.0-1160.76.1.el7.x86_64
kernel-headers-3.10.0-1160.76.1.el7.x86_64
kernel-abi-whitelists-3.10.0-1160.76.1.el7.noarch
kernel-devel-3.10.0-1160.76.1.el7.x86_64
kernel-3.10.0-1160.76.1.el7.x86_64
and lustre version
lustre-2.12.9-1.el7.x86_64
kmod-lustre-osd-zfs-2.12.9-1.el7.x86_64
lustre-osd-zfs-mount-2.12.9-1.el7.x86_64
lustre-resource-agents-2.12.9-1.el7.x86_64
kmod-lustre-2.12.9-1.el7.x86_64
w/r,
Kurt J. Strosahl (he/him)
System Administrator: Lustre, HPC
Scientific Computing Group, Thomas Jefferson National Accelerator Facility
_______________________________________________
lustre-discuss mailing list
[email protected]<mailto:[email protected]>
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org