Hello,

I’ve been scratching my head for three days now, but I cannot get a simple ping to work over InfiniBand using LNet. To be honest, I have no idea what may be happening. LNet over TCP (on Ethernet) seems to work fine. The only LNet ping that works on the o2ib network is the node pinging itself:

[root@mds1 ~]# lctl ping 10.148.0.20@o2ib1
12345-0@lo
12345-10.24.2.12@tcp1
12345-10.148.0.20@o2ib1

Everything else just fails:

[root@mds1 ~]# lctl ping 10.148.0.21@o2ib1
failed to ping 10.148.0.21@o2ib1: Input/output error
[root@mds1 ~]# dmesg -T | tail -n 2
[Tue Jan 19 01:26:01 2021] LNet: 2424:0:(o2iblnd_cb.c:3405:kiblnd_check_conns()) Timed out tx for 10.148.0.21@o2ib1: 5095 seconds
[Tue Jan 19 01:26:01 2021] LNetError: 2362:0:(lib-move.c:2955:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.148.0.21@o2ib1: -125

I can confirm that the IPoIB network is working as expected:

[root@mds1 ~]# ping 10.148.0.21
PING 10.148.0.21 (10.148.0.21) 56(84) bytes of data.
64 bytes from 10.148.0.21: icmp_seq=1 ttl=64 time=2.52 ms
64 bytes from 10.148.0.21: icmp_seq=2 ttl=64 time=0.085 ms

The configuration seems to match between the two example machines:

[root@mds1 ~]# ifconfig ib0 | head -n 2
Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 65520
        inet 10.148.0.20  netmask 255.255.0.0  broadcast 10.148.255.255

[root@mds2 ~]# ifconfig ib0 | head -n 2
Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 65520
        inet 10.148.0.21  netmask 255.255.0.0  broadcast 10.148.255.255

Here’s the LNet network configuration:

[root@mds1 ~]# lnetctl net show
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
    - net type: tcp1
      local NI(s):
        - nid: 10.24.2.12@tcp1
          status: up
          interfaces:
              0: bond0
    - net type: o2ib1
      local NI(s):
        - nid: 10.148.0.20@o2ib1
          status: up
          interfaces:
              0: ib0

The modules seem to be loaded:

[root@mds1 ~]# lsmod | egrep "mlx|mlnx|lnet|rdma|ko2iblnd"
lnet_selftest         274357  0
ko2iblnd              238469  1
lnet                  595358  4 ko2iblnd,lnet_selftest,ksocklnd
libcfs                415577  4 lnet,ko2iblnd,lnet_selftest,ksocklnd
rdma_ucm               26931  0
rdma_cm                64252  2 ko2iblnd,rdma_ucm
iw_cm                  43918  1 rdma_cm
ib_cm                  53015  3 rdma_cm,ib_ucm,ib_ipoib
mlx4_en               142468  0
mlx4_ib               220791  0
mlx4_core             361489  2 mlx4_en,mlx4_ib
mlx5_ib               398193  0
ib_uverbs             134646  3 mlx5_ib,ib_ucm,rdma_ucm
ib_core               379808  11 rdma_cm,ib_cm,iw_cm,ko2iblnd,mlx4_ib,mlx5_ib,ib_ucm,ib_umad,ib_uverbs,rdma_ucm,ib_ipoib
mlx5_core            1113637  1 mlx5_ib
mlxfw                  18227  1 mlx5_core
devlink                60067  4 mlx4_en,mlx4_ib,mlx4_core,mlx5_core
mlx_compat             47141  15 rdma_cm,ib_cm,iw_cm,ko2iblnd,mlx4_en,mlx4_ib,mlx5_ib,ib_ucm,ib_core,ib_umad,ib_uverbs,mlx4_core,mlx5_core,rdma_ucm,ib_ipoib
ptp                    23551  3 i40e,mlx4_en,mlx5_core

Both systems are running CentOS 7.9, Lustre 2.12.6 (IB branch) and Mellanox OFED 4.9-2.2.4.0.

The only error messages I have found are the dmesg lines pasted at the start of this message and the I/O error returned by lctl ping.

Any help is greatly appreciated.

Thanks,
Vinícius.

_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org