Hi,
We started having the same issue after upgrading our servers from 2.12.9 to
2.15.5 and our clients from 2.15.3 to 2.15.5. Only a couple of older OSSs had
the issue. They used ConnectX-3 FDR cards and the mlx4 driver. After
replacing them with newer ConnectX-4 cards, which use the mlx5 driver, we
haven't had the issue so far. We still have FDR/mlx4 clients using the file system.
The drivers are the ones provided by the OS (Rocky 8 on servers and Rocky 9 on clients).
Are you using IB cards that use the mlx4 driver on the OSSs?
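One way to check which driver an interface is bound to is to resolve its sysfs driver symlink. This is only a sketch: the function name ib_driver and its optional second argument (an alternative sysfs root, there purely so it can be tried without IB hardware) are made up for illustration, and ibs1f0 is just the interface name from the net show output quoted further down. A ConnectX-3 card would be expected to report mlx4_core and a ConnectX-4 card mlx5_core.

```shell
# Print the kernel driver behind a network interface by resolving its sysfs driver symlink.
# $1 = interface name, $2 = sysfs root (hypothetical override, defaults to /sys).
ib_driver() {
    link="${2:-/sys}/class/net/$1/device/driver"
    if [ ! -e "$link" ]; then
        # No such interface, or no driver symlink exposed for it
        echo "unknown"
        return 1
    fi
    basename "$(readlink -f "$link")"
}

# Example (run on the OSS or client in question):
#   ib_driver ibs1f0
```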
Cheers,
Hans Henrik
On 04/09/2024 19.50, Alastair Basden via lustre-discuss wrote:
Hi Makia,
Yes, sorry, that should be:
From the client (172.18.178.216):
lnetctl ping 172.18.185.8@o2ib
manage:
    - ping:
          errno: -1
          descr: failed to ping 172.18.185.8@o2ib: Input/output error
From the server (172.18.185.8):
lnetctl ping 172.18.178.216@o2ib
manage:
    - ping:
          errno: -1
          descr: failed to ping 172.18.178.216@o2ib: Input/output error
And yet a standard ping works.
Pinging to/from other clients and other OSSs works, i.e. the file
system is fully functional and in production; just this client and one
or two others are having problems.
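To sweep many NIDs and see which ones fail, the ping output can be classified mechanically. A minimal sketch, assuming a failed ping always carries a negative errno line as in the output above; the function name lnet_ping_status is hypothetical:

```shell
# Classify captured `lnetctl ping` output read from stdin:
# prints FAIL if it contains a negative errno line, otherwise OK.
lnet_ping_status() {
    if grep -Eq '^[[:space:]]*errno:[[:space:]]*-[0-9]+'; then
        echo FAIL
    else
        echo OK
    fi
}

# Example:
#   lnetctl ping 172.18.185.8@o2ib 2>&1 | lnet_ping_status
```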
We are one link down on the core-edge connection of the edge switch
that this client is attached to. Given that a standard ping works,
connectivity is there. But perhaps there is some RDMA issue?
Cheers,
Alastair.
On Wed, 4 Sep 2024, Makia Minich wrote:
The IP for the nid in your “net show” isn’t any of the nids you
pinged. Is an address misconfigured somewhere?
On Sep 4, 2024, at 2:52 AM, Alastair Basden via lustre-discuss
<[email protected]> wrote:
Hi,
We are having some LNet issues, and wonder if anyone can advise.
Client is 2.15.5, server is 2.12.6.
Fabric is IB.
The file system mounts, but OSTs on a couple of OSSs are not
contactable.
Client and servers can ping each other over the IB network.
However, an lnetctl ping fails between the bad OSSs and this client.
To other clients it's all fine,
i.e. for most of the clients it is working well, just not for one or
two.
Server to client:
lnetctl ping 172.18.178.201@o2ib
manage:
    - ping:
          errno: -1
          descr: failed to ping 172.18.178.201@o2ib: Input/output error
Client to server:
manage:
    - ping:
          errno: -1
          descr: failed to ping 172.18.185.10@o2ib: Input/output error
And the o2ib network is noted as down:
lnetctl net show --net o2ib --verbose
net:
    - net type: o2ib
      local NI(s):
        - nid: 172.18.178.216@o2ib
          status: down
          interfaces:
              0: ibs1f0
          statistics:
              send_count: 45032
              recv_count: 45030
              drop_count: 0
          tunables:
              peer_timeout: 100
              peer_credits: 32
              peer_buffer_credits: 0
              credits: 256
          lnd tunables:
              peercredits_hiw: 16
              map_on_demand: 1
              concurrent_sends: 32
              fmr_pool_size: 512
              fmr_flush_trigger: 384
              fmr_cache: 1
              ntx: 512
              conns_per_peer: 1
          dev cpt: 0
          CPT: "[0,1]"
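The nid/status pairs can be pulled out of that output with a small filter, which makes a down NI easy to spot when checking many nodes. A sketch only: ni_status is a hypothetical helper name, and it assumes the field layout shown above.

```shell
# Print "<nid> <status>" for each local NI found in `lnetctl net show` output on stdin.
ni_status() {
    awk '
        $1 == "-" && $2 == "nid:" { nid = $3 }       # "- nid: 172.18.178.216@o2ib"
        $1 == "nid:"              { nid = $2 }       # same field without the list dash
        $1 == "status:"           { print nid, $2 }  # "status: down"
    '
}

# Example:
#   lnetctl net show --net o2ib --verbose | ni_status
```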
Could this be a hardware error, even though the IB is working?
Could it be related to https://jira.whamcloud.com/browse/LU-16378 ?
Are there any suggestions on how to bring up the lnet network or fix
the problems?
Thanks,
Alastair.
_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org