Damir,

We see similar things in our environment (~3.5k nodes) that seem to correlate 
with GPFS recovery events. I did some digging, and it seemed to me that these 
errors more or less mean that one side of the VERBS connection hung up on the 
other. The message format looks a little alarming, but I think it's innocuous. 
I'm curious to hear what others have to say.
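If it helps, here's a rough sketch (plain shell/awk, nothing GPFS-specific) of how I tallied which peers and which completion-status codes show up most often. The `log` variable here is just one sample line copied from your message; in practice you'd feed the whole mmfs log through it:

```shell
# Rough sketch: pull the completion-status code and the peer IP out of an
# mmfs "rdma read error" line so repeats can be counted per peer.
log='Jun 25 23:41:30 gssio2 mmfs: [E] VERBS RDMA rdma read error IBV_WC_RETRY_EXC_ERR to 172.41.125.27 (qnode4111-ib0.quest) on mlx5_0 port 1 fabnum 0 vendor_err 129'

echo "$log" | awk '/VERBS RDMA rdma read error/ {
    for (i = 1; i <= NF; i++) {
        if ($i == "error") err  = $(i + 1)   # e.g. IBV_WC_RETRY_EXC_ERR
        if ($i == "to")    peer = $(i + 1)   # peer IP address
    }
    print err, peer
}'
# prints: IBV_WC_RETRY_EXC_ERR 172.41.125.27
```

Piping that over the full log into `sort | uniq -c | sort -rn` makes it easy to see whether the errors cluster on a few nodes (which pointed at recovery events for us) or are spread evenly.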

-Aaron



From: Damir Krstic
Sent: 6/26/16, 11:23 AM
To: gpfsug main discussion list
Subject: [gpfsug-discuss] verbs rdma errors in logs
We recently enabled verbs/rdma on our IB network (previously we used IPoIB 
exclusively) and are now getting all sorts of errors/warnings in the logs:
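For reference, the settings we're running with can be confirmed on any cluster node; this is just a sketch using the standard Spectrum Scale query command (guarded so it's harmless on a node without the tooling in PATH):

```shell
# Sketch: confirm which RDMA settings the daemon actually picked up.
if command -v mmlsconfig >/dev/null 2>&1; then
    mmlsconfig verbsRdma     # expect: verbsRdma enable
    mmlsconfig verbsPorts    # e.g. verbsPorts mlx5_0/1
else
    echo "mmlsconfig not in PATH on this host"
fi
```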

Jun 25 23:41:30 gssio2 mmfs: [E] VERBS RDMA rdma read error 
IBV_WC_RETRY_EXC_ERR to 172.41.125.27 (qnode4111-ib0.quest) on mlx5_0 port 1 
fabnum 0 vendor_err 129
Jun 25 23:41:30 gssio2 mmfs: [E] VERBS RDMA closed connection to 172.41.125.27 
(qnode4111-ib0.quest) on mlx5_0 port 1 fabnum 0 due to RDMA read error 
IBV_WC_RETRY_EXC_ERR index 1589

Jun 25 20:40:05 gssio2 mmfs: [N] VERBS RDMA closed connection to 172.41.124.12 
(qnode4054-ib0.quest) on mlx5_0 port 1 fabnum 0 index 1417

Jun 25 qnode6131-ib0.quest.it.northwestern.edu) on mlx5_0 port 1 fabnum 0 index 
195

Jun 25 qnode6131-ib0.quest.it.northwestern.edu) on mlx5_0 port 1 fabnum 0 index 
1044

Something to note (not sure if this is important or not) is that our ESS 
storage cluster and our login nodes are in connected mode with a 64K MTU, while 
all compute nodes are in datagram mode with a 2.4K MTU.
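In case anyone wants to compare: the mode/MTU mix can be checked per node from sysfs (assuming the IPoIB interface is named ib0 as on our nodes; the check is guarded so it does nothing harmful on a host without one):

```shell
# Sketch: report IPoIB mode (connected vs datagram) and MTU for one interface.
IFACE=ib0    # assumption: IPoIB interface name on our nodes
if [ -r "/sys/class/net/$IFACE/mode" ]; then
    echo "mode: $(cat /sys/class/net/$IFACE/mode)"          # 'connected' or 'datagram'
    echo "mtu:  $(cat /sys/class/net/$IFACE/mtu)"           # e.g. 65520 or 2044
else
    echo "no IPoIB interface $IFACE on this host"
fi
```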

Are these messages something to be concerned about? The cluster seems to be 
performing well, and although there are some node ejections, their rate does 
not seem any higher than before we turned on verbs/rdma.

Thanks,
Damir
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
