I think the RDMA errors are secondary to what's going on with either your IPoIB 
or Ethernet fabric, which I assume is causing the IPoIB communication breakdowns 
and the expulsions. We've had entire IB fabrics go offline, and as long as the 
nodes weren't depending on them for daemon communication, nobody got expelled. 
Do you have a subnet defined for your IPoIB network, or are your nodes' daemon 
interfaces already set to their IPoIB interfaces? Have you checked your subnet 
manager (SM) logs?

From: Damir Krstic
Sent: 1/11/17, 9:39 AM
To: gpfsug main discussion list
Subject: [gpfsug-discuss] nodes being ejected out of the cluster
We are running GPFS 4.2 on our cluster (around 700 compute nodes). Our storage 
(ESS GL6) is also running GPFS 4.2. Compute nodes and storage are connected via 
InfiniBand (FDR14). At the time of the ESS implementation we were instructed to 
enable RDMA in addition to IPoIB; previously we only ran IPoIB on our GPFS 3.5 
cluster.
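
(For context, verbs RDMA alongside IPoIB is typically enabled with mmchconfig 
settings along these lines; the port list below is illustrative rather than our 
exact configuration:

  mmchconfig verbsRdma=enable       # turn on RDMA for daemon data transfers
  mmchconfig verbsPorts="mlx4_0/1"  # HCA/port(s) to use, device/port syntax
  mmlsconfig verbsRdma              # verify the current values
  mmlsconfig verbsPorts
)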

Ever since the implementation (back in July of 2016) we have seen a lot of 
compute nodes being ejected. The ejections are usually preceded by the 
following messages:

Jan 11 02:03:15 quser13 mmfs: [E] VERBS RDMA rdma send error 
IBV_WC_RNR_RETRY_EXC_ERR to 172.41.2.5 (gssio2-fdr) on mlx4_0 port 1 fabnum 0 
vendor_err 135
Jan 11 02:03:15 quser13 mmfs: [E] VERBS RDMA closed connection to 172.41.2.5 
(gssio2-fdr) on mlx4_0 port 1 fabnum 0 due to send error 
IBV_WC_RNR_RETRY_EXC_ERR index 2
Jan 11 02:03:26 quser13 mmfs: [E] VERBS RDMA rdma send error 
IBV_WC_RNR_RETRY_EXC_ERR to 172.41.2.5 (gssio2-fdr) on mlx4_0 port 1 fabnum 0 
vendor_err 135
Jan 11 02:03:26 quser13 mmfs: [E] VERBS RDMA closed connection to 172.41.2.5 
(gssio2-fdr) on mlx4_0 port 1 fabnum 0 due to send error IBV_WC_WR_FLUSH_ERR 
index 1
Jan 11 02:03:26 quser13 mmfs: [E] VERBS RDMA rdma send error 
IBV_WC_RNR_RETRY_EXC_ERR to 172.41.2.5 (gssio2-fdr) on mlx4_0 port 1 fabnum 0 
vendor_err 135
Jan 11 02:03:26 quser13 mmfs: [E] VERBS RDMA closed connection to 172.41.2.5 
(gssio2-fdr) on mlx4_0 port 1 fabnum 0 due to send error 
IBV_WC_RNR_RETRY_EXC_ERR index 2
Jan 11 02:06:38 quser11 mmfs: [E] VERBS RDMA rdma send error 
IBV_WC_RNR_RETRY_EXC_ERR to 172.41.2.5 (gssio2-fdr) on mlx4_0 port 1 fabnum 0 
vendor_err 135
Jan 11 02:06:38 quser11 mmfs: [E] VERBS RDMA closed connection to 172.41.2.5 
(gssio2-fdr) on mlx4_0 port 1 fabnum 0 due to send error IBV_WC_WR_FLUSH_ERR 
index 400

Even our ESS I/O server sometimes ends up being ejected (case in point, 
yesterday morning):

Jan 10 11:23:42 gssio2 mmfs: [E] VERBS RDMA rdma send error 
IBV_WC_RNR_RETRY_EXC_ERR to 172.41.2.1 (gssio1-fdr) on mlx5_1 port 1 fabnum 0 
vendor_err 135
Jan 10 11:23:42 gssio2 mmfs: [E] VERBS RDMA closed connection to 172.41.2.1 
(gssio1-fdr) on mlx5_1 port 1 fabnum 0 due to send error 
IBV_WC_RNR_RETRY_EXC_ERR index 3001
Jan 10 11:23:43 gssio2 mmfs: [E] VERBS RDMA rdma send error 
IBV_WC_RNR_RETRY_EXC_ERR to 172.41.2.1 (gssio1-fdr) on mlx5_1 port 2 fabnum 0 
vendor_err 135
Jan 10 11:23:43 gssio2 mmfs: [E] VERBS RDMA closed connection to 172.41.2.1 
(gssio1-fdr) on mlx5_1 port 2 fabnum 0 due to send error 
IBV_WC_RNR_RETRY_EXC_ERR index 2671
Jan 10 11:23:43 gssio2 mmfs: [E] VERBS RDMA rdma send error 
IBV_WC_RNR_RETRY_EXC_ERR to 172.41.2.1 (gssio1-fdr) on mlx5_0 port 2 fabnum 0 
vendor_err 135
Jan 10 11:23:43 gssio2 mmfs: [E] VERBS RDMA closed connection to 172.41.2.1 
(gssio1-fdr) on mlx5_0 port 2 fabnum 0 due to send error 
IBV_WC_RNR_RETRY_EXC_ERR index 2495
Jan 10 11:23:44 gssio2 mmfs: [E] VERBS RDMA rdma send error 
IBV_WC_RNR_RETRY_EXC_ERR to 172.41.2.1 (gssio1-fdr) on mlx5_0 port 1 fabnum 0 
vendor_err 135
Jan 10 11:23:44 gssio2 mmfs: [E] VERBS RDMA closed connection to 172.41.2.1 
(gssio1-fdr) on mlx5_0 port 1 fabnum 0 due to send error 
IBV_WC_RNR_RETRY_EXC_ERR index 3077
Jan 10 11:24:11 gssio2 mmfs: [N] Node 172.41.2.1 (gssio1-fdr) lease renewal is 
overdue. Pinging to check if it is alive
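
(When lease renewal goes overdue like this, whether a node gets expelled comes 
down to the cluster's failure detection settings. A hedged way to see what is 
in effect, and how the daemon sees its peers:

  mmlsconfig failureDetectionTime   # how long before a node is declared failed
  mmlsconfig leaseRecoveryWait      # wait before recovery after a lost lease
  mmdiag --network                  # live view of this node's daemon connections
)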

I've had multiple PMRs open for this issue, and I am told that our ESS needs 
code-level upgrades to fix it. Looking at the errors, though, I think the 
problem is InfiniBand related, and I am wondering if anyone on this list has 
seen similar issues?

Thanks for your help in advance.

Damir
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
