Hi, Philipp,

FYI: we had exactly the same IBV_WC_RETRY_EXC_ERR error message in our GPFS
client log, along with another client error in the syslog: kernel: ib0:
ipoib_cm_handle_tx_wc_rss: failed cm send event (status=12, wrid=83 vend_err
81). The root cause was a bad IB cable connecting a leaf switch to the core
switch that the client used as its route. Once we replaced the cable, the
problem was solved and the errors stopped. We don't really have an IPoIB setup.
Your problem might be different from ours, but does the error message suggest
that when your GPFS daemon tries to use mlx5_1, the packets are discarded and
no connection is established? Did you set up IB bonding?
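In case it helps: the way we narrowed it down to a cable was by scanning the fabric for ports with growing error counters. A rough sketch, assuming the standard infiniband-diags tools (ibqueryerrors, iblinkinfo) are installed on a node with fabric access:

```shell
# Report fabric ports with nonzero error counters; bad cables typically
# show up as growing symbol errors or link-downed counts on one link.
ibqueryerrors --details

# Show link state, speed, and width for every port; a link that negotiated
# below its rated width (e.g. 1x instead of 4x) also points at a bad cable.
iblinkinfo

# Clear the counters, wait a while under load, and re-scan to see which
# errors are still incrementing (live fault vs. historical noise).
ibqueryerrors --clear-errors
sleep 600
ibqueryerrors
```

The link between the leaf and core switch that keeps accumulating errors is the one whose cable (or transceiver) is suspect.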

Wei Guo
HPC Administrator
UTSW



Message: 1
Date: Mon, 12 Mar 2018 21:09:14 +0100
From: Philipp Helo Rehs <[email protected]>
To: [email protected]
Subject: [gpfsug-discuss] RDMA read error IBV_WC_RETRY_EXC_ERR
Message-ID: <[email protected]>
Content-Type: text/plain; charset=utf-8

Hello,
I have been reading your mailing list for some weeks and I am quite impressed by
the knowledge and information shared here.

We have a GPFS cluster with 4 NSD servers and 120 clients on InfiniBand.

Our NSD servers have two InfiniBand ports on separate cards,
mlx5_0 and mlx5_1. We have RDMA-CM and IPv6 enabled on all nodes, and we
have added an IPoIB IP to all interfaces.

But when we enable the second interface, we get the following errors from all
nodes:

2018-03-12_20:49:38.923+0100: [E] VERBS RDMA closed connection to
10.100.0.83 (hilbert83-ib) on mlx5_1 port 1 fabnum 0 due to RDMA read error 
IBV_WC_RETRY_EXC_ERR index 45
2018-03-12_20:49:38.923+0100: [E] VERBS RDMA rdma read error 
IBV_WC_RETRY_EXC_ERR to 10.100.0.129 (hilbert129-ib) on mlx5_1 port 1 fabnum 0 
vendor_err 129
2018-03-12_20:49:38.923+0100: [E] VERBS RDMA closed connection to
10.100.0.129 (hilbert129-ib) on mlx5_1 port 1 fabnum 0 due to RDMA read error 
IBV_WC_RETRY_EXC_ERR index 31
2018-03-12_20:49:38.923+0100: [E] VERBS RDMA rdma read error 
IBV_WC_RETRY_EXC_ERR to 10.100.0.134 (hilbert134-ib) on mlx5_1 port 1 fabnum 0 
vendor_err 129

I have read that this issue can happen when verbsRdmasPerConnection is too low.
We tried increasing the value and it got better, but the problem is not fixed.


Current config:
minReleaseLevel 4.2.3.0
maxblocksize 16m
cipherList AUTHONLY
cesSharedRoot /ces
ccrEnabled yes
failureDetectionTime 40
leaseRecoveryWait 40
[hilbert1-ib,hilbert2-ib]
worker1Threads 256
maxReceiverThreads 256
[common]
tiebreakerDisks vd3;vd5;vd7
minQuorumNodes 2
verbsLibName libibverbs.so.1
verbsRdma enable
verbsRdmasPerNode 256
verbsRdmaSend no
scatterBufferSize 262144
pagepool 16g
verbsPorts mlx4_0/1
[nsdNodes]
verbsPorts mlx5_0/1 mlx5_1/1
[hilbert200-ib,hilbert201-ib,hilbert202-ib,hilbert203-ib,hilbert204-ib,hilbert205-ib,hilbert206-ib]
verbsPorts mlx4_0/1 mlx4_1/1
[common]
maxMBpS 11200
[common]
verbsRdmaCm enable
verbsRdmasPerConnection 14
adminMode central
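Since the errors only appear once mlx5_1 is enabled, it may also be worth confirming on the NSD servers that the second port is actually active, at full rate, and on the expected fabric before tuning further. A minimal check, assuming the usual Mellanox/OFED userspace tools are installed:

```shell
# Link state, rate, LID, and physical state of both HCAs; port 1 of each
# should show State: Active and Physical state: LinkUp.
ibstat mlx5_0
ibstat mlx5_1

# Device/port capabilities, including which fabric (subnet prefix) each
# port is attached to.
ibv_devinfo -d mlx5_1 -v
```

If mlx5_1 sits on a different fabric than mlx5_0, the fabnum in the verbsPorts setting (e.g. mlx5_1/1/1) has to reflect that, otherwise clients will try to reach it over the wrong fabric.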


Kind regards
 Philipp Rehs

---------------------------

Zentrum für Informations- und Medientechnologie
Kompetenzzentrum für wissenschaftliches Rechnen und Speichern

Heinrich-Heine-Universität Düsseldorf
Universitätsstr. 1
Raum 25.41.00.51
40225 Düsseldorf / Germany
Tel: +49-211-81-15557




_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss