Hi Philipp, FYI. We had exactly the same IBV_WC_RETRY_EXC_ERR error message in our GPFS client log, along with another client error in the syslog: kernel: ib0: ipoib_cm_handle_tx_wc_rss: failed cm send event (status=12, wrid=83 vend_err 81). The root cause was a bad IB cable connecting a leaf switch to the core switch on the route the client used. When we replaced the cable, the problem was solved and there were no more errors. We don't really have an IPoIB setup, so the problem might be different from yours, but doesn't the error message suggest that when your GPFS daemon tries to use mlx5_1, the packets are discarded and no connection is made? Did you set up IB bonding?
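
For reference, this is roughly how the bad cable showed up on our side; a minimal sketch, assuming the standard infiniband-diags tools are available on a node with fabric access (the switch name is a placeholder, not from your setup):

    # Error counters across the fabric; a bad cable usually shows up as
    # SymbolErrorCounter / LinkDownedCounter climbing on one switch port.
    ibqueryerrors

    # Link state, width and speed of each link; "leaf01" is a placeholder
    # for the suspect leaf switch in your fabric.
    iblinkinfo | grep -i leaf01

    # Confirm the HCA port itself is Active/LinkUp on the client.
    ibstat mlx5_1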
Wei Guo
HPC Administrator
UTSW

Message: 1
Date: Mon, 12 Mar 2018 21:09:14 +0100
From: Philipp Helo Rehs <[email protected]>
To: [email protected]
Subject: [gpfsug-discuss] RDMA read error IBV_WC_RETRY_EXC_ERR

Hello,

I have been reading your mailing list for some weeks and I am quite impressed by the knowledge and information shared here.

We have a GPFS cluster with 4 NSDs and 120 clients on InfiniBand. Our NSD servers have two InfiniBand ports on separate cards, mlx5_0 and mlx5_1. We have RDMA-CM enabled and IPv6 enabled on all nodes, and we have added an IPoIB IP to all interfaces.

But when we enable the second interface, we get the following errors from all nodes:

2018-03-12_20:49:38.923+0100: [E] VERBS RDMA closed connection to 10.100.0.83 (hilbert83-ib) on mlx5_1 port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 45
2018-03-12_20:49:38.923+0100: [E] VERBS RDMA rdma read error IBV_WC_RETRY_EXC_ERR to 10.100.0.129 (hilbert129-ib) on mlx5_1 port 1 fabnum 0 vendor_err 129
2018-03-12_20:49:38.923+0100: [E] VERBS RDMA closed connection to 10.100.0.129 (hilbert129-ib) on mlx5_1 port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 31
2018-03-12_20:49:38.923+0100: [E] VERBS RDMA rdma read error IBV_WC_RETRY_EXC_ERR to 10.100.0.134 (hilbert134-ib) on mlx5_1 port 1 fabnum 0 vendor_err 129

I have read that this issue can happen when verbsRdmasPerConnection is too low. We tried to increase the value and it got better, but the problem is not fixed.

Current config:

minReleaseLevel 4.2.3.0
maxblocksize 16m
cipherList AUTHONLY
cesSharedRoot /ces
ccrEnabled yes
failureDetectionTime 40
leaseRecoveryWait 40
[hilbert1-ib,hilbert2-ib]
worker1Threads 256
maxReceiverThreads 256
[common]
tiebreakerDisks vd3;vd5;vd7
minQuorumNodes 2
verbsLibName libibverbs.so.1
verbsRdma enable
verbsRdmasPerNode 256
verbsRdmaSend no
scatterBufferSize 262144
pagepool 16g
verbsPorts mlx4_0/1
[nsdNodes]
verbsPorts mlx5_0/1 mlx5_1/1
[hilbert200-ib,hilbert201-ib,hilbert202-ib,hilbert203-ib,hilbert204-ib,hilbert205-ib,hilbert206-ib]
verbsPorts mlx4_0/1 mlx4_1/1
[common]
maxMBpS 11200
[common]
verbsRdmaCm enable
verbsRdmasPerConnection 14
adminMode central

Kind regards,
Philipp Rehs

---------------------------
Zentrum für Informations- und Medientechnologie
Kompetenzzentrum für wissenschaftliches Rechnen und Speichern
Heinrich-Heine-Universität Düsseldorf
Universitätsstr. 1
Raum 25.41.00.51
40225 Düsseldorf / Germany
Tel: +49-211-81-15557
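
In case it helps anyone hitting the same IBV_WC_RETRY_EXC_ERR pattern, here is a rough sketch of how the settings in the config above could be inspected and adjusted. The commands are standard mmlsconfig/mmchconfig usage, but the verbsRdmasPerConnection value is only an illustrative example, not a recommendation, and limits vary by Spectrum Scale release:

    # Current values of the RDMA-related settings.
    mmlsconfig verbsPorts
    mmlsconfig verbsRdmasPerConnection

    # Example only: raise verbsRdmasPerConnection cluster-wide and keep both
    # mlx5 ports on the NSD servers (nsdNodes is the built-in node class).
    mmchconfig verbsRdmasPerConnection=32
    mmchconfig verbsPorts="mlx5_0/1 mlx5_1/1" -N nsdNodes

    # verbsPorts changes typically require a daemon restart on the affected nodes.
    mmshutdown -N nsdNodes && mmstartup -N nsdNodes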
