Thanks!
I’ve double-checked the SELinux state, and it is disabled on all the ESS
nodes as well as on all the client nodes.
mmfsd is running as root on all nodes as well.
It seems a bit strange that this would be a permissions issue, though. I forgot
to state this in my original question, but the issue comes and goes: it can
affect some clients while leaving others untouched at the same time, and which
clients are affected changes over time as well.
Just a thought: how do InfiniBand queue pairs react to time skew between
nodes?

For future reference, where did you find the specification of ibv_create_qp 
error 13?
I must have been looking in all the wrong places, because I’ve been unable to 
find the meaning of this error.
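
As a side note, assuming the number mmfsd logs really is the plain errno value
returned by the failed ibv_create_qp() call, it can be translated with
strerror(); a minimal C sketch:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* Assumption: the "err 13" in the Scale log is a raw errno value. */
        printf("errno 13 = %s\n", strerror(13)); /* "Permission denied" (EACCES) on Linux */
        return 0;
    }

On Linux the errno names and values are documented in errno(3) and <errno.h>.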

Regards,
Andreas
_____________________________________________
[cid:[email protected]]

Andreas Mattsson
Systems Engineer

MAX IV Laboratory
Lund University
P.O. Box 118, SE-221 00 Lund, Sweden
Visiting address: Fotongatan 2, 225 94 Lund
Mobile: +46 706 64 95 44
[email protected]<mailto:[email protected]>
www.maxiv.se<http://www.maxiv.se/>

Från: [email protected] 
[mailto:[email protected]] För Knister, Aaron S. 
(GSFC-606.2)[COMPUTER SCIENCE CORP]
Skickat: den 5 december 2017 14:24
Till: gpfsug main discussion list <[email protected]>
Ämne: Re: [gpfsug-discuss] Infiniband connection rejected, ibv_create_qp err 13



Looks like 13 is EACCES (“Permission denied”), meaning the process apparently
didn’t have permission to create the QP of the desired type, which is odd since
mmfsd runs as root. Is there any remote chance SELinux is enabled (e.g. check
with sestatus)? I’d think mmfsd would run unconfined in the default policy, but
maybe it didn’t transition correctly.
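
For context, a minimal sketch of where that errno would surface, assuming a
plain libibverbs setup (this is not how mmfsd actually configures its QPs; the
device choice, CQ size and RC QP attributes below are placeholder values):

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <infiniband/verbs.h>

    int main(void)
    {
        int num = 0;
        struct ibv_device **devs = ibv_get_device_list(&num);
        if (!devs || num == 0) {
            fprintf(stderr, "no RDMA devices found\n");
            return 1;
        }

        /* Open the first device; mmfsd would use the ports named in verbsPorts. */
        struct ibv_context *ctx = ibv_open_device(devs[0]);
        struct ibv_pd *pd = ctx ? ibv_alloc_pd(ctx) : NULL;
        struct ibv_cq *cq = ctx ? ibv_create_cq(ctx, 16, NULL, NULL, 0) : NULL;
        if (!pd || !cq) {
            fprintf(stderr, "setup failed: errno %d (%s)\n", errno, strerror(errno));
            return 1;
        }

        /* Placeholder RC QP attributes, not the ones Scale uses. */
        struct ibv_qp_init_attr attr = {
            .send_cq = cq,
            .recv_cq = cq,
            .qp_type = IBV_QPT_RC,
            .cap     = { .max_send_wr = 16, .max_recv_wr = 16,
                         .max_send_sge = 1, .max_recv_sge = 1 },
        };

        struct ibv_qp *qp = ibv_create_qp(pd, &attr);
        if (!qp)
            fprintf(stderr, "ibv_create_qp failed: errno %d (%s)\n",
                    errno, strerror(errno));
        else
            printf("QP %u created OK\n", qp->qp_num);

        /* Cleanup omitted for brevity. */
        ibv_free_device_list(devs);
        return qp ? 0 : 1;
    }

Building it with gcc and -libverbs and running it on an affected node while the
problem is present might show whether RC QP creation fails outside of Scale as
well.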
On December 5, 2017 at 08:16:49 EST, Andreas Mattsson
<[email protected]> wrote:

Hi.



Has anyone here experienced VERBS RDMA connection request rejects on Scale NSD
servers with the error message “ibv_create_qp err 13”?

I’m having issues with this on an IBM ESS system.



The error mostly affects only one of the two GSSIO nodes, and it moves with the
node even if I move all four of its InfiniBand links to the same InfiniBand
switch that the working node is connected to.

The issue affects client nodes in different blade chassis, going through
different InfiniBand switches and cables, and also non-blade nodes running a
slightly different OS setup and different InfiniBand HCAs.

MPI jobs on the client nodes can communicate over the InfiniBand fabric without
issues.

Upgrading all switches and HCAs to the latest firmware and making sure that the
client nodes run the same OFED version as the ESS has had no impact on the
issue.

When the issue is present, I can still ibping between the nodes, ibroute shows
a working and correct path between the nodes that get connection rejects, and
if I set up IPoIB, IP traffic works on the afflicted interfaces.



I have opened a PMR with IBM on the issue, so asking here is a parallel track 
for trying to find a solution to this.



Any help or suggestions are appreciated.

Regards,

Andreas Mattsson

_____________________________________________

[mid:[email protected]/[email protected]]

Andreas Mattsson
Systems Engineer



MAX IV Laboratory
Lund University
P.O. Box 118, SE-221 00 Lund, Sweden
Visiting address: Fotongatan 2, 225 94 Lund
Mobile: +46 706 64 95 44
[email protected]<mailto:[email protected]>
www.maxiv.se<http://www.maxiv.se/>


_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
