Hi,

Yesterday I had to analyze a SIGSEGV occurring after the following message had been output:

[.... adjust_cq] cannot resize completion queue, error: 22
What I found is the following:

When ibv_resize_cq() fails to resize a CQ (in my case it returned EINVAL), adjust_cq() returns an error and create_srq() is not called by mca_btl_openib_size_queues().

Note: One of our InfiniBand specialists told me that EINVAL was returned in that case because we were asking for more CQ entries than the maximum available.

mca_bml_r2_add_btls() goes on executing.

Then qp_create_all() is called (connect/btl_openib_connect_oob.c). ibv_create_qp() succeeds even though init_attr.srq is a NULL pointer (remember that create_srq() has not been called previously).

Since all the QPs have been created successfully, qp_create_all() then calls:

mca_btl_openib_endpoint_post_recvs()
  --> mca_btl_openib_post_srr()
    --> ibv_post_srq_recv() on a NULL SRQ ==> SIGSEGV

If I'm not wrong in the analysis above, we have the choice between 2 solutions to fix this problem:

1. If EINVAL is returned by ibv_resize_cq() in adjust_cq(), treat this like the ENOSYS case: do not return an error, since the CQ has been created successfully, possibly with fewer entries than needed, but it is there. Doing this, we assume that EINVAL will always be the symptom of a "too many entries asked for" error from the IB stack. I don't know whether that assumption always holds. I also don't know whether this would imply a degraded mode in terms of performance.

2. Fix mca_bml_r2_add_btls() to exit cleanly if an error occurs during btl_add_procs().

FYI I tested solution #1 and it worked.

Any suggestion or comment would be welcome.

Regards,
Nadia

--
Nadia Derbey <nadia.der...@bull.net>
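PS: For reference, the idea behind solution #1 can be sketched as below. This is a standalone illustration and not the actual Open MPI code: struct sketch_cq, sketch_resize_cq() and sketch_adjust_cq() are hypothetical stand-ins for the real CQ, ibv_resize_cq() and adjust_cq(), and the stand-in is assumed to return 0 or an errno value, as ibv_resize_cq() did in my case.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a completion queue (not the libibverbs struct). */
struct sketch_cq {
    int cqe;             /* entries currently available */
    int max_cqe;         /* assumed device limit; <= 0 means "resize unsupported" */
    int simulated_error; /* if non-zero, the resize fails with this errno value */
};

/* Hypothetical stand-in for ibv_resize_cq(): returns 0 or an errno value. */
static int sketch_resize_cq(struct sketch_cq *cq, int cqe)
{
    if (cq->simulated_error != 0) {
        return cq->simulated_error;
    }
    if (cq->max_cqe <= 0) {
        return ENOSYS;   /* the stack cannot resize CQs at all */
    }
    if (cqe > cq->max_cqe) {
        return EINVAL;   /* more entries requested than the maximum available */
    }
    cq->cqe = cqe;
    return 0;
}

/*
 * Solution #1: treat EINVAL like ENOSYS. The CQ already exists, possibly
 * with fewer entries than wanted, so setup is allowed to continue.
 * Assumption: EINVAL always means "too many entries asked for".
 */
static int sketch_adjust_cq(struct sketch_cq *cq, int wanted_cqe)
{
    int rc;

    assert(cq != NULL);
    rc = sketch_resize_cq(cq, wanted_cqe);
    if (rc != 0 && rc != ENOSYS && rc != EINVAL) {
        return -1;       /* any other failure is still reported as an error */
    }
    return 0;            /* resized, or existing CQ kept as-is */
}
```

With a CQ of 16 entries and an assumed limit of 64, asking for 128 entries keeps the 16-entry CQ and reports success, so the SRQ creation step would no longer be skipped. Whether running with the smaller CQ is acceptable is exactly the open question in solution #1.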