Hi,

Yesterday I had to analyze a SIGSEGV that occurred after the following
message had been output:
[.... adjust_cq] cannot resize completion queue, error: 22


What I found is the following:

When ibv_resize_cq() fails to resize a CQ (in my case it returned
EINVAL), adjust_cq() returns an error and create_srq() is not called by
mca_btl_openib_size_queues().

Note: One of our InfiniBand specialists told me that EINVAL was returned
in that case because we were asking for more CQ entries than the maximum
available.

mca_bml_r2_add_btls() goes on executing.

Then qp_create_all() is called (connect/btl_openib_connect_oob.c).
ibv_create_qp() succeeds even though init_attr.srq is a NULL pointer
(remember that create_srq() has not been previously called).

Since all the QPs have been successfully created, qp_create_all() then
calls:
mca_btl_openib_endpoint_post_recvs()
  --> mca_btl_openib_post_srr()
      --> ibv_post_srq_recv() on a NULL SRQ
==> SIGSEGV
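
To make the sequence concrete, here is a small stand-alone model of it in
C. It is only a sketch: every name prefixed with fake_ is a hypothetical
stand-in, not the real Open MPI or libibverbs code, and the NULL check in
fake_post_srr() exists only so that the model reports the problem instead
of crashing the way the real call chain does.

```c
#include <stddef.h>

/* Hypothetical stand-ins; NOT the Open MPI or libibverbs types. */
struct fake_srq { int posted; };
struct fake_btl { struct fake_srq *srq; };

static struct fake_srq the_srq;

/* Models adjust_cq(): passes through the resize error code (0 = success). */
static int fake_adjust_cq(int resize_err)
{
    return resize_err;
}

/* Models create_srq(): the only place where btl->srq gets set. */
static int fake_create_srq(struct fake_btl *btl)
{
    btl->srq = &the_srq;
    return 0;
}

/* Models mca_btl_openib_size_queues(): create_srq() is skipped on error. */
static int fake_size_queues(struct fake_btl *btl, int resize_err)
{
    int rc = fake_adjust_cq(resize_err);
    if (rc != 0) {
        return rc;              /* early return: btl->srq stays NULL */
    }
    return fake_create_srq(btl);
}

/* Models the receive-posting step. The real path dereferences the SRQ
 * without this check, which is where the SIGSEGV comes from. */
static int fake_post_srr(struct fake_btl *btl)
{
    if (btl->srq == NULL) {
        return -1;
    }
    btl->srq->posted++;
    return 0;
}
```

Calling fake_size_queues() with error 22 leaves btl->srq NULL, and the
later fake_post_srr() call is the point where the real code faults.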


If the analysis above is correct, there are two possible ways to fix
this problem:

1. If EINVAL is returned by ibv_resize_cq() in adjust_cq(), treat it
like the ENOSYS case: do not return an error, since the CQ has been
successfully created. It may have fewer entries than needed, but it
is there.

Doing this assumes that EINVAL will always be the symptom of a "too
many entries asked for" error from the IB stack. I don't know whether
that is true.
I also don't know whether running with a smaller CQ would degrade
performance.

2. Fix mca_bml_r2_add_btls() to exit cleanly if an error occurs during
btl_add_procs().
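
The two options above could be sketched roughly as follows. This is not
a patch against the real sources: resize_error_is_fatal(),
add_btls_sketch(), add_procs_fn and the two add_procs_* helpers are all
hypothetical names, and the EINVAL branch encodes the unverified
assumption that EINVAL always means "too many entries asked for".

```c
#include <errno.h>
#include <stddef.h>

/* Option 1 (hypothetical helper): decide whether a failed CQ resize
 * should abort the setup. */
static int resize_error_is_fatal(int err)
{
    if (err == 0) {
        return 0;
    }
    if (err == ENOSYS) {
        return 0;   /* existing behaviour: not treated as an error */
    }
    if (err == EINVAL) {
        return 0;   /* proposed: ASSUMES "too many entries asked for" */
    }
    return 1;
}

/* Option 2 (hypothetical model): stop as soon as one add_procs step
 * fails, and report the error instead of carrying on. */
typedef int (*add_procs_fn)(void);

static int add_btls_sketch(add_procs_fn *mods, size_t n, size_t *n_added)
{
    *n_added = 0;
    for (size_t i = 0; i < n; i++) {
        int rc = mods[i]();
        if (rc != 0) {
            return rc;      /* clean exit: later steps never run */
        }
        (*n_added)++;
    }
    return 0;
}

/* Dummy add_procs implementations, for illustration only. */
static int add_procs_ok(void)   { return 0; }
static int add_procs_fail(void) { return -1; }
```

With option 1 the setup continues with the CQ at its original size. With
option 2 the failure is reported before any QP is created, so the NULL
SRQ is never reached.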

FYI I tested solution #1 and it worked...

Any suggestions or comments would be welcome.

Regards,
Nadia

-- 
Nadia Derbey <nadia.der...@bull.net>
