On Mon, 2009-10-26 at 15:06 -0700, Paul H. Hargrove wrote:
> Retrying w/ fewer CQ entries as Jeff describes is a good idea to help
> ensure that EINVAL actually does signify that the count exceeds the max
> (instead of just assuming this is so). If it actually was signifying
> some other error case, then one would probably not want to continue.

Sorry for the delay, but I had many other things to do...

You'll find a patch proposal in attachment, ready for review.
The only part I'm not sure about is the following hunk:

@@ -496,7 +540,13 @@ int mca_btl_openib_add_procs(
         peers[i] = endpoint;
     }

-    return mca_btl_openib_size_queues(openib_btl, nprocs);
+    rc = mca_btl_openib_size_queues(openib_btl, nprocs);
+    if (OMPI_SUCCESS != rc) {
+        mca_btl_openib_del_procs(btl, nprocs, ompi_procs, peers);
+        opal_bitmap_clear_all_bits(reachable);
+    }
+
+    return rc;

I don't know whether there is a "less violent" way of undoing things.
Anyway, things work well with the patch applied.

You'll also find in attachment:
1. the output without the patch applied
2. the output with the patch applied
3. the output with the patch applied + an emulation of an EINVAL that is
   still returned.

Comments would be welcome.

Regards,
Nadia

>
> -Paul
>
> Jeff Squyres wrote:
> > Thanks for the analysis!
> >
> > We've argued about btl_r2_add_btls() before -- IIRC, the consensus is
> > that we want it to be able to continue even if a BTL fails. So I
> > *think* that your #1 answer is better.
> >
> > However, we might want to try a little harder if EINVAL is returned --
> > perhaps try decreasing number of CQ entries and try again until either
> > we have too few CQ entries to be useful (e.g., 0 or some higher number
> > that is still "too small"), or fail the BTL altogether...?
> >
> > On Oct 23, 2009, at 10:10 AM, Nadia Derbey wrote:
> >
> >> Hi,
> >>
> >> Yesterday I had to analyze a SIGSEGV occurring after the following
> >> message had been output:
> >> [.... adjust_cq] cannot resize completion queue, error: 22
> >>
> >>
> >> What I found is the following:
> >>
> >> When ibv_resize_cq() fails to resize a CQ (in my case it returned
> >> EINVAL), adjust_cq() returns an error and create_srq() is not called by
> >> mca_btl_openib_size_queues().
> >>
> >> Note: One of our infiniband specialists told me that EINVAL was returned
> >> in that case because we were asking for more CQ entries than the max
> >> available.
> >>
> >> mca_bml_r2_add_btls() goes on executing.
> >>
> >> Then qp_create_all() is called (connect/btl_openib_connect_oob.c).
> >> ibv_create_qp() succeeds even though init_attr.srq is a NULL pointer
> >> (remember that create_srq() has not been previously called).
> >>
> >> Since all the QPs have been successfully created, qp_create_all() then
> >> calls:
> >> mca_btl_openib_endpoint_post_recvs()
> >>   --> mca_btl_openib_post_srr()
> >>       --> ibv_post_srq_recv() on a NULL SRQ
> >>           ==> SIGSEGV
> >>
> >>
> >> If I'm not wrong in the analysis above, we have the choice between 2
> >> solutions to fix this problem:
> >>
> >> 1. if EINVAL is returned by ibv_resize_cq() in adjust_cq(), treat this
> >> as the ENOSYS case: do not return an error, since the CQ has
> >> successfully been created, maybe with fewer entries than needed, but it
> >> is there.
> >>
> >> Doing this we assume that EINVAL will always be the symptom of a "too
> >> many entries asked for" error from the IB stack. I don't have the
> >> answer...
> >> + I don't know if this won't imply a degraded mode in terms of
> >> performance.
> >>
> >> 2. Fix mca_bml_r2_add_btls() to cleanly exit if an error occurs during
> >> btl_add_procs().
> >>
> >> FYI I tested solution #1 and it worked...
> >>
> >> Any suggestion or comment would be welcome.
> >>
> >> Regards,
> >> Nadia
> >>
> >> --
> >> Nadia Derbey <nadia.der...@bull.net>
> >>
>
--
Nadia Derbey <nadia.der...@bull.net>

btl/openib: correctly manage ibv_resize_cq() returning EINVAL

When ibv_resize_cq() returns EINVAL, retry several times, lowering the
CQ size each time (because EINVAL may be due to a CQ size that is too high).
If EINVAL is still returned, mca_btl_openib_size_queues() returns an error
too. In that case, since this is a true error, clean everything up by
calling mca_btl_openib_del_procs().


diff -r cf107f1a397e ompi/mca/btl/openib/btl_openib.c
--- a/ompi/mca/btl/openib/btl_openib.c Thu Nov 26 15:05:37 2009 +0100
+++ b/ompi/mca/btl/openib/btl_openib.c Thu Nov 26 15:08:03 2009 +0100
@@ -143,18 +143,24 @@ static inline struct ibv_cq *ibv_create_
 #endif
 }
 
-static int adjust_cq(mca_btl_openib_device_t *device, const int cq)
+static int adjust_cq(mca_btl_openib_device_t *device, const int cq,
+                     uint32_t old_cq_size)
 {
     uint32_t cq_size = device->cq_size[cq];
 
     /* make sure we don't exceed the maximum CQ size and that we
      * don't size the queue smaller than otherwise requested
+     * Keep track of the actual CQ size in the device structure.
      */
-    if(cq_size < mca_btl_openib_component.ib_cq_size[cq])
+    if (cq_size < mca_btl_openib_component.ib_cq_size[cq]) {
         cq_size = mca_btl_openib_component.ib_cq_size[cq];
+        device->cq_size[cq] = cq_size;
+    }
 
-    if(cq_size > (uint32_t)device->ib_dev_attr.max_cqe)
+    if (cq_size > (uint32_t)device->ib_dev_attr.max_cqe) {
         cq_size = device->ib_dev_attr.max_cqe;
+        device->cq_size[cq] = cq_size;
+    }
 
     if(NULL == device->ib_cq[cq]) {
         device->ib_cq[cq] = ibv_create_cq_compat(device->ib_dev_context, cq_size,
@@ -198,8 +204,29 @@ static int adjust_cq(mca_btl_openib_devi
         /* For ConnectX the resize CQ is not implemented and verbs returns -ENOSYS
          * but should return ENOSYS. So it is reason for abs */
         if(rc && ENOSYS != abs(rc)) {
-            BTL_ERROR(("cannot resize completion queue, error: %d", rc));
-            return OMPI_ERROR;
+            int save_cq_size = cq_size;
+
+            /*
+             * EINVAL is returned in case the size asked for is too high.
+             * So try lesser values until we reach the last value that
+             * has succeeded.
+             */
+            while (EINVAL == abs(rc) && cq_size > old_cq_size) {
+                cq_size = old_cq_size + ((cq_size - old_cq_size) / 2);
+                rc = ibv_resize_cq(device->ib_cq[cq], cq_size);
+            }
+            if (rc) {
+                BTL_ERROR(("cannot resize completion queue, error: %d", rc));
+                return OMPI_ERROR;
+            } else {
+                /* Warn that CQ was not resized as originally asked for. */
+                device->cq_size[cq] = cq_size;
+                orte_show_help("help-mpi-btl-openib.txt",
+                               "CQ resized lower", true,
+                               orte_process_info.nodename,
+                               ibv_get_device_name(device->ib_dev),
+                               save_cq_size, cq_size);
+            }
         }
     }
 #endif
@@ -253,6 +280,23 @@ static int mca_btl_openib_size_queues(st
     uint32_t send_cqes, recv_cqes;
     int rc = OMPI_SUCCESS, qp;
     mca_btl_openib_device_t *device = openib_btl->device;
+    uint32_t old_cq_size[2];
+
+    /*
+     * Save current cq_sizes values to be able to rollback in case of failure.
+     * This is useful in the following case:
+     * if adjust_cq() is called twice for the same completion queue, it
+     * may succeed during the first call, i.e. it creates the cq with the
+     * appropriate size. Then it may fail during the 2nd call, while
+     * increasing the cq size.
+     * Thus we keep track of the sizes used during last call to adjust_cq
+     * in order to retry with lesser entries if ever we fail inside adjust_cq.
+     */
+    old_cq_size[BTL_OPENIB_HP_CQ] =
+        openib_btl->device->cq_size[BTL_OPENIB_HP_CQ];
+    old_cq_size[BTL_OPENIB_LP_CQ] =
+        openib_btl->device->cq_size[BTL_OPENIB_LP_CQ];
+
 
     /* figure out reasonable sizes for completion queues */
     for(qp = 0; qp < mca_btl_openib_component.num_qps; qp++) {
@@ -268,12 +312,12 @@ static int mca_btl_openib_size_queues(st
         openib_btl->device->cq_size[BTL_OPENIB_LP_CQ] += send_cqes;
     }
 
-    rc = adjust_cq(device, BTL_OPENIB_HP_CQ);
+    rc = adjust_cq(device, BTL_OPENIB_HP_CQ, old_cq_size[BTL_OPENIB_HP_CQ]);
     if (OMPI_SUCCESS != rc) {
         goto out;
     }
 
-    rc = adjust_cq(device, BTL_OPENIB_LP_CQ);
+    rc = adjust_cq(device, BTL_OPENIB_LP_CQ, old_cq_size[BTL_OPENIB_LP_CQ]);
     if (OMPI_SUCCESS != rc) {
         goto out;
     }
@@ -496,7 +540,13 @@ int mca_btl_openib_add_procs(
         peers[i] = endpoint;
     }
 
-    return mca_btl_openib_size_queues(openib_btl, nprocs);
+    rc = mca_btl_openib_size_queues(openib_btl, nprocs);
+    if (OMPI_SUCCESS != rc) {
+        mca_btl_openib_del_procs(btl, nprocs, ompi_procs, peers);
+        opal_bitmap_clear_all_bits(reachable);
+    }
+
+    return rc;
 }
 
 /*
diff -r cf107f1a397e ompi/mca/btl/openib/help-mpi-btl-openib.txt
--- a/ompi/mca/btl/openib/help-mpi-btl-openib.txt Thu Nov 26 15:05:37 2009 +0100
+++ b/ompi/mca/btl/openib/help-mpi-btl-openib.txt Thu Nov 26 15:08:03 2009 +0100
@@ -590,3 +590,13 @@ value will be ignored.
   Local host: %s
   Value:      %s
   Message:    %s
+#
+[CQ resized lower]
+WARNING: Could not resize CQ to the size originally asked for.
+
+  Local host:     %s
+  Device name:    %s
+  Size asked for: %d
+  Actual CQ size: %d
+
+This may result in lower performance.

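For anyone who wants to reason about the "undo" step in the last btl_openib.c hunk without building Open MPI, here is a minimal standalone sketch of the same idea: register the peers, try to size the queues, and if that fails, unregister everything and clear the reachable flags so that the caller never sees a half-initialized BTL. All names in it (toy_btl, toy_size_queues, toy_add_procs, MAX_PEERS) are made up for this illustration and are not Open MPI APIs; the real cleanup is done by mca_btl_openib_del_procs() and opal_bitmap_clear_all_bits() as in the patch.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_PEERS 8   /* hypothetical limit, for the sketch only */

/* Hypothetical, much simplified BTL state. */
struct toy_btl {
    bool registered[MAX_PEERS]; /* peers currently known to the BTL */
    bool fail_sizing;           /* force the queue-sizing step to fail */
};

/* Stand-in for mca_btl_openib_size_queues(): 0 on success, -1 on error. */
static int toy_size_queues(const struct toy_btl *btl)
{
    return btl->fail_sizing ? -1 : 0;
}

/* Register nprocs peers and mark them reachable. If the sizing step
 * fails afterwards, roll back both the registration and the reachable
 * flags before returning the error. */
static int toy_add_procs(struct toy_btl *btl, size_t nprocs, bool *reachable)
{
    size_t i;
    int rc;

    if (nprocs > MAX_PEERS) {
        return -1;
    }

    for (i = 0; i < nprocs; i++) {
        btl->registered[i] = true;
        reachable[i] = true;
    }

    rc = toy_size_queues(btl);
    if (0 != rc) {
        /* Undo everything done above. */
        for (i = 0; i < nprocs; i++) {
            btl->registered[i] = false;
            reachable[i] = false;
        }
    }

    return rc;
}
```

The point of clearing the reachable flags in the error path is that the upper layer then has no reason to select this BTL for any peer, which is what avoids reaching the NULL SRQ later on.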
[derbeyn@inti0 ~]$ salloc -n 16 -N 2 -p Zeus mpirun --mca btl openib,self /home_nfs/derbeyn/Bull-vs2//opt/IMB/IMB-MPI1 -npmin 16 sendrecv
salloc: Granted job allocation 90732
[inti42][[4571,1],13][../../../../../ompi/mca/btl/openib/btl_openib.c:201:adjust_cq] cannot resize completion queue, error: 22
[inti41][[4571,1],6][../../../../../ompi/mca/btl/openib/btl_openib.c:201:adjust_cq] cannot resize completion queue, error: 22
#---------------------------------------------------
#    Intel (R) MPI Benchmark Suite V2.3, MPI-1 part
#---------------------------------------------------
# Date       : Thu Nov 26 15:52:27 2009
# Machine    : x86_64# System     : Linux
# Release    : 2.6.18-128.el5.Bull.3
# Version    : #1 SMP Fri Feb 13 10:09:19 CET 2009
#
# Minimum message length in bytes:   0
# Maximum message length in bytes:   16777216
#
# MPI_Datatype                   :   MPI_BYTE
# MPI_Datatype for reductions    :   MPI_FLOAT
# MPI_Op                         :   MPI_SUM
#
#
# List of Benchmarks to run:
# Sendrecv
[inti41:06482] *** Process received signal ***
[inti41:06482] Signal: Segmentation fault (11)
[inti41:06482] Signal code: Address not mapped (1)
[inti41:06482] Failing at address: (nil)
[inti41:06482] [ 0] /lib64/libpthread.so.0 [0x305d00de60]
[inti41:06482] [ 1] /home_nfs/derbeyn/DISTS/openmpi-default/lib/openmpi/mca_btl_openib.so [0x2aac3d401597]
[inti41:06482] [ 2] /home_nfs/derbeyn/DISTS/openmpi-default/lib/openmpi/mca_btl_openib.so [0x2aac3d409e2c]
[inti41:06482] [ 3] /home_nfs/derbeyn/DISTS/openmpi-default/lib/openmpi/mca_btl_openib.so [0x2aac3d4134c5]
[inti41:06482] [ 4] /home_nfs/derbeyn/DISTS/openmpi-default/lib/openmpi/mca_rml_oob.so [0x2aac3b1868a1]
[inti41:06482] [ 5] /home_nfs/derbeyn/DISTS/openmpi-default/lib/openmpi/mca_oob_tcp.so [0x2aac3b3901a0]
[inti41:06482] [ 6] /home_nfs/derbeyn/DISTS/openmpi-default/lib/openmpi/mca_oob_tcp.so [0x2aac3b3914ca]
[inti41:06482] [ 7] /home_nfs/derbeyn/DISTS/openmpi-default/lib/libopen-pal.so.0 [0x2aac3a908fcb]
[inti41:06482] [ 8] /home_nfs/derbeyn/DISTS/openmpi-default/lib/libopen-pal.so.0(opal_progress+0x9e) [0x2aac3a8f57fe]
[inti41:06482] [ 9] /home_nfs/derbeyn/DISTS/openmpi-default/lib/libmpi.so.0 [0x2aac3a418035]
[inti41:06482] [10] /home_nfs/derbeyn/DISTS/openmpi-default/lib/openmpi/mca_coll_tuned.so [0x2aac3e67ed55]
[inti41:06482] [11] /home_nfs/derbeyn/DISTS/openmpi-default/lib/openmpi/mca_coll_tuned.so [0x2aac3e67eed7]
[inti41:06482] [12] /home_nfs/derbeyn/DISTS/openmpi-default/lib/openmpi/mca_coll_tuned.so [0x2aac3e674d7f]
[inti41:06482] [13] /home_nfs/derbeyn/DISTS/openmpi-default/lib/openmpi/mca_coll_sync.so [0x2aac3e4712d9]
[inti41:06482] [14] /home_nfs/derbeyn/DISTS/openmpi-default/lib/libmpi.so.0(MPI_Bcast+0x171) [0x2aac3a4241b1]
[inti41:06482] [15] /home_nfs/derbeyn/Bull-vs2//opt/IMB/IMB-MPI1(IMB_basic_input+0x956) [0x4042d6]
[inti41:06482] [16] /home_nfs/derbeyn/Bull-vs2//opt/IMB/IMB-MPI1(main+0x6b) [0x402eab]
[inti41:06482] [17] /lib64/libc.so.6(__libc_start_main+0xf4) [0x305c41d8a4]
[inti41:06482] [18] /home_nfs/derbeyn/Bull-vs2//opt/IMB/IMB-MPI1 [0x402d89]
[inti41:06482] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 6 with PID 6482 on node inti41 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
salloc: Relinquishing job allocation 90732

[derbeyn@inti0 ~]$ salloc -n 16 -N 2 -p Zeus mpirun --mca btl openib,self /home_nfs/derbeyn/Bull-vs2//opt/IMB/IMB-MPI1 -npmin 16 sendrecv
salloc: Granted job allocation 90737
--------------------------------------------------------------------------
WARNING: Could not resize CQ to the size originally asked for.

  Local host:     inti41
  Device name:    mthca0
  Size asked for: 9344
  Actual CQ size: 7008

This may result in lower performance.
--------------------------------------------------------------------------
#---------------------------------------------------
#    Intel (R) MPI Benchmark Suite V2.3, MPI-1 part
#---------------------------------------------------
# Date       : Thu Nov 26 15:55:03 2009
# Machine    : x86_64# System     : Linux
# Release    : 2.6.18-128.el5.Bull.3
# Version    : #1 SMP Fri Feb 13 10:09:19 CET 2009
#
# Minimum message length in bytes:   0
# Maximum message length in bytes:   16777216
#
# MPI_Datatype                   :   MPI_BYTE
# MPI_Datatype for reductions    :   MPI_FLOAT
# MPI_Op                         :   MPI_SUM
#
#
# List of Benchmarks to run:
# Sendrecv
#-----------------------------------------------------------------------------
# Benchmarking Sendrecv
# #processes = 16
#-----------------------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]   Mbytes/sec
            0         1000        30.91        31.07        30.99         0.00
            1         1000        30.15        30.32        30.24         0.06
            2         1000        29.79        30.04        29.96         0.13
            4         1000        29.38        29.56        29.47         0.26
            8         1000        39.45        39.60        39.55         0.39
           16         1000        29.22        29.38        29.32         1.04
           32         1000        29.44        29.97        29.85         2.04
           64         1000        39.91        41.17        40.51         2.97
          128         1000        38.99        39.62        39.47         6.16
          256         1000        28.58        28.81        28.72        16.95
          512         1000        29.67        29.85        29.77        32.72
         1024         1000        42.02        42.18        42.07        46.31
         2048         1000        46.98        47.27        47.16        82.64
         4096         1000        47.91        48.25        48.12       161.93
         8192         1000        88.36        88.62        88.49       176.31
        16384         1000       254.80       255.03       254.96       122.53
        32768         1000       360.07       361.14       360.73       173.06
        65536          640       561.28       574.89       571.20       217.43
[inti0:22534] 3 more processes have sent help message help-mpi-btl-openib.txt / CQ resized lower
[inti0:22534] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
       131072          320      1054.02      1077.62      1069.75       231.99
       262144          160      2098.44      2128.30      2119.77       234.93
       524288           80      4234.15      4258.82      4250.09       234.81
      1048576           40      8390.82      8463.80      8435.44       236.30
      2097152           20     16565.85     16796.04     16712.02       238.15
      4194304           10     32637.41     33899.59     33395.27       235.99
      8388608            5     61666.01     68091.20     65520.32       234.98
     16777216            2    129991.41    138666.51    134618.62       230.77
salloc: Relinquishing job allocation 90737

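To make the retry arithmetic behind the "CQ resized lower" warning easier to follow, here is a minimal standalone sketch of the halving loop used in the patch. It does not call libibverbs: fake_resize_cq() is a made-up stand-in for ibv_resize_cq(), and shrink_until_accepted() and the limit parameter are hypothetical names used only for this illustration. The previous size of 4672 in the usage note below is an assumption chosen because a single halving step from 9344 then lands on 7008, the two values shown in the warning; the real previous size and device limit are not visible in the output.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for ibv_resize_cq(): accepts any size up to
 * 'limit' and returns EINVAL for anything larger. */
static int fake_resize_cq(uint32_t limit, uint32_t cq_size)
{
    return (cq_size > limit) ? EINVAL : 0;
}

/* Same scheme as the loop added to adjust_cq(): start from the requested
 * size and, while EINVAL is returned, step halfway back towards the last
 * size known to work (old_cq_size).
 * Returns 0 and stores the accepted size in *actual on success,
 * or the last error code if even old_cq_size is rejected. */
static int shrink_until_accepted(uint32_t limit, uint32_t old_cq_size,
                                 uint32_t requested, uint32_t *actual)
{
    uint32_t cq_size = requested;
    int rc = fake_resize_cq(limit, cq_size);

    while (EINVAL == rc && cq_size > old_cq_size) {
        cq_size = old_cq_size + ((cq_size - old_cq_size) / 2);
        rc = fake_resize_cq(limit, cq_size);
    }

    if (0 == rc) {
        *actual = cq_size;
    }
    return rc;
}
```

For example, calling shrink_until_accepted(8000, 4672, 9344, &actual) rejects 9344, then tries 4672 + (9344 - 4672) / 2 = 7008, which is accepted. Because the integer division eventually brings cq_size down to old_cq_size, the loop always terminates, and a persistent EINVAL is reported to the caller as a real error.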
[derbeyn@inti0 ~]$ salloc -n 16 -N 2 -p Zeus mpirun --mca btl openib,self /home_nfs/derbeyn/Bull-vs2//opt/IMB/IMB-MPI1 -npmin 16 sendrecv
salloc: Granted job allocation 90741
[inti42][[6518,1],11][../../../../../ompi/mca/btl/openib/btl_openib.c:220:adjust_cq] cannot resize completion queue, error: 22
[inti41][[6518,1],4][../../../../../ompi/mca/btl/openib/btl_openib.c:220:adjust_cq] cannot resize completion queue, error: 22
[inti41][[6518,1],4][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:788:rml_recv_cb] can't find suitable endpoint for this peer
--------------------------------------------------------------------------
mpirun has exited due to process rank 4 with PID 6569 on
node inti41 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
salloc: Relinquishing job allocation 90741


This result was obtained by applying the following patch, which emulates
an unconditional EINVAL.

btl/openib: emulate persisting EINVAL


diff -r bd820c9c0415 ompi/mca/btl/openib/btl_openib.c
--- a/ompi/mca/btl/openib/btl_openib.c Thu Nov 26 15:53:22 2009 +0100
+++ b/ompi/mca/btl/openib/btl_openib.c Thu Nov 26 15:59:03 2009 +0100
@@ -214,6 +214,7 @@ static int adjust_cq(mca_btl_openib_devi
             while (EINVAL == abs(rc) && cq_size > old_cq_size) {
                 cq_size = old_cq_size + ((cq_size - old_cq_size) / 2);
                 rc = ibv_resize_cq(device->ib_cq[cq], cq_size);
+rc = EINVAL;
             }
             if (rc) {
                 BTL_ERROR(("cannot resize completion queue, error: %d", rc));