Should be fixed by PR #4569 (https://github.com/open-mpi/ompi/pull/4569). Please test and let me know.
-Nathan

> On Dec 1, 2017, at 7:37 AM, DERBEY, NADIA <nadia.der...@atos.net> wrote:
>
> Hi,
>
> Our validation team detected a hang when running the osu_bibw
> micro-benchmark from the OMB 5.3 suite on Open MPI 2.0.2 (note that the
> same hang appears with Open MPI 3.0). The hang occurs when calling
> osu_bibw on a single node (vader btl) with the options "-x 100 -i 1000".
> The -x option changes the warmup loop size; the -i option changes the
> measured loop size.
>
> For each exchanged message size, osu_bibw loops doing the following
> sequence on both ranks:
> . posts 64 non-blocking sends
> . posts 64 non-blocking receives
> . waits for all the send requests to complete
> . waits for all the receive requests to complete
>
> The loop size is the sum of:
> . options.skip (the warmup phase, changed with the -x option)
> . options.loop (the actually measured loop, changed with the -i option)
>
> The default values are the following:
>
> +==============+======+======+
> | message size | skip | loop |
> |==============+======+======|
> | <= 8K        |   10 |  100 |
> | > 8K         |    2 |   20 |
> +==============+======+======+
>
> As said above, the test hangs when moving to more aggressive loop
> values: 100 for skip and 1000 for loop.
>
> mca_btl_vader_frag_alloc() calls opal_free_list_get() to get a fragment
> from the appropriate free list. If there are no free fragments left,
> opal_free_list_get() calls opal_free_list_grow(), which in turn calls
> mca_btl_vader_frag_init() (the initialization routine for the vader btl
> fragments). This routine checks whether there is enough space left in
> the mapped memory segment for the requested fragment size (current
> offset + fragment size should be <= segment size), and it makes
> opal_free_list_grow() fail if the shared memory segment is exhausted.
>
> As soon as we begin exhausting memory, the two ranks lose
> synchronization and the test rapidly hangs. To avoid this hang, I found
> two possible solutions:
>
> 1) Change the vader btl segment size: I set it to 4GB. To be able to do
> this, I had to change the type parameter in the parameter registration
> to MCA_BASE_VAR_TYPE_SIZE_T.
>
> 2) Replace the call to opal_free_list_get() with a call to
> opal_free_list_wait() in mca_btl_vader_frag_alloc(). This also makes
> the micro-benchmark run to completion.
>
> So my question is: which would be the best approach (#1 or #2)? And the
> underlying question is: what is the reason for favoring
> opal_free_list_get() over opal_free_list_wait()?
>
> Thanks
>
> --
> Nadia Derbey - B1-387
> HPC R&D - MPI
> Tel: +33 4 76 29 77 62
> nadia.der...@atos.net
> 1 Rue de Provence BP 208
> 38130 Echirolles Cedex, France
> www.atos.com
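For readers following along, the communication pattern described in the
quoted message boils down to roughly the following (a minimal
self-contained sketch, not the actual OMB source; the message size,
tag, and timing code are simplified):

    /* Sketch of the osu_bibw inner loop described above (simplified;
     * not the actual OMB source). Both ranks run the same sequence:
     * a window of non-blocking sends, a window of non-blocking
     * receives, then two MPI_Waitall calls. */
    #include <mpi.h>
    #include <stdlib.h>

    #define WINDOW_SIZE 64

    int main (int argc, char **argv)
    {
        int rank, size = 8192, skip = 100, loop = 1000; /* -x 100 -i 1000 */
        char *s_buf = malloc (size), *r_buf = malloc (size);
        MPI_Request sreq[WINDOW_SIZE], rreq[WINDOW_SIZE];

        MPI_Init (&argc, &argv);
        MPI_Comm_rank (MPI_COMM_WORLD, &rank);
        int peer = 1 - rank;                            /* two ranks only */

        for (int i = 0; i < skip + loop; i++) {
            for (int j = 0; j < WINDOW_SIZE; j++)
                MPI_Isend (s_buf, size, MPI_CHAR, peer, 100,
                           MPI_COMM_WORLD, &sreq[j]);
            for (int j = 0; j < WINDOW_SIZE; j++)
                MPI_Irecv (r_buf, size, MPI_CHAR, peer, 100,
                           MPI_COMM_WORLD, &rreq[j]);
            MPI_Waitall (WINDOW_SIZE, sreq, MPI_STATUSES_IGNORE);
            MPI_Waitall (WINDOW_SIZE, rreq, MPI_STATUSES_IGNORE);
        }

        MPI_Finalize ();
        free (s_buf); free (r_buf);
        return 0;
    }

With -x 100 -i 1000 each rank pushes 64 requests per iteration over 1100
iterations per message size, which explains why the larger iteration
counts put much more pressure on the vader free lists than the defaults.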
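For reference, a sketch of the difference between the two free-list
calls behind option #2, as paraphrased from opal/class/opal_free_list.h
(not verbatim; the real mca_btl_vader_frag_alloc() lives in
opal/mca/btl/vader/btl_vader_frag.h and does more bookkeeping):

    #include "opal/class/opal_free_list.h"

    static opal_free_list_item_t *frag_alloc_get (opal_free_list_t *list)
    {
        /* non-blocking: returns NULL when the list is empty and
         * opal_free_list_grow() fails, e.g. because the vader
         * shared-memory segment is exhausted */
        return opal_free_list_get (list);
    }

    static opal_free_list_item_t *frag_alloc_wait (opal_free_list_t *list)
    {
        /* blocking: roughly equivalent to retrying the non-blocking
         * get while driving opal_progress(), so it returns only once
         * an in-flight fragment has been returned to the list */
        return opal_free_list_wait (list);
    }

This is why option #2 cures the hang: instead of failing the allocation
when the segment is exhausted, the caller progresses the library until a
fragment is recycled.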
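For what it's worth, once option #1 is in place (the parameter
registered as MCA_BASE_VAR_TYPE_SIZE_T), the larger segment can be
requested from the command line; something like the following, where
btl_vader_segment_size is the existing parameter name and 4294967296 is
the 4GB value mentioned above:

    mpirun --mca btl self,vader --mca btl_vader_segment_size 4294967296 \
           -np 2 ./osu_bibw -x 100 -i 1000

With the current registration the 4GB value does not fit in the
parameter's type, which is exactly why the type change is needed.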