Hi,

Our validation team detected a hang when running the osu_bibw micro-benchmark from the OMB 5.3 suite on openmpi 2.0.2 (note that the same hang appears with openmpi-3.0). The hang occurs when running osu_bibw on a single node (vader btl) with the options "-x 100 -i 1000". The -x option changes the warmup loop size; the -i option changes the measured loop size.
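For reference, a typical single-node invocation that reproduces the hang could look like the following (the exact mpirun flags are an assumption on my side; adjust to your launcher and btl selection):

```
mpirun -np 2 --mca btl self,vader ./osu_bibw -x 100 -i 1000
```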
For each exchanged message size, osu_bibw loops doing the following sequence on both ranks:
. posts 64 non-blocking sends
. posts 64 non-blocking receives
. waits for all the send requests to complete
. waits for all the receive requests to complete

The loop size is the sum of:
. options.skip (warm-up phase, changed with the -x option)
. options.loop (the actually measured loop, changed with the -i option)

The default values are the following:

+==============+======+======+
| message size | skip | loop |
|==============+======+======|
| <= 8K        |  10  |  100 |
| > 8K         |   2  |   20 |
+==============+======+======+

As said above, the test hangs when moving to more aggressive loop values: 100 for skip and 1000 for loop.

mca_btl_vader_frag_alloc() calls opal_free_list_get() to get a fragment from the appropriate free list. If there are no free fragments left, opal_free_list_get() calls opal_free_list_grow(), which in turn calls mca_btl_vader_frag_init() (the initialization routine for vader btl fragments). This routine checks whether there is enough space left in the mapped memory segment for the requested fragment size (current offset + fragment size should be <= segment size), and it makes opal_free_list_grow() fail if the shared memory segment is exhausted. As soon as we begin exhausting memory, the 2 ranks get unsynchronized and the test rapidly hangs.

To avoid this hang, I found 2 possible solutions:

1) Change the vader btl segment size: I set it to 4GB. In order to be able to do this, I had to change the type parameter in the parameter registrations to MCA_BASE_VAR_TYPE_SIZE_T.

2) Replace the call to opal_free_list_get() with a call to opal_free_list_wait() in mca_btl_vader_frag_alloc(). This also makes the micro-benchmark run to the end.

So my question is: what would be the best approach (#1 or #2)? And the question behind this is: what is the reason that makes favoring opal_free_list_get() over opal_free_list_wait() preferable?
Thanks

--
Nadia Derbey - B1-387
HPC R&D - MPI
Tel: +33 4 76 29 77 62
nadia.der...@atos.net
1 Rue de Provence BP 208
38130 Echirolles Cedex, France
www.atos.com

_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel