Hi,

Our validation team detected a hang when running the osu_bibw 
micro-benchmark from the OMB 5.3 suite on Open MPI 2.0.2 (note that the 
same hang appears with Open MPI 3.0).
The hang occurs when running osu_bibw on a single node (vader btl) with 
the options "-x 100 -i 1000".
The -x option sets the warmup loop size.
The -i option sets the measured loop size.
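
For reference, the failing run looks something like this (the exact 
mpirun options here are illustrative, not a copy of our command line):

    mpirun -np 2 --mca btl vader,self ./osu_bibw -x 100 -i 1000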

For each exchanged message size, osu_bibw loops over the following 
sequence on both ranks (a C sketch follows the list):
    . posts 64 non-blocking sends
    . posts 64 non-blocking receives
    . waits for all the send requests to complete
    . waits for all the receive requests to complete
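
Here is a minimal C sketch of that per-message-size sequence, as read 
from the description above; the window size of 64 and all the names 
are illustrative, not the exact OMB source:

    /* One osu_bibw message-size loop: 64 non-blocking sends, 64
     * non-blocking receives, then wait on sends, then on receives. */
    #include <mpi.h>

    #define WINDOW_SIZE 64

    static void bibw_loop(char *sbuf, char *rbuf, int size, int peer,
                          int skip, int loop)
    {
        MPI_Request sreq[WINDOW_SIZE], rreq[WINDOW_SIZE];

        for (int i = 0; i < skip + loop; i++) {
            for (int j = 0; j < WINDOW_SIZE; j++)
                MPI_Isend(sbuf, size, MPI_CHAR, peer, 100,
                          MPI_COMM_WORLD, &sreq[j]);
            for (int j = 0; j < WINDOW_SIZE; j++)
                MPI_Irecv(rbuf, size, MPI_CHAR, peer, 100,
                          MPI_COMM_WORLD, &rreq[j]);
            MPI_Waitall(WINDOW_SIZE, sreq, MPI_STATUSES_IGNORE);
            MPI_Waitall(WINDOW_SIZE, rreq, MPI_STATUSES_IGNORE);
        }
    }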

The loop size is the sum of:
    . options.skip (warmup phase, which can be changed with the -x option)
    . options.loop (the actually measured loop, which can be changed 
with the -i option).

The default values are the following:

+==============+======+======+
| message size | skip | loop |
+==============+======+======+
|    <= 8K     |   10 |  100 |
|    >  8K     |    2 |   20 |
+==============+======+======+

As said above, the test hangs when moving to more aggressive loop 
values: 100 for skip (-x) and 1000 for loop (-i).

mca_btl_vader_frag_alloc() calls opal_free_list_get() to get a fragment 
from the appropriate free list.
If there are no free fragments left, opal_free_list_get() calls 
opal_free_list_grow(), which in turn calls mca_btl_vader_frag_init() 
(the initialization routine for the vader btl fragments).
This routine checks whether there is enough space left in the mapped 
memory segment for the requested fragment size (current offset + 
fragment size should be <= segment size), and it makes 
opal_free_list_grow() fail if the shared memory segment is exhausted.
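
In other words, the failing path looks roughly like this (a simplified 
paraphrase, not the verbatim Open MPI source):

    /* Simplified sketch of the allocation path described above. */
    static inline mca_btl_vader_frag_t *frag_alloc(opal_free_list_t *list)
    {
        /* Non-blocking: when the list is empty, this tries
         * opal_free_list_grow(), whose per-item init callback
         * (mca_btl_vader_frag_init) refuses the growth once
         * current offset + fragment size > segment size. */
        opal_free_list_item_t *item = opal_free_list_get(list);
        if (NULL == item) {
            return NULL;  /* shared memory segment exhausted */
        }
        return (mca_btl_vader_frag_t *) item;
    }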

As soon as we begin exhausting memory, the two ranks get out of sync 
and the test rapidly hangs. To avoid this hang, I found two possible 
solutions:

1) change the vader btl segment size: I set it to 4GB - in order to be 
able to do this, I had to change the type parameter in the parameter 
registration to MCA_BASE_VAR_TYPE_SIZE_T.
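
The registration change would look something like this (the parameter 
name and storage field approximate the vader component code; this is 
not the exact diff I applied):

    /* Sketch of change #1: register the segment size as a size_t so
     * that values of 4GB and above fit; names are approximate. */
    (void) mca_base_component_var_register(&mca_btl_vader_component.super.btl_version,
                                           "segment_size",
                                           "Size of the shared memory backing segment",
                                           MCA_BASE_VAR_TYPE_SIZE_T, /* previously an int type */
                                           NULL, 0, 0, OPAL_INFO_LVL_3,
                                           MCA_BASE_VAR_SCOPE_READONLY,
                                           &mca_btl_vader_component.segment_size);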

2) replace the call to opal_free_list_get() with a call to 
opal_free_list_wait() in mca_btl_vader_frag_alloc(). This also makes 
the micro-benchmark run to completion.
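
That is, inside mca_btl_vader_frag_alloc() (again a sketch, not the 
exact diff):

    opal_free_list_item_t *item;

    /* before: non-blocking, returns NULL once the segment is exhausted */
    /* item = opal_free_list_get(list); */

    /* after: blocks until another fragment is returned to the list */
    item = opal_free_list_wait(list);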

So my question is: which would be the better approach (#1 or #2)? And 
the question behind this: what is the reason for favoring 
opal_free_list_get() over opal_free_list_wait()?

Thanks

-- 
Nadia Derbey - B1-387
HPC R&D - MPI
Tel: +33 4 76 29 77 62
nadia.der...@atos.net
1 Rue de Provence BP 208
38130 Echirolles Cedex, France
www.atos.com