Coll/ml does disqualify itself if processes are not bound. The problem here is 
that the two sides of the intercommunicator are inconsistent. I can write a 
quick fix for 1.8.2.

-Nathan
________________________________________
From: devel [devel-boun...@open-mpi.org] on behalf of Gilles Gouaillardet 
[gilles.gouaillar...@gmail.com]
Sent: Thursday, June 05, 2014 1:20 AM
To: Open MPI Developers
Subject: [OMPI devel] MPI_Comm_spawn affinity and coll/ml

Folks,

On my single-socket, four-core VM (no batch manager), I am running the 
intercomm_create test from the IBM test suite.

mpirun -np 1 ./intercomm_create
=> OK

mpirun -np 2 ./intercomm_create
=> HANG :-(

mpirun -np 2 --mca coll ^ml  ./intercomm_create
=> OK

Basically, the first two tasks call MPI_Comm_spawn (2 tasks) twice, followed by 
MPI_Intercomm_merge, and the 4 spawned tasks call MPI_Intercomm_merge followed 
by MPI_Intercomm_create.

I dug a bit into this and found two distinct issues:

1) binding:
tasks [0-1] (launched with mpirun) are bound to cores [0-1] => OK
tasks [2-3] (first spawn) are bound to cores [0-1] => ODD, I would have expected 
[2-3]
tasks [4-5] (second spawn) are not bound at all => ODD again, this could have 
made sense if tasks [2-3] were bound to cores [2-3]
I observe the same behaviour with the --oversubscribe mpirun parameter.
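
For anyone who wants to check the bindings listed above from inside the tasks themselves, here is a minimal sketch. It assumes Linux with glibc (sched_getaffinity and the CPU_* macros). The function names are hypothetical and are not part of Open MPI or the test suite. The "unbound" test is only a heuristic I am assuming here: a task allowed on every online core is treated as unbound, which cannot tell the two cases apart on a single-core machine or inside a restricted cpuset.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* needed for sched_getaffinity() and the CPU_* macros */
#endif
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

/* Number of cores the calling process may run on, or -1 on error. */
int allowed_cpu_count(void)
{
    cpu_set_t mask;

    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {
        return -1;
    }
    return CPU_COUNT(&mask);
}

/*
 * Heuristic (an assumption, not Open MPI's own logic):
 * returns 1 if the process is allowed on every online core,
 * 0 if it is restricted to a subset, -1 on error.
 */
int looks_unbound(void)
{
    long online = sysconf(_SC_NPROCESSORS_ONLN);
    int allowed = allowed_cpu_count();

    if (allowed < 0 || online < 1) {
        return -1;
    }
    return allowed >= online ? 1 : 0;
}

/* Print one line listing the cores this task may run on. */
void print_binding(int rank)
{
    cpu_set_t mask;
    int cpu;

    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_getaffinity");
        return;
    }
    printf("task %d may run on cores:", rank);
    for (cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (CPU_ISSET(cpu, &mask)) {
            printf(" %d", cpu);
        }
    }
    printf(" (looks_unbound=%d)\n", looks_unbound());
}
```

Calling print_binding(rank) right after MPI_Init in both the parent and the spawned tasks should be enough to compare the actual placement with the expected one.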

2) coll/ml:
coll/ml hangs with -np 2 (6 tasks in total, including 2 unbound tasks).
I suspect coll/ml is unable to handle unbound tasks.
If I am correct, should coll/ml detect this and automatically disqualify 
itself?

Cheers,

Gilles
