I fixed the binding algorithm so that it shifts the binding location of the spawned procs to be closer to what you expected. However, we still won't bind the procs of the final spawn if there aren't enough free cores to support them.
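
To illustrate the behavior described above, here is a toy model in Python. It is only a sketch under my own assumptions, not the actual Open MPI mapper/binder code: the function name bind_jobs and the "next free core" bookkeeping are hypothetical. It assumes one proc per core and that a job is left entirely unbound when it does not fit in the remaining free cores.

```python
def bind_jobs(num_cores, job_sizes):
    """Toy model (hypothetical, not Open MPI code).

    Each job (the initial mpirun launch or a later spawn) gets the next
    block of free cores, one core per proc. A job that does not fit in
    the remaining free cores is left unbound (None).
    """
    next_free = 0
    bindings = []
    for size in job_sizes:
        if next_free + size <= num_cores:
            # Enough free cores: bind this job's procs to the next block.
            bindings.append(list(range(next_free, next_free + size)))
            next_free += size
        else:
            # Not enough free cores: the whole job stays unbound.
            bindings.append(None)
    return bindings


if __name__ == "__main__":
    # 4 cores; mpirun -np 2, then two spawns of 2 tasks each.
    print(bind_jobs(4, [2, 2, 2]))  # prints [[0, 1], [2, 3], None]
```

With the four-core VM from the report, the model binds tasks [0-1] to cores [0-1], tasks [2-3] to cores [2-3], and leaves tasks [4-5] unbound.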

On Jun 5, 2014, at 7:12 AM, Hjelm, Nathan T <hje...@lanl.gov> wrote:

> Coll/ml does disqualify itself if processes are not bound. The problem here
> is there is an inconsistency between the two sides of the intercommunicator.
> I can write a quick fix for 1.8.2.
>
> -Nathan
> ________________________________________
> From: devel [devel-boun...@open-mpi.org] on behalf of Gilles Gouaillardet
> [gilles.gouaillar...@gmail.com]
> Sent: Thursday, June 05, 2014 1:20 AM
> To: Open MPI Developers
> Subject: [OMPI devel] MPI_Comm_spawn affinity and coll/ml
>
> Folks,
>
> on my single socket four cores VM (no batch manager), i am running the
> intercomm_create test from the ibm test suite.
>
> mpirun -np 1 ./intercomm_create
> => OK
>
> mpirun -np 2 ./intercomm_create
> => HANG :-(
>
> mpirun -np 2 --mca coll ^ml ./intercomm_create
> => OK
>
> basically, this first two tasks will call twice MPI_Comm_spawn(2 tasks)
> followed by MPI_Intercomm_merge
> and the 4 spawned tasks will call MPI_Intercomm_merge followed by
> MPI_Intercomm_create
>
> i digged a bit into that issue and found two distinct issues :
>
> 1) binding :
> tasks [0-1] (launched with mpirun) are bound on cores [0-1] => OK
> tasks[2-3] (first spawn) are bound on cores [0-1] => ODD, i would have
> expected [2-3]
> tasks[4-5] (second spawn) are not bound at all => ODD again, could have made
> sense if tasks[2-3] were bound on cores [2-3]
> i observe the same behaviour with the --oversubscribe mpirun parameter
>
> 2) coll/ml
> coll/ml hangs when -np 2 (total 6 tasks, including 2 unbound tasks)
> i suspect coll/ml is unable to handle unbound tasks.
> if i am correct, should coll/ml detect this and simply automatically
> disqualify itself ?
>
> Cheers,
>
> Gilles
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/06/14980.php