Hi list,
We are currently experiencing deadlocks when using communicators other
than MPI_COMM_WORLD. So we made a very simple reproducer (MPI_Comm_create
followed by an MPI_Barrier on the new communicator - see the end of this e-mail).
We can reproduce the deadlock only with openib (no success with sm), with
at least 8 cores, and after roughly 20 runs on average. Using a larger
number of cores greatly increases the occurrence of the deadlock. When the
deadlock occurs, every even-ranked process is stuck in MPI_Finalize and
every odd-ranked process is stuck in MPI_Barrier.
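(A quick way to see where each rank is stuck is to add a printf/fflush
before and after the barrier. This is only an illustrative sketch, not the
exact reproducer: it builds the communicator directly from the world group
instead of via MPI_Group_range_incl, but covers the same ranks.)

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    MPI_Comm testComm;
    MPI_Group world_group;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_group(MPI_COMM_WORLD, &world_group);

    /* Communicator over all ranks of MPI_COMM_WORLD (the reproducer
     * below builds the same group via MPI_Group_range_incl). */
    MPI_Comm_create(MPI_COMM_WORLD, world_group, &testComm);

    printf("rank %d: entering MPI_Barrier\n", rank);
    fflush(stdout);
    MPI_Barrier(testComm);
    printf("rank %d: past MPI_Barrier, entering MPI_Finalize\n", rank);
    fflush(stdout);

    MPI_Finalize();
    return 0;
}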
So we tracked the bug through the changesets and found that this patch
seems to have introduced it:
user: brbarret
date: Tue Aug 25 15:13:31 2009 +0000
summary: Per discussion in ticket #2009, temporarily disable the block CID
allocation algorithms until they properly reuse CIDs.
Reverting to the non-multi-threaded CID allocator makes the deadlock
disappear.
I tried to dig further to understand why this makes a difference, but had
no luck.
If anyone can figure out what's happening, that would be great ...
Thanks,
Sylvain
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, numTasks;
    int range[3];
    MPI_Comm testComm, dupComm;
    MPI_Group orig_group, new_group;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &numTasks);
    MPI_Comm_group(MPI_COMM_WORLD, &orig_group);

    range[0] = 0;            /* first rank */
    range[1] = numTasks - 1; /* last rank */
    range[2] = 1;            /* stride */
    MPI_Group_range_incl(orig_group, 1, &range, &new_group);

    MPI_Comm_create(MPI_COMM_WORLD, new_group, &testComm);
    MPI_Barrier(testComm);

    MPI_Finalize();
    return 0;
}