Hi list,

We are currently experiencing deadlocks when using communicators other than MPI_COMM_WORLD, so we wrote a very simple reproducer (MPI_Comm_create followed by MPI_Barrier on the new communicator; see the end of this e-mail).

We can reproduce the deadlock only with the openib BTL and at least 8 cores (no luck with sm), after about 20 runs on average. Using a larger number of cores greatly increases the occurrence of the deadlock. When it occurs, every even-ranked process is stuck in MPI_Finalize and every odd-ranked process is stuck in MPI_Barrier.
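
In case it is useful, here is roughly how we drive the reproducer until it hangs (the BTL selection, process count and file name are just our local setup, so adjust as needed):

mpicc -o comm_barrier comm_barrier.c   # source is at the end of this e-mail
i=0
while mpirun --mca btl openib,self -np 8 ./comm_barrier; do
    i=$((i+1)); echo "run $i completed"
done
# The loop keeps going as long as runs finish; when the deadlock hits,
# mpirun never returns and the loop simply stops making progress.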

So we tracked the bug through the changesets and found that the following patch seems to have introduced it:

user:        brbarret
date:        Tue Aug 25 15:13:31 2009 +0000
summary:     Per discussion in ticket #2009, temporarily disable the block CID
             allocation algorithms until they properly reuse CIDs.

Reverting to the non-multi-threaded CID allocator makes the deadlock disappear.

I tried to dig further to understand why this makes a difference, but had no luck.

If anyone can figure out what's happening, that would be great ...

Thanks,
Sylvain

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, numTasks;
    int range[3];
    MPI_Comm testComm;
    MPI_Group orig_group, new_group;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &numTasks);
    MPI_Comm_group(MPI_COMM_WORLD, &orig_group);
    range[0] = 0; /* first rank */
    range[1] = numTasks - 1; /* last rank */
    range[2] = 1; /* stride */
    MPI_Group_range_incl(orig_group, 1, &range, &new_group);
    MPI_Comm_create(MPI_COMM_WORLD, new_group, &testComm); /* new communicator over all ranks */
    MPI_Barrier(testComm); /* odd ranks hang here, even ranks hang in MPI_Finalize */
    MPI_Finalize();
    return 0;
}
