We are seeing fork issues with a simple MPI program (attached) running on
2.6.16+ kernels and OFED 1.1. We have tried both Intel MPI and mvapich2, with
the same results:
t_fork> mpiexec -n 2 t_system_fork
parent process
[0] started child process with pid=31552
send desc error
parent process
[0] Abort: [] Got completion with error 1, vendor code=69, dest rank=1
at line 540 in file ibv_channel_manager.c
[1] I am child process with pid=25437
[1] started child process with pid=25437
[0] I am child process with pid=31552
child process
[1] finished pid=25437
child process
[0] finished pid=31552
rank 0 in job 2 svlmpicl400_32925 caused collective abort of all ranks
exit status of rank 0: return code 252
If you run mvapich2 for uDAPL, it hangs before the second MPI_Barrier(), just
like Intel MPI. If you use the I_MPI_RDMA_USE_EVD_FALLBACK=1 option with
Intel MPI, you get the following error, which is similar to the mvapich2 one:
parent process
parent process
[0] I am child process with pid=9596
[0] started child process with pid=9596
[1] I am child process with pid=11477
[1] started child process with pid=11477
[0][rdma_iba.c:1007] Intel MPI fatal error: DTO operation completed with error.
status=0x2.
cookie=0x1
[1][rdma_iba.c:1007] Intel MPI fatal error: DTO operation completed with error.
status=0x2.
cookie=0x1
child process
[1] finished pid=11477
child process
[0] finished pid=9596
rank 0 in job 8 cst-19_54707 caused collective abort of all ranks
exit status of rank 0: return code 255
Any insight would be greatly appreciated. It was our assumption that, with the
fixes that went into 2.6.16 and OFED 1.1, the parent process can continue to
use IB resources after a fork(). Is this true?
Thanks,
-arlin
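#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>  /* pid_t */
#include <unistd.h>     /* fork(), getpid() */

int main(int argc, char *argv[])
{
    int myid, numprocs;
    pid_t pid;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    MPI_Barrier(MPI_COMM_WORLD);

    /* system() forks internally, so this is the first fork after MPI_Init() */
    system("echo parent process");

    pid = fork();
    if (pid == 0) {
        /* child: makes no MPI calls, only reports and runs a shell command */
        pid = getpid();
        printf("[%d] I am child process with pid=%d\n", myid, (int)pid);
        system("echo child process");
    } else {
        /* parent: keeps using MPI (and therefore IB resources) after the fork */
        printf("[%d] started child process with pid=%d\n", myid, (int)pid);
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_Finalize();
        pid = getpid();
    }

    printf("[%d] finished pid=%d\n", myid, (int)pid);
    return 0;
}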
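
For comparison, here is a minimal sketch of the same fork-plus-system()
pattern with MPI and IB taken out entirely. It is not part of the attached
test, and run_in_child() is a hypothetical helper name used only for
illustration. It assumes a POSIX system with /bin/sh available. It may help
confirm that the pattern itself behaves on a given node before MPI is added
back in.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (not in the original test): fork, run `cmd` through
 * system() in the child, and hand the command's exit code back to the
 * parent. Returns -1 if the fork or the wait fails. */
int run_in_child(const char *cmd)
{
    int status;
    pid_t pid;

    fflush(NULL);               /* do not duplicate buffered stdio output */
    pid = fork();
    if (pid < 0)
        return -1;              /* fork failed */

    if (pid == 0) {
        /* child: run the command, then leave without running the
         * parent's atexit handlers or flushing its stdio buffers */
        int rc = system(cmd);
        _exit((rc != -1 && WIFEXITED(rc)) ? WEXITSTATUS(rc) : 127);
    }

    /* parent: wait for this specific child and decode its exit status */
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling run_in_child("echo child process") from a plain main() should print
the message and return 0 if fork() and system() are working normally outside
of MPI.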
_______________________________________________
openib-general mailing list
[email protected]
http://openib.org/mailman/listinfo/openib-general