Hello,

I have been investigating the use of XRC on a cluster with a Mellanox interconnect, and I have found that in a certain situation I get a seg fault. I am using Open MPI 1.10.2 compiled with gcc 5.3.0, and the simplest configure line that I have found that still results in the seg fault is as follows:
$> ./configure --with-hcoll --with-mxm --prefix=...

I do have mxm 3.4.3065 and hcoll 3.3.768 installed into system space (/usr/lib64). If I use '--without-hcoll --without-mxm', the seg fault does not happen.
The seg fault happens even when using examples/hello_c.c, so here is an example of the seg fault using it:
$> mpicc hello_c.c -o hello_c.x
$> mpirun -n 1 ./hello_c.x
Hello, world, I am 0 of 1, (Open MPI v1.10.2, package: Open MPI dshra...@mu-fey.lanl.gov Distribution, ident: 1.10.2, repo rev: v1.10.1-145-g799148f, Jan 21, 2016, 135)
$> mpirun -n 1 -mca btl_openib_receive_queues X,4096,1024:X,12288,512:X,65536,512 ./hello_c.x
Hello, world, I am 0 of 1, (Open MPI v1.10.2, package: Open MPI dshra...@mu-fey.lanl.gov Distribution, ident: 1.10.2, repo rev: v1.10.1-145-g799148f, Jan 21, 2016, 135)
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 22819 on node mu0001 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

The seg fault happens no matter the number of ranks. I have tried the above command with '-mca pml_base_verbose', and it shows that I am using the yalla pml:
$> mpirun -n 1 -mca btl_openib_receive_queues X,4096,1024:X,12288,512:X,65536,512 -mca pml_base_verbose 100 ./hello_c.x
...output snipped...
[mu0001.localdomain:22825] select: component cm not selected / finalized
[mu0001.localdomain:22825] select: component ob1 not selected / finalized
[mu0001.localdomain:22825] select: component yalla selected
...output snipped...
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 22825 on node mu0001 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Interestingly enough, if I tell mpirun what pml to use, the seg fault goes away. The following command does not get the seg fault:
$> mpirun -n 1 -mca btl_openib_receive_queues X,4096,1024:X,12288,512:X,65536,512 -mca pml yalla ./hello_c.x
Passing either ob1 or cm to '-mca pml' also works. So it seems that the seg fault occurs only when three things hold together: the yalla pml is selected by default rather than requested explicitly, Open MPI is built with mxm/hcoll, and the XRC receive queues are used. I'm not sure mxm itself is to blame, however, as using '-mca pml cm -mca mtl mxm' with the XRC parameters doesn't produce the seg fault.
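
For anyone who wants to repeat the comparison, here is a minimal sketch of the cases above as a script. The names build_cmd and run_case and the DRY_RUN variable are only illustrative helpers, not part of Open MPI; the mpirun options are the ones used earlier in this message. By default it only prints the commands, on the assumption that it may be read on a machine without Open MPI installed.

```shell
#!/bin/sh
# Sketch: rerun hello_c.x with the XRC receive queues under each pml choice
# discussed above. DRY_RUN=1 (the default here) only prints the commands;
# set DRY_RUN=0 on a node with Open MPI installed to actually run them.

QUEUES="X,4096,1024:X,12288,512:X,65536,512"   # XRC queue spec from this report
EXE="./hello_c.x"                               # built from examples/hello_c.c
DRY_RUN="${DRY_RUN:-1}"

# build_cmd PML: print the mpirun command line for one case.
# "default" means no '-mca pml' option, so Open MPI selects the pml itself.
build_cmd() {
    if [ "$1" = "default" ]; then
        echo "mpirun -n 1 -mca btl_openib_receive_queues $QUEUES $EXE"
    else
        echo "mpirun -n 1 -mca btl_openib_receive_queues $QUEUES -mca pml $1 $EXE"
    fi
}

# run_case PML: print the command and, unless this is a dry run,
# execute it and report its exit status.
run_case() {
    cmd=$(build_cmd "$1")
    echo "[$1] $cmd"
    if [ "$DRY_RUN" != "1" ]; then
        # Word splitting is intentional: no argument contains a space.
        $cmd
        echo "[$1] exit status: $?"
    fi
}

for pml in default yalla ob1 cm; do
    run_case "$pml"
done
```

In my runs only the "default" case seg faults; the three cases with an explicit '-mca pml' complete normally.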
Other information...

OS: RHEL 6.7-based (TOSS)
OpenFabrics: RedHat provided
Kernel: 2.6.32-573.8.1.2chaos.ch5.4.x86_64

Config.log and 'ompi_info --all' are in the tarball ompi.tar.bz2, which is attached.
Is there something else I should be doing with the yalla pml when using XRC? Regardless, I hope reporting the seg fault is useful.
Thanks,
David

--
David Shrader
HPC-ENV High Performance Computer Systems
Los Alamos National Lab
Email: dshrader <at> lanl.gov
ompi.tar.bz2
Description: application/bzip