Hello,

I have been investigating using XRC on a cluster with a Mellanox interconnect, and I have found a situation in which I get a seg fault. I am using Open MPI 1.10.2 compiled with gcc 5.3.0, and the simplest configure line I have found that still results in the seg fault is as follows:

$> ./configure --with-hcoll --with-mxm --prefix=...

I do have mxm 3.4.3065 and hcoll 3.3.768 installed into system space (/usr/lib64). If I use '--without-hcoll --without-mxm', the seg fault does not happen.

The seg fault happens even when using examples/hello_c.c, so here is an example of the seg fault using it:

$> mpicc hello_c.c -o hello_c.x
$> mpirun -n 1 ./hello_c.x
Hello, world, I am 0 of 1, (Open MPI v1.10.2, package: Open MPI dshra...@mu-fey.lanl.gov Distribution, ident: 1.10.2, repo rev: v1.10.1-145-g799148f, Jan 21, 2016, 135)
$> mpirun -n 1 -mca btl_openib_receive_queues X,4096,1024:X,12288,512:X,65536,512 ./hello_c.x
Hello, world, I am 0 of 1, (Open MPI v1.10.2, package: Open MPI dshra...@mu-fey.lanl.gov Distribution, ident: 1.10.2, repo rev: v1.10.1-145-g799148f, Jan 21, 2016, 135)
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 22819 on node mu0001 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

The seg fault happens no matter the number of ranks. I have tried the above command with '-mca pml_base_verbose 100', and the output shows that the yalla pml is selected:

$> mpirun -n 1 -mca btl_openib_receive_queues X,4096,1024:X,12288,512:X,65536,512 -mca pml_base_verbose 100 ./hello_c.x
...output snipped...
[mu0001.localdomain:22825] select: component cm not selected / finalized
[mu0001.localdomain:22825] select: component ob1 not selected / finalized
[mu0001.localdomain:22825] select: component yalla selected
...output snipped...
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 22825 on node mu0001 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Interestingly enough, if I tell mpirun which pml to use, the seg fault goes away. The following command does not seg fault:

$> mpirun -n 1 -mca btl_openib_receive_queues X,4096,1024:X,12288,512:X,65536,512 -mca pml yalla ./hello_c.x

Passing either ob1 or cm to '-mca pml' also works. So it seems that the seg fault needs all three conditions together: the yalla pml is chosen by default, mxm/hcoll are built in, and XRC is in use. I'm not sure if mxm is to blame, however, as using '-mca pml cm -mca mtl mxm' with the XRC parameters doesn't trigger the seg fault.

Other information...
OS: RHEL 6.7-based (TOSS)
OpenFabrics: RedHat provided
Kernel: 2.6.32-573.8.1.2chaos.ch5.4.x86_64
Config.log and the output of 'ompi_info --all' are in the attached tarball, ompi.tar.bz2.

Is there something else I should be doing with the yalla pml when using XRC? Regardless, I hope reporting the seg fault is useful.

Thanks,
David

--
David Shrader
HPC-ENV High Performance Computer Systems
Los Alamos National Lab
Email: dshrader <at> lanl.gov

Attachment: ompi.tar.bz2
Description: application/bzip
