Hi, David

We are looking into your report.

Best,
Josh

On Tue, Apr 19, 2016 at 4:41 PM, David Shrader <dshra...@lanl.gov> wrote:

> Hello,
>
> I have been investigating using XRC on a cluster with a mellanox
> interconnect. I have found that in a certain situation I get a seg fault. I
> am using 1.10.2 compiled with gcc 5.3.0, and the simplest configure line
> that I have found that still results in the seg fault is as follows:
>
> $> ./configure --with-hcoll --with-mxm --prefix=...
>
> I do have mxm 3.4.3065 and hcoll 3.3.768 installed in to system space
> (/usr/lib64). If I use '--without-hcoll --without-mxm,' the seg fault does
> not happen.
>
> The seg fault happens even when using examples/hello_c.c, so here is an
> example of the seg fault using it:
>
> $> mpicc hello_c.c -o hello_c.x
> $> mpirun -n 1 ./hello_c.x
> Hello, world, I am 0 of 1, (Open MPI v1.10.2, package: Open MPI
> dshra...@mu-fey.lanl.gov Distribution, ident: 1.10.2, repo rev:
> v1.10.1-145-g799148f, Jan 21, 2016, 135)
> $> mpirun -n 1 -mca btl_openib_receive_queues
> X,4096,1024:X,12288,512:X,65536,512
> Hello, world, I am 0 of 1, (Open MPI v1.10.2, package: Open MPI
> dshra...@mu-fey.lanl.gov Distribution, ident: 1.10.2, repo rev:
> v1.10.1-145-g799148f, Jan 21, 2016, 135)
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 22819 on node mu0001 exited on
> signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
>
> The seg fault happens no matter the number of ranks. I have tried the
> above command with '-mca pml_base_verbose,' and it shows that I am using
> the yalla pml:
>
> $> mpirun -n 1 -mca btl_openib_receive_queues
> X,4096,1024:X,12288,512:X,65536,512 -mca pml_base_verbose 100 ./hello_c.x
> ...output snipped...
> [mu0001.localdomain:22825] select: component cm not selected / finalized
> [mu0001.localdomain:22825] select: component ob1 not selected / finalized
> [mu0001.localdomain:22825] select: component yalla selected
> ...output snipped...
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 22825 on node mu0001 exited on
> signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
>
> Interestingly enough, if I tell mpirun what pml to use, the seg fault goes
> away. The following command does not get the seg fault:
>
> $> mpirun -n 1 -mca btl_openib_receive_queues
> X,4096,1024:X,12288,512:X,65536,512 -mca pml yalla ./hello_c.x
>
> Passing either ob1 or cm to '-mca pml' also works. So it seems that the
> seg fault comes about when the yalla pml is chosen by default, when
> mxm/hcoll is involved, and using XRC. I'm not sure if mxm is to blame,
> however, as using '-mca pml cm -mca mtl mxm' with the XRC parameters
> doesn't throw the seg fault.
>
> Other information...
> OS: RHEL 6.7-based (TOSS)
> OpenFabrics: RedHat provided
> Kernel: 2.6.32-573.8.1.2chaos.ch5.4.x86_64
> Config.log and 'ompi_info --all' are in the tarball ompi.tar.bz2 which is
> attached.
>
> Is there something else I should be doing with the yalla pml when using
> XRC? Regardless, I hope reporting the seg fault is useful.
>
> Thanks,
> David
>
> --
> David Shrader
> HPC-ENV High Performance Computer Systems
> Los Alamos National Lab
> Email: dshrader <at> lanl.gov
>
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2016/04/18786.php
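
For anyone trying to reproduce the quoted report, the four invocations it compares can be collected in one place. This is only a rough sketch: the names `XRC_QUEUES` and `xrc_cmd` are made up here for illustration and are not part of Open MPI, and the script only prints the command lines (it does not launch mpirun). The receive-queue string and the pml names are taken from the report as quoted.

```shell
#!/bin/sh
# Hypothetical helper (not part of Open MPI): print the mpirun variants
# compared in the quoted report. Nothing is executed; lines are only echoed.

# XRC receive-queue specification, copied from the report.
XRC_QUEUES='X,4096,1024:X,12288,512:X,65536,512'

# Print the mpirun command line for an optional pml name.
# With no argument, pml selection is left to the default, which is the
# case reported to end in a segmentation fault.
xrc_cmd() {
    pml_args=''
    if [ -n "${1:-}" ]; then
        pml_args=" -mca pml $1"
    fi
    printf 'mpirun -n 1 -mca btl_openib_receive_queues %s%s ./hello_c.x\n' \
        "$XRC_QUEUES" "$pml_args"
}

xrc_cmd          # default pml selection: reported to seg fault
xrc_cmd yalla    # explicit pml: reported to work
xrc_cmd ob1      # explicit pml: reported to work
xrc_cmd cm       # explicit pml: reported to work
```

Copying a printed line into a shell on the affected cluster runs that variant; adding `-mca pml_base_verbose 100`, as in the report, shows which pml was selected.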