Hi, David

We are looking into your report.

Best,

Josh

On Tue, Apr 19, 2016 at 4:41 PM, David Shrader <dshra...@lanl.gov> wrote:

> Hello,
>
> I have been investigating using XRC on a cluster with a Mellanox
> interconnect. I have found that in a certain situation I get a seg fault. I
> am using 1.10.2 compiled with gcc 5.3.0, and the simplest configure line
> that I have found that still results in the seg fault is as follows:
>
> $> ./configure --with-hcoll --with-mxm --prefix=...
>
> I do have mxm 3.4.3065 and hcoll 3.3.768 installed into system space
> (/usr/lib64). If I use '--without-hcoll --without-mxm', the seg fault does
> not happen.
>
> The seg fault happens even when using examples/hello_c.c, so here is an
> example of the seg fault using it:
>
> $> mpicc hello_c.c -o hello_c.x
> $> mpirun -n 1 ./hello_c.x
> Hello, world, I am 0 of 1, (Open MPI v1.10.2, package: Open MPI
> dshra...@mu-fey.lanl.gov Distribution, ident: 1.10.2, repo rev:
> v1.10.1-145-g799148f, Jan 21, 2016, 135)
> $> mpirun -n 1 -mca btl_openib_receive_queues
> X,4096,1024:X,12288,512:X,65536,512 ./hello_c.x
> Hello, world, I am 0 of 1, (Open MPI v1.10.2, package: Open MPI
> dshra...@mu-fey.lanl.gov Distribution, ident: 1.10.2, repo rev:
> v1.10.1-145-g799148f, Jan 21, 2016, 135)
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 22819 on node mu0001 exited on
> signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
>
> The seg fault happens no matter the number of ranks. I have tried the
> above command with '-mca pml_base_verbose', and it shows that I am using
> the yalla pml:
>
> $> mpirun -n 1 -mca btl_openib_receive_queues
> X,4096,1024:X,12288,512:X,65536,512 -mca pml_base_verbose 100 ./hello_c.x
> ...output snipped...
> [mu0001.localdomain:22825] select: component cm not selected / finalized
> [mu0001.localdomain:22825] select: component ob1 not selected / finalized
> [mu0001.localdomain:22825] select: component yalla selected
> ...output snipped...
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 22825 on node mu0001 exited on
> signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
>
> Interestingly enough, if I tell mpirun what pml to use, the seg fault goes
> away. The following command does not get the seg fault:
>
> $> mpirun -n 1 -mca btl_openib_receive_queues
> X,4096,1024:X,12288,512:X,65536,512 -mca pml yalla ./hello_c.x
>
> Passing either ob1 or cm to '-mca pml' also works. So it seems that the
> seg fault occurs when the yalla pml is chosen by default, mxm/hcoll is
> involved, and XRC is in use. I'm not sure if mxm is to blame,
> however, as using '-mca pml cm -mca mtl mxm' with the XRC parameters
> doesn't throw the seg fault.
>
> Other information...
> OS: RHEL 6.7-based (TOSS)
> OpenFabrics: RedHat provided
> Kernel: 2.6.32-573.8.1.2chaos.ch5.4.x86_64
> Config.log and 'ompi_info --all' are in the tarball ompi.tar.bz2 which is
> attached.
>
> Is there something else I should be doing with the yalla pml when using
> XRC? Regardless, I hope reporting the seg fault is useful.
>
> Thanks,
> David
>
> --
> David Shrader
> HPC-ENV High Performance Computer Systems
> Los Alamos National Lab
> Email: dshrader <at> lanl.gov
>
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2016/04/18786.php
>
