David, thanks for the info you provided. I will try to dig in further to see what might be causing this issue.

In the meantime, perhaps Nathan can comment on the openib btl behavior here?
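
For reference while we dig, here is my reading of the receive_queues value David is using. This is only a sketch of how I understand the openib btl parses the spec (each colon-separated entry is <type>,<buffer size in bytes>,<number of buffers>, and a leading "X" requests an XRC shared receive queue), so please correct me if I have any field wrong:

    X,4096,1024   ->  XRC SRQ, 4096-byte buffers,  1024 buffers
    X,12288,512   ->  XRC SRQ, 12288-byte buffers, 512 buffers
    X,65536,512   ->  XRC SRQ, 65536-byte buffers, 512 buffers

The value itself looks well-formed to me, which is consistent with the crash being somewhere in the openib btl's open/close path rather than in the parameter value.
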
Thanks,
Alina.

On Wed, Apr 20, 2016 at 8:01 PM, David Shrader <dshra...@lanl.gov> wrote:

> Hello Alina,
>
> Thank you for the information about how the pml components work. I knew
> that the other components were being opened and ultimately closed in
> favor of yalla, but I didn't realize that the initial open would cause a
> persistent change in the ompi runtime.
>
> Here's the information you requested about the ib network:
>
> - MOFED version:
>   We are using the Open Fabrics Software as bundled by RedHat, and my ib
>   network folks say we're running something close to v1.5.4
>
> - ibv_devinfo:
>
> [dshrader@mu0001 examples]$ ibv_devinfo
> hca_id: mlx4_0
>         transport:                      InfiniBand (0)
>         fw_ver:                         2.9.1000
>         node_guid:                      0025:90ff:ff16:78d8
>         sys_image_guid:                 0025:90ff:ff16:78db
>         vendor_id:                      0x02c9
>         vendor_part_id:                 26428
>         hw_ver:                         0xB0
>         board_id:                       SM_2121000001000
>         phys_port_cnt:                  1
>                 port:   1
>                         state:                  PORT_ACTIVE (4)
>                         max_mtu:                4096 (5)
>                         active_mtu:             4096 (5)
>                         sm_lid:                 250
>                         port_lid:               366
>                         port_lmc:               0x00
>                         link_layer:             InfiniBand
>
> I still get the seg fault when specifying the hca:
>
> $> mpirun -n 1 -mca btl_openib_receive_queues
> X,4096,1024:X,12288,512:X,65536,512 -mca btl_openib_if_include mlx4_0
> ./hello_c.x
> Hello, world, I am 0 of 1, (Open MPI v1.10.2, package: Open MPI
> dshra...@mu-fey.lanl.gov Distribution, ident: 1.10.2, repo rev:
> v1.10.1-145-g799148f, Jan 21, 2016, 135)
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 10045 on node mu0001 exited
> on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
>
> I don't know if this helps, but the first time I tried the command I
> mistyped the hca name. This got me a warning, but no seg fault:
>
> $> mpirun -n 1 -mca btl_openib_receive_queues
> X,4096,1024:X,12288,512:X,65536,512 -mca btl_openib_if_include ml4_0
> ./hello_c.x
> --------------------------------------------------------------------------
> WARNING: One or more nonexistent OpenFabrics devices/ports were
> specified:
>
>   Host:                 mu0001
>   MCA parameter:        mca_btl_if_include
>   Nonexistent entities: ml4_0
>
> These entities will be ignored. You can disable this warning by
> setting the btl_openib_warn_nonexistent_if MCA parameter to 0.
> --------------------------------------------------------------------------
> Hello, world, I am 0 of 1, (Open MPI v1.10.2, package: Open MPI
> dshra...@mu-fey.lanl.gov Distribution, ident: 1.10.2, repo rev:
> v1.10.1-145-g799148f, Jan 21, 2016, 135)
>
> So, telling the openib btl to use the actual hca didn't make the seg
> fault go away, but giving it a dummy value did.
>
> Thanks,
> David
>
>
> On 04/20/2016 08:13 AM, Alina Sklarevich wrote:
>
> Hi David,
>
> I was able to reproduce the issue you reported.
>
> When the command line doesn't specify the components to use, ompi will
> try to load/open all the ones available (and close them in the end) and
> then choose the components according to their priority and whether or
> not they were opened successfully. This means that even if pml yalla was
> the one running, other components were opened and closed as well.
>
> The parameter you are using, btl_openib_receive_queues, doesn't have an
> effect on pml yalla. It only affects the openib btl, which is used by
> pml ob1.
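
(To expand on the btl_base_verbose point just below: this is roughly the command I have been using to watch the component selection and teardown happen. It is only a sketch -- the verbosity level of 100 is simply a value high enough to print the "select:" lines, so adjust it as needed:

    $> mpirun -n 1 -mca pml_base_verbose 100 -mca btl_base_verbose 100 \
       -mca btl_openib_receive_queues X,4096,1024:X,12288,512:X,65536,512 ./hello_c.x

The interesting part is whether the openib btl ever reaches its close/unload phase before the segv.)
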
> Using the verbosity of btl_base_verbose, I see that when the
> segmentation fault happens, the code doesn't reach the phase of
> unloading the openib btl, so perhaps the problem originates there (since
> pml yalla was already unloaded).
>
> Can you please try adding this mca parameter to your command line to
> specify the HCA you are using?
>
>     -mca btl_openib_if_include <hca>
>
> It made the segv go away for me.
>
> Can you please attach the output of ibv_devinfo and tell me the MOFED
> version you are using?
>
> Thank you,
> Alina.
>
> On Wed, Apr 20, 2016 at 2:53 PM, Joshua Ladd <jladd.m...@gmail.com> wrote:
>
>> Hi, David
>>
>> We are looking into your report.
>>
>> Best,
>>
>> Josh
>>
>> On Tue, Apr 19, 2016 at 4:41 PM, David Shrader <dshra...@lanl.gov> wrote:
>>
>>> Hello,
>>>
>>> I have been investigating using XRC on a cluster with a Mellanox
>>> interconnect. I have found that in a certain situation I get a seg
>>> fault. I am using 1.10.2 compiled with gcc 5.3.0, and the simplest
>>> configure line that I have found that still results in the seg fault
>>> is as follows:
>>>
>>> $> ./configure --with-hcoll --with-mxm --prefix=...
>>>
>>> I do have mxm 3.4.3065 and hcoll 3.3.768 installed into system space
>>> (/usr/lib64). If I use '--without-hcoll --without-mxm', the seg fault
>>> does not happen.
>>>
>>> The seg fault happens even when using examples/hello_c.c, so here is
>>> an example of the seg fault using it:
>>>
>>> $> mpicc hello_c.c -o hello_c.x
>>> $> mpirun -n 1 ./hello_c.x
>>> Hello, world, I am 0 of 1, (Open MPI v1.10.2, package: Open MPI
>>> dshra...@mu-fey.lanl.gov Distribution, ident: 1.10.2, repo rev:
>>> v1.10.1-145-g799148f, Jan 21, 2016, 135)
>>> $> mpirun -n 1 -mca btl_openib_receive_queues
>>> X,4096,1024:X,12288,512:X,65536,512 ./hello_c.x
>>> Hello, world, I am 0 of 1, (Open MPI v1.10.2, package: Open MPI
>>> dshra...@mu-fey.lanl.gov Distribution, ident: 1.10.2, repo rev:
>>> v1.10.1-145-g799148f, Jan 21, 2016, 135)
>>> --------------------------------------------------------------------------
>>> mpirun noticed that process rank 0 with PID 22819 on node mu0001
>>> exited on signal 11 (Segmentation fault).
>>> --------------------------------------------------------------------------
>>>
>>> The seg fault happens no matter the number of ranks. I have tried the
>>> above command with '-mca pml_base_verbose', and it shows that I am
>>> using the yalla pml:
>>>
>>> $> mpirun -n 1 -mca btl_openib_receive_queues
>>> X,4096,1024:X,12288,512:X,65536,512 -mca pml_base_verbose 100 ./hello_c.x
>>> ...output snipped...
>>> [mu0001.localdomain:22825] select: component cm not selected / finalized
>>> [mu0001.localdomain:22825] select: component ob1 not selected / finalized
>>> [mu0001.localdomain:22825] select: component yalla selected
>>> ...output snipped...
>>> --------------------------------------------------------------------------
>>> mpirun noticed that process rank 0 with PID 22825 on node mu0001
>>> exited on signal 11 (Segmentation fault).
>>> --------------------------------------------------------------------------
>>>
>>> Interestingly enough, if I tell mpirun what pml to use, the seg fault
>>> goes away. The following command does not get the seg fault:
>>>
>>> $> mpirun -n 1 -mca btl_openib_receive_queues
>>> X,4096,1024:X,12288,512:X,65536,512 -mca pml yalla ./hello_c.x
>>>
>>> Passing either ob1 or cm to '-mca pml' also works.
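
(One more datapoint that might help narrow this down: if the crash really is in the openib btl's open/close path, then keeping that btl from being opened at all should avoid it too. Something like the following is worth trying -- only a sketch, and I haven't verified it on David's setup; the '^' prefix excludes the named component rather than selecting it:

    $> mpirun -n 1 -mca btl ^openib \
       -mca btl_openib_receive_queues X,4096,1024:X,12288,512:X,65536,512 ./hello_c.x

If that runs clean while the default selection still seg faults, it points even more strongly at openib.)
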
>>> So it seems that the seg fault comes about when the yalla pml is
>>> chosen by default, when mxm/hcoll is involved, and when XRC is being
>>> used. I'm not sure if mxm is to blame, however, as using '-mca pml cm
>>> -mca mtl mxm' with the XRC parameters doesn't throw the seg fault.
>>>
>>> Other information...
>>>
>>> OS: RHEL 6.7-based (TOSS)
>>> OpenFabrics: RedHat provided
>>> Kernel: 2.6.32-573.8.1.2chaos.ch5.4.x86_64
>>>
>>> Config.log and 'ompi_info --all' are in the tarball ompi.tar.bz2,
>>> which is attached.
>>>
>>> Is there something else I should be doing with the yalla pml when
>>> using XRC? Regardless, I hope reporting the seg fault is useful.
>>>
>>> Thanks,
>>> David
>>>
>>> --
>>> David Shrader
>>> HPC-ENV High Performance Computer Systems
>>> Los Alamos National Lab
>>> Email: dshrader <at> lanl.gov
>>>
>>>
>>> _______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/devel/2016/04/18786.php
>>
>>
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2016/04/18788.php
>
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2016/04/18789.php
>
>
> --
> David Shrader
> HPC-ENV High Performance Computer Systems
> Los Alamos National Lab
> Email: dshrader <at> lanl.gov
>
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2016/04/18793.php