David, thanks for the info you provided.
I will try to dig in further to see what might be causing this issue.
In the meantime, perhaps Nathan can comment on the openib btl
behavior here?
Thanks,
Alina.
On Wed, Apr 20, 2016 at 8:01 PM, David Shrader <dshra...@lanl.gov> wrote:
Hello Alina,
Thank you for the information about how the pml components work. I knew
that the other components were being opened and ultimately closed in
favor of yalla, but I didn't realize that the initial open could cause a
persistent change in the OMPI runtime.
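As an aside, if pinning the pml is the right way to avoid the extra
opens, I'd assume the environment-variable form of the MCA parameter
behaves the same as the command-line flag (standard Open MPI behavior,
if I understand it correctly):
$> export OMPI_MCA_pml=yalla
$> mpirun -n 1 ./hello_c.x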
Here's the information you requested about the IB network:
- MOFED version:
We are using the OpenFabrics software as bundled by Red Hat, and my IB
network folks say we're running something close to v1.5.4.
- ibv_devinfo:
[dshrader@mu0001 examples]$ ibv_devinfo
hca_id: mlx4_0
        transport:              InfiniBand (0)
        fw_ver:                 2.9.1000
        node_guid:              0025:90ff:ff16:78d8
        sys_image_guid:         0025:90ff:ff16:78db
        vendor_id:              0x02c9
        vendor_part_id:         26428
        hw_ver:                 0xB0
        board_id:               SM_2121000001000
        phys_port_cnt:          1
                port:   1
                        state:          PORT_ACTIVE (4)
                        max_mtu:        4096 (5)
                        active_mtu:     4096 (5)
                        sm_lid:         250
                        port_lid:       366
                        port_lmc:       0x00
                        link_layer:     InfiniBand
I still get the seg fault when specifying the HCA:
$> mpirun -n 1 -mca btl_openib_receive_queues
X,4096,1024:X,12288,512:X,65536,512 -mca btl_openib_if_include mlx4_0
./hello_c.x
Hello, world, I am 0 of 1, (Open MPI v1.10.2, package: Open MPI
dshra...@mu-fey.lanl.gov Distribution, ident: 1.10.2, repo rev:
v1.10.1-145-g799148f, Jan 21, 2016, 135)
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 10045 on node mu0001 exited
on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
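In case a backtrace would help, here's a sketch of how I plan to
capture one (assuming core dumps are enabled on the node; the core
file name pattern is system-dependent):
$> ulimit -c unlimited
$> mpirun -n 1 -mca btl_openib_receive_queues
X,4096,1024:X,12288,512:X,65536,512 -mca btl_openib_if_include mlx4_0
./hello_c.x
$> gdb ./hello_c.x core.<pid>
(gdb) bt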
I don't know if this helps, but the first time I tried the command I
mistyped the HCA name. That got me a warning, but no seg fault:
$> mpirun -n 1 -mca btl_openib_receive_queues
X,4096,1024:X,12288,512:X,65536,512 -mca btl_openib_if_include ml4_0
./hello_c.x
--------------------------------------------------------------------------
WARNING: One or more nonexistent OpenFabrics devices/ports were specified:

  Host:                 mu0001
  MCA parameter:        mca_btl_if_include
  Nonexistent entities: ml4_0

These entities will be ignored. You can disable this warning by
setting the btl_openib_warn_nonexistent_if MCA parameter to 0.
--------------------------------------------------------------------------
Hello, world, I am 0 of 1, (Open MPI v1.10.2, package: Open MPI
dshra...@mu-fey.lanl.gov Distribution, ident: 1.10.2, repo rev:
v1.10.1-145-g799148f, Jan 21, 2016, 135)
So, telling the openib btl to use the actual HCA didn't make the seg
fault go away, but giving it a dummy value did.
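If the dummy name works because the openib btl disqualifies itself when
it finds no usable ports, I'd expect explicitly excluding that btl to
behave the same way. A quick sanity check I can try (the '^' prefix
excludes components; standard MCA syntax):
$> mpirun -n 1 -mca btl ^openib ./hello_c.x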
Thanks,
David
On 04/20/2016 08:13 AM, Alina Sklarevich wrote:
Hi David,
I was able to reproduce the issue you reported.
When the command line doesn't specify which components to use, OMPI
will try to load/open all of the available ones (and close the unused
ones in the end), then choose among them according to their priority
and whether or not they opened successfully. This means that even
though pml yalla was the one selected to run, the other components were
opened and closed as well.
The parameter you are using, btl_openib_receive_queues, has no effect
on pml yalla; it only affects the openib btl, which is used by pml ob1.
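To actually exercise that parameter, you would have to run over ob1 and
the openib btl, e.g. something like:
$> mpirun -n 1 -mca pml ob1 -mca btl_openib_receive_queues
X,4096,1024:X,12288,512:X,65536,512 ./hello_c.x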
With btl_base_verbose enabled, I can see that when the segmentation
fault happens, the code never reaches the phase of unloading the openib
btl, so perhaps the problem originates there (pml yalla had already
been unloaded by that point).
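For reference, the verbose run I used looked roughly like this
(btl_base_verbose prints the btl component open/select/close phases):
$> mpirun -n 1 -mca btl_base_verbose 100 -mca btl_openib_receive_queues
X,4096,1024:X,12288,512:X,65536,512 ./hello_c.x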
Can you please try adding this MCA parameter to your command line to
specify the HCA you are using?
-mca btl_openib_if_include <hca>
It made the segv go away for me.
Can you please attach the output of ibv_devinfo and tell me which MOFED
version you are using?
Thank you,
Alina.
On Wed, Apr 20, 2016 at 2:53 PM, Joshua Ladd <jladd.m...@gmail.com>
wrote:
Hi, David
We are looking into your report.
Best,
Josh
On Tue, Apr 19, 2016 at 4:41 PM, David Shrader <dshra...@lanl.gov>
wrote:
Hello,
I have been investigating using XRC on a cluster with a Mellanox
interconnect and have found that in a certain situation I get a seg
fault. I am using Open MPI 1.10.2 compiled with gcc 5.3.0, and the
simplest configure line that I have found that still results in the seg
fault is as follows:
$> ./configure --with-hcoll --with-mxm --prefix=...
I do have mxm 3.4.3065 and hcoll 3.3.768 installed into system space
(/usr/lib64). If I use '--without-hcoll --without-mxm', the seg fault
does not happen.
The seg fault happens even when using examples/hello_c.c, so here is an
example of the seg fault using it (the X entries in
btl_openib_receive_queues request XRC receive queues):
$> mpicc hello_c.c -o hello_c.x
$> mpirun -n 1 ./hello_c.x
Hello, world, I am 0 of 1, (Open MPI v1.10.2, package: Open MPI
dshra...@mu-fey.lanl.gov Distribution, ident: 1.10.2, repo rev:
v1.10.1-145-g799148f, Jan 21, 2016, 135)
$> mpirun -n 1 -mca btl_openib_receive_queues
X,4096,1024:X,12288,512:X,65536,512 ./hello_c.x
Hello, world, I am 0 of 1, (Open MPI v1.10.2, package: Open MPI
dshra...@mu-fey.lanl.gov Distribution, ident: 1.10.2, repo rev:
v1.10.1-145-g799148f, Jan 21, 2016, 135)
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 22819 on node mu0001
exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
The seg fault happens regardless of the number of ranks. I have tried
the above command with '-mca pml_base_verbose 100', and it shows that I
am using the yalla pml:
$> mpirun -n 1 -mca btl_openib_receive_queues
X,4096,1024:X,12288,512:X,65536,512 -mca pml_base_verbose 100
./hello_c.x
...output snipped...
[mu0001.localdomain:22825] select: component cm not selected /
finalized
[mu0001.localdomain:22825] select: component ob1 not selected /
finalized
[mu0001.localdomain:22825] select: component yalla selected
...output snipped...
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 22825 on node mu0001
exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
Interestingly enough, if I tell mpirun which pml to use, the seg fault
goes away. The following command does not hit the seg fault:
$> mpirun -n 1 -mca btl_openib_receive_queues
X,4096,1024:X,12288,512:X,65536,512 -mca pml yalla ./hello_c.x
Passing either ob1 or cm to '-mca pml' also works. So it seems that the
seg fault comes about when the yalla pml is chosen by default,
mxm/hcoll is involved, and XRC receive queues are requested. I'm not
sure mxm is to blame, however, as using '-mca pml cm -mca mtl mxm' with
the XRC parameters doesn't trigger the seg fault.
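If forcing the pml turns out to be the right workaround, I assume it
could also be set in an MCA parameter file instead of on every command
line (per-user file; standard Open MPI configuration, as I understand
it):
$> mkdir -p $HOME/.openmpi
$> echo "pml = yalla" >> $HOME/.openmpi/mca-params.conf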
Other information...
OS: RHEL 6.7-based (TOSS)
OpenFabrics: RedHat provided
Kernel: 2.6.32-573.8.1.2chaos.ch5.4.x86_64
Config.log and the output of 'ompi_info --all' are in the attached
tarball, ompi.tar.bz2.
Is there something else I should be doing with the yalla pml when
using XRC? Regardless, I hope reporting the seg fault is useful.
Thanks,
David
--
David Shrader
HPC-ENV High Performance Computer Systems
Los Alamos National Lab
Email: dshrader <at> lanl.gov
--
David Shrader
HPC-ENV High Performance Computer Systems
Los Alamos National Lab
Email: dshrader <at> lanl.gov