Having seen problems with mtl:ofi with "--enable-static --disable-shared",
I tried mtl:psm and mtl:mxm with those options as well.

The good news is that mtl:psm was fine, but the bad news is that when testing
mtl:mxm I encountered a new problem involving coll:hcoll.
Ralph probably wants to strangle me right now...


I am configuring the 1.10.0rc4 tarball with
   --prefix=[...] --enable-debug --with-verbs --enable-openib-connectx-xrc \
   --with-mxm=/opt/mellanox/mxm --with-hcoll=/opt/mellanox/hcoll \
   --enable-static --disable-shared

Everything was fine without those last two arguments.
When I add them the build is fine, and I can compile the examples.
However, I get a SEGV when running an example:

$ mpirun -np 2 examples/ring_c
[mir13:12444:0] Caught signal 11 (Segmentation fault)
[mir13:12445:0] Caught signal 11 (Segmentation fault)
==== backtrace ====
==== backtrace ====
 2 0x0000000000059d9c mxm_handle_error()
 /hpc/local/benchmarks/hpc-stack-gcc-Saturday/src/install/mxm-master/src/mxm/util/debug/debug.c:641
 3 0x0000000000059f0c mxm_error_signal_handler()
 /hpc/local/benchmarks/hpc-stack-gcc-Saturday/src/install/mxm-master/src/mxm/util/debug/debug.c:616
 4 0x0000003c2e0329a0 killpg()  ??:0
 5 0x0000000000528b51 opal_list_remove_last()
 /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/opal/class/opal_list.h:721
 6 0x0000000000529872 base_bcol_basesmuma_setup_library_buffers()
 /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/bcol/basesmuma/bcol_basesmuma_setup.c:537
 7 0x000000000009e983 hmca_bcol_basesmuma_comm_query()  ??:0
 8 0x00000000000348e3 hmca_coll_ml_tree_hierarchy_discovery()
 coll_ml_module.c:0
 9 0x00000000000317a2 hmca_coll_ml_comm_query()  ??:0
10 0x000000000006c929 hcoll_create_context()  ??:0
11 0x00000000004a248f mca_coll_hcoll_comm_query()
 /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/coll/hcoll/coll_hcoll_module.c:290
12 0x000000000047c82f query_2_0_0()
 /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/coll/base/coll_base_comm_select.c:392
13 0x000000000047c7ee query()
 /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/coll/base/coll_base_comm_select.c:375
14 0x000000000047c704 check_one_component()
 /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/coll/base/coll_base_comm_select.c:337
15 0x000000000047c567 check_components()
 /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/coll/base/coll_base_comm_select.c:301
16 0x000000000047552a mca_coll_base_comm_select()
 /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/coll/base/coll_base_comm_select.c:131
17 0x0000000000428476 ompi_mpi_init()
 /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/runtime/ompi_mpi_init.c:894
18 0x0000000000431ba5 PMPI_Init()
 /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/BLD/ompi/mpi/c/profile/pinit.c:84
19 0x000000000040abce main()
 /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/BLD/examples/ring_c.c:19
20 0x0000003c2e01ed1d __libc_start_main()  ??:0
21 0x000000000040aae9 _start()  ??:0
===================
 2 0x0000000000059d9c mxm_handle_error()
 
/hpc/local/benchmarks/hpc-stack-gcc-Saturday/src/install/mxm-master/src/mxm/util/debug/debug.c:641
 3 0x0000000000059f0c mxm_error_signal_handler()
 
/hpc/local/benchmarks/hpc-stack-gcc-Saturday/src/install/mxm-master/src/mxm/util/debug/debug.c:616
 4 0x0000003c2e0329a0 killpg()  ??:0
 5 0x0000000000528b51 opal_list_remove_last()
 
/hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/opal/class/opal_list.h:721
 6 0x0000000000529872 base_bcol_basesmuma_setup_library_buffers()
 
/hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/bcol/basesmuma/bcol_basesmuma_setup.c:537
 7 0x000000000009e983 hmca_bcol_basesmuma_comm_query()  ??:0
 8 0x00000000000348e3 hmca_coll_ml_tree_hierarchy_discovery()
 coll_ml_module.c:0
 9 0x00000000000317a2 hmca_coll_ml_comm_query()  ??:0
10 0x000000000006c929 hcoll_create_context()  ??:0
11 0x00000000004a248f mca_coll_hcoll_comm_query()
 
/hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/coll/hcoll/coll_hcoll_module.c:290
12 0x000000000047c82f query_2_0_0()
 
/hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/coll/base/coll_base_comm_select.c:392
13 0x000000000047c7ee query()
 
/hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/coll/base/coll_base_comm_select.c:375
14 0x000000000047c704 check_one_component()
 
/hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/coll/base/coll_base_comm_select.c:337
15 0x000000000047c567 check_components()
 
/hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/coll/base/coll_base_comm_select.c:301
16 0x000000000047552a mca_coll_base_comm_select()
 
/hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/coll/base/coll_base_comm_select.c:131
17 0x0000000000428476 ompi_mpi_init()
 
/hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/runtime/ompi_mpi_init.c:894
18 0x0000000000431ba5 PMPI_Init()
 
/hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/BLD/ompi/mpi/c/profile/pinit.c:84
19 0x000000000040abce main()
 
/hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/BLD/examples/ring_c.c:19
20 0x0000003c2e01ed1d __libc_start_main()  ??:0
21 0x000000000040aae9 _start()  ??:0
===================
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 12445 on node mir13 exited on
signal 13 (Broken pipe).
--------------------------------------------------------------------------

This is reproducible.
A run with "-np 1" is fine.
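
One way to narrow this down would be to rerun with the hcoll component
excluded from the coll framework. This is only a sketch and has not been
tried on this build. It assumes the usual MCA selection syntax, where a
leading "^" excludes the named component; the second command is the original
failing run, kept for comparison.

```shell
# Untested sketch: exclude coll:hcoll but keep the rest of the static build.
# If this run completes, the SEGV is specific to the hcoll path.
mpirun --mca coll ^hcoll -np 2 examples/ring_c

# Original failing invocation, for comparison.
mpirun -np 2 examples/ring_c
```

If the first command succeeds, that would point at the interaction between
the static build and hcoll rather than at mtl:mxm itself.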

-Paul

-- 
Paul H. Hargrove                          phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department               Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
