Paul,

Isn't this an issue that was already discussed? The Mellanox proprietary hcoll library includes its own coll/ml module, which conflicts with the Open MPI one. The Mellanox folks fixed this internally, but I am not sure the fix has been released yet. You can run nm on libhcoll.so: if there are symbols starting with coll_ml, then the issue is still there.
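For that check, something like the following should do (the library path is only a guess based on your --with-hcoll prefix, adjust as needed):

    # list dynamic symbols and keep only names beginning with coll_ml
    # (the leading space in the pattern skips prefixed names such as hmca_coll_ml_*)
    $ nm -D /opt/mellanox/hcoll/lib/libhcoll.so | grep ' coll_ml'

If that prints anything, the conflicting coll_ml symbols are still exported and the issue is still there; no output should mean you have a fixed library.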
If you have time and recent autotools, you can disable the Open MPI coll/ml component at build time:

    touch ompi/mca/coll/ml/.ompi_ignore
    ./autogen.pl
    make ...

and that should be fine.

If you configured with dynamic libraries and without --disable-dlopen, then

    mpirun --mca coll ^ml ...

is enough to work around the issue.
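With your reproducer that would be, for example:

    $ mpirun -np 2 --mca coll ^ml examples/ring_c

But note that your failing build used --enable-static --disable-shared, so that condition does not hold there, and the .ompi_ignore rebuild above is probably the workaround you need.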
Cheers,

Gilles

On Sunday, August 23, 2015, Paul Hargrove <phhargr...@lbl.gov> wrote:
> Having seen problems with mtl:ofi with "--enable-static --disable-shared",
> I tried mtl:psm and mtl:mxm with those options as well.
>
> The good news is that mtl:psm was fine, but the bad news is that when
> testing mtl:mxm I encountered a new problem involving coll:hcoll.
> Ralph probably wants to strangle me right now...
>
> I am configuring the 1.10.0rc4 tarball with
>     --prefix=[...] --enable-debug --with-verbs --enable-openib-connectx-xrc \
>     --with-mxm=/opt/mellanox/mxm --with-hcoll=/opt/mellanox/hcoll \
>     --enable-static --disable-shared
>
> Everything was fine without those last two arguments.
> When I add them the build is fine, and I can compile the examples.
> However, I get a SEGV when running an example:
>
> $ mpirun -np 2 examples/ring_c
> [mir13:12444:0] Caught signal 11 (Segmentation fault)
> [mir13:12445:0] Caught signal 11 (Segmentation fault)
> ==== backtrace ====
>  2 0x0000000000059d9c mxm_handle_error()  /hpc/local/benchmarks/hpc-stack-gcc-Saturday/src/install/mxm-master/src/mxm/util/debug/debug.c:641
>  3 0x0000000000059f0c mxm_error_signal_handler()  /hpc/local/benchmarks/hpc-stack-gcc-Saturday/src/install/mxm-master/src/mxm/util/debug/debug.c:616
>  4 0x0000003c2e0329a0 killpg()  ??:0
>  5 0x0000000000528b51 opal_list_remove_last()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/opal/class/opal_list.h:721
>  6 0x0000000000529872 base_bcol_basesmuma_setup_library_buffers()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/bcol/basesmuma/bcol_basesmuma_setup.c:537
>  7 0x000000000009e983 hmca_bcol_basesmuma_comm_query()  ??:0
>  8 0x00000000000348e3 hmca_coll_ml_tree_hierarchy_discovery()  coll_ml_module.c:0
>  9 0x00000000000317a2 hmca_coll_ml_comm_query()  ??:0
> 10 0x000000000006c929 hcoll_create_context()  ??:0
> 11 0x00000000004a248f mca_coll_hcoll_comm_query()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/coll/hcoll/coll_hcoll_module.c:290
> 12 0x000000000047c82f query_2_0_0()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/coll/base/coll_base_comm_select.c:392
> 13 0x000000000047c7ee query()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/coll/base/coll_base_comm_select.c:375
> 14 0x000000000047c704 check_one_component()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/coll/base/coll_base_comm_select.c:337
> 15 0x000000000047c567 check_components()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/coll/base/coll_base_comm_select.c:301
> 16 0x000000000047552a mca_coll_base_comm_select()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/coll/base/coll_base_comm_select.c:131
> 17 0x0000000000428476 ompi_mpi_init()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/runtime/ompi_mpi_init.c:894
> 18 0x0000000000431ba5 PMPI_Init()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/BLD/ompi/mpi/c/profile/pinit.c:84
> 19 0x000000000040abce main()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/BLD/examples/ring_c.c:19
> 20 0x0000003c2e01ed1d __libc_start_main()  ??:0
> 21 0x000000000040aae9 _start()  ??:0
> ===================
> --------------------------------------------------------------------------
> mpirun noticed that process rank 1 with PID 12445 on node mir13 exited on
> signal 13 (Broken pipe).
> --------------------------------------------------------------------------
>
> This is reproducible.
> A run with "-np 1" is fine.
>
> -Paul
>
> --
> Paul H. Hargrove                          phhargr...@lbl.gov
> Computer Languages & Systems Software (CLaSS) Group
> Computer Science Department               Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900