Paul,

If ompi is built statically or with --disable-dlopen, I do not think --mca coll ^ml can prevent the crash (assuming this is the same issue we discussed before). Note that if you build dynamically and without --disable-dlopen, it might or might not crash, depending on how the modules are enumerated, and that is specific to each system.
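For reference, here is a minimal sketch of the check and the rebuild discussed further down in this thread (the exact library path under /opt/mellanox/hcoll and the grep pattern are guesses on my side, so adjust them to your actual install):

    # does libhcoll still embed its own coll_ml symbols?
    nm -D /opt/mellanox/hcoll/lib/libhcoll.so | grep ' coll_ml'

    # if it does, rebuild ompi with its own coll/ml component ignored
    # (requires recent autotools in the source tree)
    touch ompi/mca/coll/ml/.ompi_ignore
    ./autogen.pl
    ./configure ...   # same options as before
    make && make install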
So at this stage, I cannot say whether this is a different issue or not. If the crash still occurs with .ompi_ignore in coll ml, then I could conclude this is a different issue.

Cheers,

Gilles

On Sunday, August 23, 2015, Paul Hargrove <phhargr...@lbl.gov> wrote:
> Gilles,
>
> This is on Mellanox's own system, where /opt/mellanox/hcoll was updated Aug 2.
> This problem also does not occur unless I build libmpi statically.
> A run of "mpirun -mca coll ^ml -np 2 examples/ring_c" still crashes.
> So, I really don't know if this is the same issue, but I suspect that it is not.
>
> -Paul
>
> On Sat, Aug 22, 2015 at 6:00 PM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>> Paul,
>>
>> Isn't this an issue that was already discussed?
>> Mellanox's proprietary hcoll library includes its own coll ml module that conflicts with the ompi one.
>> The Mellanox folks fixed this internally, but I am not sure the fix has been released.
>> You can run
>>     nm libhcoll.so
>> and if there are any symbols starting with coll_ml, then the issue is still there.
>> If you have time and recent autotools, you can
>>     touch ompi/mca/coll/ml/.ompi_ignore
>>     ./autogen.pl
>>     make ...
>> and that should be fine.
>>
>> If you configured with dynamic libraries and without --disable-dlopen, then
>>     mpirun --mca coll ^ml ...
>> is enough to work around the issue.
>>
>> Cheers,
>>
>> Gilles
>>
>> On Sunday, August 23, 2015, Paul Hargrove <phhargr...@lbl.gov> wrote:
>>> Having seen problems with mtl:ofi with "--enable-static --disable-shared", I tried mtl:psm and mtl:mxm with those options as well.
>>>
>>> The good news is that mtl:psm was fine, but the bad news is that when testing mtl:mxm I encountered a new problem involving coll:hcoll.
>>> Ralph probably wants to strangle me right now...
>>>
>>> I am configuring the 1.10.0rc4 tarball with
>>>     --prefix=[...] --enable-debug --with-verbs --enable-openib-connectx-xrc \
>>>     --with-mxm=/opt/mellanox/mxm --with-hcoll=/opt/mellanox/hcoll \
>>>     --enable-static --disable-shared
>>>
>>> Everything was fine without those last two arguments.
>>> When I add them the build is fine, and I can compile the examples.
>>> However, I get a SEGV when running an example:
>>>
>>> $ mpirun -np 2 examples/ring_c
>>> [mir13:12444:0] Caught signal 11 (Segmentation fault)
>>> [mir13:12445:0] Caught signal 11 (Segmentation fault)
>>> ==== backtrace ====
>>>  2 0x0000000000059d9c mxm_handle_error()  /hpc/local/benchmarks/hpc-stack-gcc-Saturday/src/install/mxm-master/src/mxm/util/debug/debug.c:641
>>>  3 0x0000000000059f0c mxm_error_signal_handler()  /hpc/local/benchmarks/hpc-stack-gcc-Saturday/src/install/mxm-master/src/mxm/util/debug/debug.c:616
>>>  4 0x0000003c2e0329a0 killpg()  ??:0
>>>  5 0x0000000000528b51 opal_list_remove_last()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/opal/class/opal_list.h:721
>>>  6 0x0000000000529872 base_bcol_basesmuma_setup_library_buffers()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/bcol/basesmuma/bcol_basesmuma_setup.c:537
>>>  7 0x000000000009e983 hmca_bcol_basesmuma_comm_query()  ??:0
>>>  8 0x00000000000348e3 hmca_coll_ml_tree_hierarchy_discovery()  coll_ml_module.c:0
>>>  9 0x00000000000317a2 hmca_coll_ml_comm_query()  ??:0
>>> 10 0x000000000006c929 hcoll_create_context()  ??:0
>>> 11 0x00000000004a248f mca_coll_hcoll_comm_query()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/coll/hcoll/coll_hcoll_module.c:290
>>> 12 0x000000000047c82f query_2_0_0()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/coll/base/coll_base_comm_select.c:392
>>> 13 0x000000000047c7ee query()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/coll/base/coll_base_comm_select.c:375
>>> 14 0x000000000047c704 check_one_component()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/coll/base/coll_base_comm_select.c:337
>>> 15 0x000000000047c567 check_components()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/coll/base/coll_base_comm_select.c:301
>>> 16 0x000000000047552a mca_coll_base_comm_select()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/coll/base/coll_base_comm_select.c:131
>>> 17 0x0000000000428476 ompi_mpi_init()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/runtime/ompi_mpi_init.c:894
>>> 18 0x0000000000431ba5 PMPI_Init()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/BLD/ompi/mpi/c/profile/pinit.c:84
>>> 19 0x000000000040abce main()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/BLD/examples/ring_c.c:19
>>> 20 0x0000003c2e01ed1d __libc_start_main()  ??:0
>>> 21 0x000000000040aae9 _start()  ??:0
>>> ===================
>>> [The backtrace printed by the second process is identical and is omitted here.]
>>> --------------------------------------------------------------------------
>>> mpirun noticed that process rank 1 with PID 12445 on node mir13 exited
>>> on signal 13 (Broken pipe).
>>> --------------------------------------------------------------------------
>>>
>>> This is reproducible.
>>> A run with "-np 1" is fine.
>>>
>>> -Paul
>>>
>>> --
>>> Paul H. Hargrove  phhargr...@lbl.gov
>>> Computer Languages & Systems Software (CLaSS) Group
>>> Computer Science Department             Tel: +1-510-495-2352
>>> Lawrence Berkeley National Laboratory   Fax: +1-510-486-6900
>>
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post: http://www.open-mpi.org/community/lists/devel/2015/08/17795.php
>
>
> --
> Paul H. Hargrove  phhargr...@lbl.gov
> Computer Languages & Systems Software (CLaSS) Group
> Computer Science Department             Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory   Fax: +1-510-486-6900