Paul,

If ompi is built statically or with --disable-dlopen, I do not think "--mca
coll ^ml" can prevent the crash (assuming this is the same issue we
discussed before).
Note that if you build dynamically and without --disable-dlopen, it might
or might not crash, depending on how the modules are enumerated, and that
is specific to each system.

So at this stage, I cannot tell whether this is a different issue or not.
If the crash still occurs with .ompi_ignore in coll ml, then I would
conclude this is a different issue.
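
For reference, the test I have in mind is roughly the following (just a
sketch; reuse the same configure options as your failing static build):
touch ompi/mca/coll/ml/.ompi_ignore
./autogen.pl
./configure --enable-static --disable-shared ...
make && make install
mpirun -np 2 examples/ring_c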

Cheers,

Gilles

On Sunday, August 23, 2015, Paul Hargrove <phhargr...@lbl.gov> wrote:

> Gilles,
>
> This is on Mellanox's own system, where /opt/mellanox/hcoll was updated
> Aug 2.
> This problem also did not occur unless I built libmpi statically.
> A run of "mpirun -mca coll ^ml -np 2 examples/ring_c" still crashes.
> So, I really don't know if this is the same issue, but suspect that it is
> not.
>
> -Paul
>
> On Sat, Aug 22, 2015 at 6:00 PM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>
>> Paul,
>>
>> isn't this an issue that was already discussed?
>> Mellanox's proprietary hcoll library includes its own coll ml module
>> that conflicts with the ompi one.
>> The Mellanox folks fixed this internally, but I am not sure the fix has
>> been released yet.
>> You can run
>> nm libhcoll.so
>> If there are any symbols starting with coll_ml, then the issue is still
>> there.
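>> For example, something like this (just a sketch; I am assuming the
>> library sits under your --with-hcoll prefix, /opt/mellanox/hcoll/lib):
>> nm -D /opt/mellanox/hcoll/lib/libhcoll.so | grep coll_ml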
>> If you have time and recent autotools, you can
>> touch ompi/mca/coll/ml/.ompi_ignore
>> ./autogen.pl
>> make ...
>> and that should be fine.
>>
>> If you configured with dynamic libraries and without --disable-dlopen,
>> then
>> mpirun --mca coll ^ml ...
>> is enough to work around the issue.
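>> e.g. something like (for illustration only):
>> mpirun --mca coll ^ml -np 2 examples/ring_c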
>>
>> Cheers,
>>
>> Gilles
>>
>> On Sunday, August 23, 2015, Paul Hargrove <phhargr...@lbl.gov> wrote:
>>
>>> Having seen problems with mtl:ofi with "--enable-static
>>> --disable-shared", I tried mtl:psm and mtl:mxm with those options as well.
>>>
>>> The good news is that mtl:psm was fine, but the bad news is that when
>>> testing mtl:mxm I encountered a new problem involving coll:hcoll.
>>> Ralph probably wants to strangle me right now...
>>>
>>>
>>> I am configuring the 1.10.0rc4 tarball with
>>>    --prefix=[...] --enable-debug --with-verbs
>>> --enable-openib-connectx-xrc \
>>>    --with-mxm=/opt/mellanox/mxm --with-hcoll=/opt/mellanox/hcoll \
>>>    --enable-static --disable-shared
>>>
>>> Everything was fine without those last two arguments.
>>> When I add them, the build is fine and I can compile the examples.
>>> However, I get a SEGV when running an example:
>>>
>>> $mpirun -np 2 examples/ring_c
>>> [mir13:12444:0] Caught signal 11 (Segmentation fault)
>>> [mir13:12445:0] Caught signal 11 (Segmentation fault)
>>> ==== backtrace ====
>>> ==== backtrace ====
>>>  2 0x0000000000059d9c mxm_handle_error()
>>>  /hpc/local/benchmarks/hpc-stack-gcc-Saturday/src/install/mxm-master/
>>> src/mxm/util/debug/debug.c:641
>>>  3 0x0000000000059f0c mxm_error_signal_handler()
>>>  /hpc/local/benchmarks/hpc-stack-gcc-Saturday/src/install/mxm
>>> -master/src/mxm/util/debug/debug.c:616
>>>  4 0x0000003c2e0329a0 killpg()  ??:0
>>>  5 0x0000000000528b51 opal_list_remove_last()
>>>  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux
>>> -x86_64-mxm-static/openmpi-1.10.0rc4/opal/class/opal_list.h:721
>>>  6 0x0000000000529872 base_bcol_basesmuma_setup_library_buffers()
>>>  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/ope
>>>
>>> nmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/bcol/basesmuma/bcol_basesmuma_setup.c:537
>>>  7 0x000000000009e983 hmca_bcol_basesmuma_comm_query()  ??:0
>>>  8 0x00000000000348e3 hmca_coll_ml_tree_hierarchy_discovery()
>>>  coll_ml_module.c:0
>>>  9 0x00000000000317a2 hmca_coll_ml_comm_query()  ??:0
>>> 10 0x000000000006c929 hcoll_create_context()  ??:0
>>> 11 0x00000000004a248f mca_coll_hcoll_comm_query()
>>>  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-l
>>>
>>> inux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/coll/hcoll/coll_hcoll_module.c:290
>>> 12 0x000000000047c82f query_2_0_0()
>>>  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mx
>>> m-static/openmpi-1.10.0rc4/ompi/mca/coll/base/coll_base_comm_select.c:392
>>> 13 0x000000000047c7ee query()
>>>  
>>> /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-stat
>>> ic/openmpi-1.10.0rc4/ompi/mca/coll/base/coll_base_comm_select.c:375
>>> 14 0x000000000047c704 check_one_component()
>>>  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x
>>>
>>> 86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/coll/base/coll_base_comm_select.c:337
>>> 15 0x000000000047c567 check_components()
>>>  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_
>>>
>>> 64-mxm-static/openmpi-1.10.0rc4/ompi/mca/coll/base/coll_base_comm_select.c:301
>>> 16 0x000000000047552a mca_coll_base_comm_select()
>>>  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-l
>>>
>>> inux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/coll/base/coll_base_comm_select.c:131
>>> 17 0x0000000000428476 ompi_mpi_init()
>>>  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-
>>> mxm-static/openmpi-1.10.0rc4/ompi/runtime/ompi_mpi_init.c:894
>>> 18 0x0000000000431ba5 PMPI_Init()
>>>  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-
>>> static/BLD/ompi/mpi/c/profile/pinit.c:84
>>> 19 0x000000000040abce main()
>>>  
>>> /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-stati
>>> c/BLD/examples/ring_c.c:19
>>> 20 0x0000003c2e01ed1d __libc_start_main()  ??:0
>>> 21 0x000000000040aae9 _start()  ??:0
>>> ===================
>>>  2 0x0000000000059d9c mxm_handle_error()
>>>  
>>> /hpc/local/benchmarks/hpc-stack-gcc-Saturday/src/install/mxm-master/src/mxm/util/debug/debug.c:641
>>>  3 0x0000000000059f0c mxm_error_signal_handler()
>>>  
>>> /hpc/local/benchmarks/hpc-stack-gcc-Saturday/src/install/mxm-master/src/mxm/util/debug/debug.c:616
>>>  4 0x0000003c2e0329a0 killpg()  ??:0
>>>  5 0x0000000000528b51 opal_list_remove_last()
>>>  
>>> /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/opal/class/opal_list.h:721
>>>  6 0x0000000000529872 base_bcol_basesmuma_setup_library_buffers()
>>>  
>>> /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/bcol/basesmuma/bcol_basesmuma_setup.c:537
>>>  7 0x000000000009e983 hmca_bcol_basesmuma_comm_query()  ??:0
>>>  8 0x00000000000348e3 hmca_coll_ml_tree_hierarchy_discovery()
>>>  coll_ml_module.c:0
>>>  9 0x00000000000317a2 hmca_coll_ml_comm_query()  ??:0
>>> 10 0x000000000006c929 hcoll_create_context()  ??:0
>>> 11 0x00000000004a248f mca_coll_hcoll_comm_query()
>>>  
>>> /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/coll/hcoll/coll_hcoll_module.c:290
>>> 12 0x000000000047c82f query_2_0_0()
>>>  
>>> /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/coll/base/coll_base_comm_select.c:392
>>> 13 0x000000000047c7ee query()
>>>  
>>> /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/coll/base/coll_base_comm_select.c:375
>>> 14 0x000000000047c704 check_one_component()
>>>  
>>> /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/coll/base/coll_base_comm_select.c:337
>>> 15 0x000000000047c567 check_components()
>>>  
>>> /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/coll/base/coll_base_comm_select.c:301
>>> 16 0x000000000047552a mca_coll_base_comm_select()
>>>  
>>> /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/coll/base/coll_base_comm_select.c:131
>>> 17 0x0000000000428476 ompi_mpi_init()
>>>  
>>> /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/runtime/ompi_mpi_init.c:894
>>> 18 0x0000000000431ba5 PMPI_Init()
>>>  
>>> /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/BLD/ompi/mpi/c/profile/pinit.c:84
>>> 19 0x000000000040abce main()
>>>  
>>> /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/BLD/examples/ring_c.c:19
>>> 20 0x0000003c2e01ed1d __libc_start_main()  ??:0
>>> 21 0x000000000040aae9 _start()  ??:0
>>> ===================
>>>
>>> --------------------------------------------------------------------------
>>> mpirun noticed that process rank 1 with PID 12445 on node mir13 exited
>>> on signal 13 (Broken pipe).
>>>
>>> --------------------------------------------------------------------------
>>>
>>> This is reproducible.
>>> A run with "-np 1" is fine.
>>>
>>> -Paul
>>>
>>> --
>>> Paul H. Hargrove                          phhargr...@lbl.gov
>>> Computer Languages & Systems Software (CLaSS) Group
>>> Computer Science Department               Tel: +1-510-495-2352
>>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>>>
>>
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2015/08/17795.php
>>
>
>
>
> --
> Paul H. Hargrove                          phhargr...@lbl.gov
> Computer Languages & Systems Software (CLaSS) Group
> Computer Science Department               Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>
