Brian,

Things look much better with this patch. We need it for the 3.0.0 release.
The patch from PR 3794 applied cleanly from master.

Howard


2017-06-29 16:51 GMT-06:00 r...@open-mpi.org <r...@open-mpi.org>:

> I tracked down a possible source of the oob/tcp error - this should
> address it, I think: https://github.com/open-mpi/ompi/pull/3794
>
> On Jun 29, 2017, at 3:14 PM, Howard Pritchard <hpprit...@gmail.com> wrote:
>
> Hi Brian,
>
> I tested this rc using both srun native launch and mpirun on the following
> systems:
> - LANL CTS-1 systems (haswell + Intel OPA/PSM2)
> - LANL network testbed system (haswell  + connectX5/UCX and OB1)
> - LANL Cray XC
>
> I am finding some problems with mpirun on the network testbed system.
>
> For example, for spawn_with_env_vars from IBM tests:
>
> *** Error in `mpirun': corrupted double-linked list: 0x00000000006e75b0 ***
> ======= Backtrace: =========
> /usr/lib64/libc.so.6(+0x7bea2)[0x7ffff6597ea2]
> /usr/lib64/libc.so.6(+0x7cec6)[0x7ffff6598ec6]
> /home/hpp/openmpi_3.0.0rc1_install/lib/libopen-pal.so.40(opal_proc_table_remove_all+0x91)[0x7ffff7855851]
> /home/hpp/openmpi_3.0.0rc1_install/lib/openmpi/mca_oob_ud.so(+0x5e09)[0x7ffff3cc0e09]
> /home/hpp/openmpi_3.0.0rc1_install/lib/openmpi/mca_oob_ud.so(+0x5952)[0x7ffff3cc0952]
> /home/hpp/openmpi_3.0.0rc1_install/lib/libopen-rte.so.40(+0x6b032)[0x7ffff7b94032]
> /home/hpp/openmpi_3.0.0rc1_install/lib/libopen-pal.so.40(mca_base_framework_close+0x7d)[0x7ffff788592d]
> /home/hpp/openmpi_3.0.0rc1_install/lib/openmpi/mca_ess_hnp.so(+0x3e4d)[0x7ffff5b04e4d]
> /home/hpp/openmpi_3.0.0rc1_install/lib/libopen-rte.so.40(orte_finalize+0x79)[0x7ffff7b43bf9]
> mpirun[0x4014f1]
> mpirun[0x401018]
> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7ffff653db15]
> mpirun[0x400f29]
>
> and another, like this:
>
> [hpp@hi-master dynamic (master *)]$mpirun -np 1 ./spawn_with_env_vars
> Spawning...
> Spawned
> Child got foo and baz env variables -- yay!
> *** Error in `mpirun': corrupted double-linked list: 0x00000000006eb350 ***
> ======= Backtrace: =========
> /usr/lib64/libc.so.6(+0x7b184)[0x7ffff6597184]
> /usr/lib64/libc.so.6(+0x7d1ec)[0x7ffff65991ec]
> /home/hpp/openmpi_3.0.0rc1_install/lib/openmpi/mca_oob_tcp.so(+0x57a2)[0x7ffff32297a2]
> /home/hpp/openmpi_3.0.0rc1_install/lib/openmpi/mca_oob_tcp.so(+0x5a87)[0x7ffff3229a87]
> /home/hpp/openmpi_3.0.0rc1_install/lib/libopen-rte.so.40(+0x6b032)[0x7ffff7b94032]
> /home/hpp/openmpi_3.0.0rc1_install/lib/libopen-pal.so.40(mca_base_framework_close+0x7d)[0x7ffff788592d]
> /home/hpp/openmpi_3.0.0rc1_install/lib/openmpi/mca_ess_hnp.so(+0x3e4d)[0x7ffff5b04e4d]
> /home/hpp/openmpi_3.0.0rc1_install/lib/libopen-rte.so.40(orte_finalize+0x79)[0x7ffff7b43bf9]
> mpirun[0x4014f1]
> mpirun[0x401018]
> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7ffff653db15]
> mpirun[0x400f29]
>
> It doesn't happen on every run though.
>
> I'll do some more investigating, but probably not till next week.
>
> Howard
>
>
> 2017-06-28 11:50 GMT-06:00 Barrett, Brian via devel <
> devel@lists.open-mpi.org>:
>
>> The first release candidate of Open MPI 3.0.0 is now available (
>> https://www.open-mpi.org/software/ompi/v3.0/).  We expect to have at
>> least one more release candidate, as there are still outstanding MPI-layer
>> issues to be resolved (particularly around one-sided).  We are posting
>> 3.0.0rc1 to get feedback on run-time stability, as one of the big features
>> of Open MPI 3.0 is the update to the PMIx 2 runtime environment.  We would
>> appreciate any and all testing you can do around run-time behaviors.
>>
>> Thank you,
>>
>> Brian & Howard
>> _______________________________________________
>> devel mailing list
>> devel@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>>
>