I tracked down a possible source of the oob/tcp error. I think this should 
address it: https://github.com/open-mpi/ompi/pull/3794
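
Since the crash is intermittent, one way to check whether the change helps 
is to rerun the test many times and count the failures. Below is a minimal 
sketch in Python. The helper name count_failures and the run count are only 
illustrative (they are not part of Open MPI or the IBM test suite), and the 
command in the comment is simply the one from the report quoted below.

```python
import subprocess
import sys


def count_failures(cmd, runs):
    """Run cmd `runs` times and return how many runs exited non-zero."""
    failures = 0
    for i in range(1, runs + 1):
        # A glibc abort such as "corrupted double-linked list" terminates
        # the process with SIGABRT, which subprocess reports as a negative
        # return code, so any non-zero code is counted as a failure.
        result = subprocess.run(cmd, stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL)
        if result.returncode != 0:
            failures += 1
            print(f"run {i}: exit code {result.returncode}", file=sys.stderr)
    return failures


# Hypothetical usage (assumes mpirun and the test binary are available):
#   count_failures(["mpirun", "-np", "1", "./spawn_with_env_vars"], 50)
```

A run count of zero failures before and a non-zero count after (or the 
reverse) is only suggestive for a timing-dependent bug, so a larger number 
of runs gives a more useful signal.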

> On Jun 29, 2017, at 3:14 PM, Howard Pritchard <hpprit...@gmail.com> wrote:
> 
> Hi Brian,
> 
> I tested this rc using both srun native launch and mpirun on the following 
> systems:
> - LANL CTS-1 systems (Haswell + Intel OPA/PSM2)
> - LANL network testbed system (Haswell + ConnectX-5/UCX and OB1)
> - LANL Cray XC
> 
> I am finding some problems with mpirun on the network testbed system.  
> 
> For example, for spawn_with_env_vars from IBM tests:
> 
> *** Error in `mpirun': corrupted double-linked list: 0x00000000006e75b0 ***
> ======= Backtrace: =========
> /usr/lib64/libc.so.6(+0x7bea2)[0x7ffff6597ea2]
> /usr/lib64/libc.so.6(+0x7cec6)[0x7ffff6598ec6]
> /home/hpp/openmpi_3.0.0rc1_install/lib/libopen-pal.so.40(opal_proc_table_remove_all+0x91)[0x7ffff7855851]
> /home/hpp/openmpi_3.0.0rc1_install/lib/openmpi/mca_oob_ud.so(+0x5e09)[0x7ffff3cc0e09]
> /home/hpp/openmpi_3.0.0rc1_install/lib/openmpi/mca_oob_ud.so(+0x5952)[0x7ffff3cc0952]
> /home/hpp/openmpi_3.0.0rc1_install/lib/libopen-rte.so.40(+0x6b032)[0x7ffff7b94032]
> /home/hpp/openmpi_3.0.0rc1_install/lib/libopen-pal.so.40(mca_base_framework_close+0x7d)[0x7ffff788592d]
> /home/hpp/openmpi_3.0.0rc1_install/lib/openmpi/mca_ess_hnp.so(+0x3e4d)[0x7ffff5b04e4d]
> /home/hpp/openmpi_3.0.0rc1_install/lib/libopen-rte.so.40(orte_finalize+0x79)[0x7ffff7b43bf9]
> mpirun[0x4014f1]
> mpirun[0x401018]
> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7ffff653db15]
> mpirun[0x400f29]
> 
> 
> And another, like this:
> 
> [hpp@hi-master dynamic (master *)]$mpirun -np 1 ./spawn_with_env_vars
> Spawning...
> Spawned
> Child got foo and baz env variables -- yay!
> *** Error in `mpirun': corrupted double-linked list: 0x00000000006eb350 ***
> ======= Backtrace: =========
> /usr/lib64/libc.so.6(+0x7b184)[0x7ffff6597184]
> /usr/lib64/libc.so.6(+0x7d1ec)[0x7ffff65991ec]
> /home/hpp/openmpi_3.0.0rc1_install/lib/openmpi/mca_oob_tcp.so(+0x57a2)[0x7ffff32297a2]
> /home/hpp/openmpi_3.0.0rc1_install/lib/openmpi/mca_oob_tcp.so(+0x5a87)[0x7ffff3229a87]
> /home/hpp/openmpi_3.0.0rc1_install/lib/libopen-rte.so.40(+0x6b032)[0x7ffff7b94032]
> /home/hpp/openmpi_3.0.0rc1_install/lib/libopen-pal.so.40(mca_base_framework_close+0x7d)[0x7ffff788592d]
> /home/hpp/openmpi_3.0.0rc1_install/lib/openmpi/mca_ess_hnp.so(+0x3e4d)[0x7ffff5b04e4d]
> /home/hpp/openmpi_3.0.0rc1_install/lib/libopen-rte.so.40(orte_finalize+0x79)[0x7ffff7b43bf9]
> mpirun[0x4014f1]
> mpirun[0x401018]
> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7ffff653db15]
> mpirun[0x400f29]
> 
> It doesn't happen on every run though.
> 
> I'll do some more investigating, but probably not till next week.
> 
> Howard
> 
> 
> 2017-06-28 11:50 GMT-06:00 Barrett, Brian via devel <devel@lists.open-mpi.org>:
> The first release candidate of Open MPI 3.0.0 is now available 
> (https://www.open-mpi.org/software/ompi/v3.0/).  We expect to have at least 
> one more release candidate, as there are still outstanding MPI-layer issues 
> to be resolved (particularly around one-sided).  We are posting 3.0.0rc1 to 
> get feedback on run-time stability, as one of the big features of Open MPI 
> 3.0 is the update to the PMIx 2 runtime environment.  We would appreciate any 
> and all testing you can do around run-time behaviors.
> 
> Thank you,
> 
> Brian & Howard
> _______________________________________________
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel