Brian,

Things look much better with this patch. We need it for the 3.0.0 release. The patch from PR 3794 applied cleanly from master.
Howard

2017-06-29 16:51 GMT-06:00 r...@open-mpi.org <r...@open-mpi.org>:

> I tracked down a possible source of the oob/tcp error - this should
> address it, I think: https://github.com/open-mpi/ompi/pull/3794
>
> On Jun 29, 2017, at 3:14 PM, Howard Pritchard <hpprit...@gmail.com> wrote:
>
> Hi Brian,
>
> I tested this rc using both srun native launch and mpirun on the following
> systems:
> - LANL CTS-1 systems (haswell + Intel OPA/PSM2)
> - LANL network testbed system (haswell + connectX5/UCX and OB1)
> - LANL Cray XC
>
> I am finding some problems with mpirun on the network testbed system.
>
> For example, for spawn_with_env_vars from IBM tests:
>
> *** Error in `mpirun': corrupted double-linked list: 0x00000000006e75b0 ***
> ======= Backtrace: =========
> /usr/lib64/libc.so.6(+0x7bea2)[0x7ffff6597ea2]
> /usr/lib64/libc.so.6(+0x7cec6)[0x7ffff6598ec6]
> /home/hpp/openmpi_3.0.0rc1_install/lib/libopen-pal.so.40(opal_proc_table_remove_all+0x91)[0x7ffff7855851]
> /home/hpp/openmpi_3.0.0rc1_install/lib/openmpi/mca_oob_ud.so(+0x5e09)[0x7ffff3cc0e09]
> /home/hpp/openmpi_3.0.0rc1_install/lib/openmpi/mca_oob_ud.so(+0x5952)[0x7ffff3cc0952]
> /home/hpp/openmpi_3.0.0rc1_install/lib/libopen-rte.so.40(+0x6b032)[0x7ffff7b94032]
> /home/hpp/openmpi_3.0.0rc1_install/lib/libopen-pal.so.40(mca_base_framework_close+0x7d)[0x7ffff788592d]
> /home/hpp/openmpi_3.0.0rc1_install/lib/openmpi/mca_ess_hnp.so(+0x3e4d)[0x7ffff5b04e4d]
> /home/hpp/openmpi_3.0.0rc1_install/lib/libopen-rte.so.40(orte_finalize+0x79)[0x7ffff7b43bf9]
> mpirun[0x4014f1]
> mpirun[0x401018]
> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7ffff653db15]
> mpirun[0x400f29]
>
> and another like
>
> [hpp@hi-master dynamic (master *)]$mpirun -np 1 ./spawn_with_env_vars
> Spawning...
> Spawned
> Child got foo and baz env variables -- yay!
>
> *** Error in `mpirun': corrupted double-linked list: 0x00000000006eb350 ***
> ======= Backtrace: =========
> /usr/lib64/libc.so.6(+0x7b184)[0x7ffff6597184]
> /usr/lib64/libc.so.6(+0x7d1ec)[0x7ffff65991ec]
> /home/hpp/openmpi_3.0.0rc1_install/lib/openmpi/mca_oob_tcp.so(+0x57a2)[0x7ffff32297a2]
> /home/hpp/openmpi_3.0.0rc1_install/lib/openmpi/mca_oob_tcp.so(+0x5a87)[0x7ffff3229a87]
> /home/hpp/openmpi_3.0.0rc1_install/lib/libopen-rte.so.40(+0x6b032)[0x7ffff7b94032]
> /home/hpp/openmpi_3.0.0rc1_install/lib/libopen-pal.so.40(mca_base_framework_close+0x7d)[0x7ffff788592d]
> /home/hpp/openmpi_3.0.0rc1_install/lib/openmpi/mca_ess_hnp.so(+0x3e4d)[0x7ffff5b04e4d]
> /home/hpp/openmpi_3.0.0rc1_install/lib/libopen-rte.so.40(orte_finalize+0x79)[0x7ffff7b43bf9]
> mpirun[0x4014f1]
> mpirun[0x401018]
> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7ffff653db15]
> mpirun[0x400f29]
>
> It doesn't happen on every run though.
>
> I'll do some more investigating, but probably not till next week.
>
> Howard
>
>
> 2017-06-28 11:50 GMT-06:00 Barrett, Brian via devel <devel@lists.open-mpi.org>:
>
>> The first release candidate of Open MPI 3.0.0 is now available
>> (https://www.open-mpi.org/software/ompi/v3.0/). We expect to have at
>> least one more release candidate, as there are still outstanding MPI-layer
>> issues to be resolved (particularly around one-sided). We are posting
>> 3.0.0rc1 to get feedback on run-time stability, as one of the big features
>> of Open MPI 3.0 is the update to the PMIx 2 runtime environment. We would
>> appreciate any and all testing you can do, around run-time behaviors.
>>
>> Thank you,
>>
>> Brian & Howard
>> _______________________________________________
>> devel mailing list
>> devel@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel