I ran mpirun under valgrind and got some strange complaints about an issue with thread 2. I hunted around the mpirun code and saw that we start a listen thread but never have it finish during shutdown. So I added the snippet of code in the diff below (probably in the wrong place), and I no longer see my intermittent crashes.
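For anyone not familiar with the oob/tcp internals, the shape of the race and of the fix is roughly the sketch below. This is just an illustrative sketch in plain C11/pthreads with made-up names (listener_state_t, listen_loop, etc.), not the actual OMPI code; the real patch uses opal_thread_join() and the component's listen_thread_active flag, as in the diff that follows. The idea is that shutdown clears the flag the listen loop polls, joins the thread, and only then frees the state the thread was using. Without the join, the listen thread can still be running while that state is torn down, which would explain an intermittent segfault in mpirun at shutdown.

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical names, for illustration only. */
typedef struct {
    atomic_bool active;     /* cleared by shutdown to ask the thread to exit */
    pthread_t   thread;
    void       *listeners;  /* state the listen loop touches */
} listener_state_t;

static void *listen_loop(void *arg)
{
    listener_state_t *st = (listener_state_t *)arg;
    while (atomic_load(&st->active)) {
        /* poll()/accept() on the listening sockets with a short timeout */
    }
    return NULL;
}

static int listener_start(listener_state_t *st)
{
    atomic_store(&st->active, true);
    return pthread_create(&st->thread, NULL, listen_loop, st);
}

static void listener_shutdown(listener_state_t *st)
{
    atomic_store(&st->active, false);   /* ask the thread to stop ...        */
    pthread_join(st->thread, NULL);     /* ... and wait until it actually has */
    free(st->listeners);                /* only now is it safe to tear down   */
    st->listeners = NULL;
}

Assuming the real listen loop wakes up periodically (i.e., it doesn't block indefinitely in accept()), clearing the flag is enough for it to notice and exit, so the join should return promptly.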
Ralph, what do you think? Does this seem reasonable?

Rolf

[rvandevaart@drossetti-ivy0 ompi-v1.7]$ svn diff
Index: orte/mca/oob/tcp/oob_tcp_component.c
===================================================================
--- orte/mca/oob/tcp/oob_tcp_component.c        (revision 30500)
+++ orte/mca/oob/tcp/oob_tcp_component.c        (working copy)
@@ -631,6 +631,10 @@
     opal_output_verbose(2, orte_oob_base_framework.framework_output,
                         "%s TCP SHUTDOWN",
                         ORTE_NAME_PRINT(ORTE_PROC_MY_NAME));
+    if (ORTE_PROC_IS_HNP) {
+        mca_oob_tcp_component.listen_thread_active = 0;
+        opal_thread_join(&mca_oob_tcp_component.listen_thread, NULL);
+    }
     while (NULL != (item = opal_list_remove_first(&mca_oob_tcp_component.listeners))) {
         OBJ_RELEASE(item);

>-----Original Message-----
>From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Ralph Castain
>Sent: Thursday, January 30, 2014 12:35 PM
>To: Open MPI Developers
>Subject: Re: [OMPI devel] Intermittent mpirun crash?
>
>That option might explain why your test process is failing (which segfaulted
>as well), but obviously wouldn't have anything to do with mpirun
>
>On Jan 30, 2014, at 9:29 AM, Rolf vandeVaart <rvandeva...@nvidia.com> wrote:
>
>> I just retested with --mca mpi_leave_pinned 0 and that made no difference.
>> I still see the mpirun crash.
>>
>>> -----Original Message-----
>>> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of George Bosilca
>>> Sent: Thursday, January 30, 2014 11:59 AM
>>> To: Open MPI Developers
>>> Subject: Re: [OMPI devel] Intermittent mpirun crash?
>>>
>>> I got something similar 2 days ago, with a large software package
>>> abusing MPI_Waitany/MPI_Waitsome (it was working seamlessly a month
>>> ago). I had to find a quick fix. Upon figuring out that turning
>>> leave_pinned off fixes the problem, I did not investigate any further.
>>>
>>> Do you see a similar behavior?
>>>
>>> George.
>>>
>>> On Jan 30, 2014, at 17:26 , Rolf vandeVaart <rvandeva...@nvidia.com> wrote:
>>>
>>>> I am seeing this happening to me very intermittently. Looks like
>>>> mpirun is getting a SEGV. Is anyone else seeing this?
>>>> This is 1.7.4 built yesterday. (Note that I added some stuff to what
>>>> is being printed out, so the message is slightly different from the
>>>> 1.7.4 output.)
>>>>
>>>> mpirun -np 6 -host
>>>> drossetti-ivy0,drossetti-ivy1,drossetti-ivy2,drossetti-ivy3 --mca
>>>> btl_openib_warn_default_gid_prefix 0 -- `pwd`/src/MPI_Waitsome_p_c
>>>> MPITEST info (0): Starting: MPI_Waitsome_p: Persistent Waitsome using two nodes
>>>> MPITEST_results: MPI_Waitsome_p: Persistent Waitsome using two nodes all tests PASSED (742)
>>>> [drossetti-ivy0:10353] *** Process (mpirun) received signal ***
>>>> [drossetti-ivy0:10353] Signal: Segmentation fault (11)
>>>> [drossetti-ivy0:10353] Signal code: Address not mapped (1)
>>>> [drossetti-ivy0:10353] Failing at address: 0x7fd31e5f208d
>>>> [drossetti-ivy0:10353] End of signal information - not sleeping
>>>> gmake[1]: *** [MPI_Waitsome_p_c] Segmentation fault (core dumped)
>>>> gmake[1]: Leaving directory `/geppetto/home/rvandevaart/public/ompi-tests/trunk/intel_tests'
>>>>
>>>> (gdb) where
>>>> #0  0x00007fd31f620807 in ?? () from /lib64/libgcc_s.so.1
>>>> #1  0x00007fd31f6210b9 in _Unwind_Backtrace () from /lib64/libgcc_s.so.1
>>>> #2  0x00007fd31fb2893e in backtrace () from /lib64/libc.so.6
>>>> #3  0x00007fd320b0d622 in opal_backtrace_buffer (message_out=0x7fd31e5e33a0, len_out=0x7fd31e5e33ac)
>>>>     at ../../../../../opal/mca/backtrace/execinfo/backtrace_execinfo.c:57
>>>> #4  0x00007fd320b0a794 in show_stackframe (signo=11, info=0x7fd31e5e3930, p=0x7fd31e5e3800)
>>>>     at ../../../opal/util/stacktrace.c:354
>>>> #5  <signal handler called>
>>>> #6  0x00007fd31e5f208d in ?? ()
>>>> #7  0x00007fd31e5e46d8 in ?? ()
>>>> #8  0x000000000000c2a8 in ?? ()
>>>> #9  0x0000000000000000 in ?? ()