Sure! CMR it across to v1.7.4 and we'll add it before we release.

Thanks!
Ralph
On Jan 30, 2014, at 11:53 AM, Rolf vandeVaart <rvandeva...@nvidia.com> wrote:

> I ran mpirun through valgrind and got some strange complaints about an
> issue with thread 2. I hunted around the mpirun code and see that we start a
> thread but never have it finish during shutdown. Therefore, I added this
> snippet of code (probably in the wrong place) and I no longer see my
> intermittent crashes.
>
> Ralph, what do you think? Does this seem reasonable?
>
> Rolf
>
> [rvandevaart@drossetti-ivy0 ompi-v1.7]$ svn diff
> Index: orte/mca/oob/tcp/oob_tcp_component.c
> ===================================================================
> --- orte/mca/oob/tcp/oob_tcp_component.c    (revision 30500)
> +++ orte/mca/oob/tcp/oob_tcp_component.c    (working copy)
> @@ -631,6 +631,10 @@
>      opal_output_verbose(2, orte_oob_base_framework.framework_output,
>                          "%s TCP SHUTDOWN",
>                          ORTE_NAME_PRINT(ORTE_PROC_MY_NAME));
> +    if (ORTE_PROC_IS_HNP) {
> +        mca_oob_tcp_component.listen_thread_active = 0;
> +        opal_thread_join(&mca_oob_tcp_component.listen_thread, NULL);
> +    }
>
>      while (NULL != (item = opal_list_remove_first(&mca_oob_tcp_component.listeners))) {
>          OBJ_RELEASE(item);
>
>
>> -----Original Message-----
>> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Ralph Castain
>> Sent: Thursday, January 30, 2014 12:35 PM
>> To: Open MPI Developers
>> Subject: Re: [OMPI devel] Intermittent mpirun crash?
>>
>> That option might explain why your test process is failing (it segfaulted as
>> well), but it obviously wouldn't have anything to do with mpirun.
>>
>> On Jan 30, 2014, at 9:29 AM, Rolf vandeVaart <rvandeva...@nvidia.com> wrote:
>>
>>> I just retested with --mca mpi_leave_pinned 0 and that made no difference.
>>> I still see the mpirun crash.
>>>
>>>> -----Original Message-----
>>>> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of George Bosilca
>>>> Sent: Thursday, January 30, 2014 11:59 AM
>>>> To: Open MPI Developers
>>>> Subject: Re: [OMPI devel] Intermittent mpirun crash?
>>>>
>>>> I got something similar two days ago with a large software package that
>>>> abuses MPI_Waitany/MPI_Waitsome (and that was working seamlessly a month
>>>> ago). I had to find a quick fix: once I figured out that turning
>>>> leave_pinned off fixes the problem, I did not investigate any further.
>>>>
>>>> Do you see a similar behavior?
>>>>
>>>> George.
>>>>
>>>> On Jan 30, 2014, at 17:26 , Rolf vandeVaart <rvandeva...@nvidia.com> wrote:
>>>>
>>>>> I am seeing this happen very intermittently. It looks like mpirun is
>>>>> getting a SEGV. Is anyone else seeing this?
>>>>> This is 1.7.4 built yesterday.
>>>>> (Note that I added some stuff to what is being printed out, so the
>>>>> message is slightly different than the 1.7.4 output.)
>>>>>
>>>>> mpirun -np 6 -host drossetti-ivy0,drossetti-ivy1,drossetti-ivy2,drossetti-ivy3 --mca btl_openib_warn_default_gid_prefix 0 -- `pwd`/src/MPI_Waitsome_p_c
>>>>> MPITEST info (0): Starting: MPI_Waitsome_p: Persistent Waitsome using two nodes
>>>>> MPITEST_results: MPI_Waitsome_p: Persistent Waitsome using two nodes all tests PASSED (742)
>>>>> [drossetti-ivy0:10353] *** Process (mpirun) received signal ***
>>>>> [drossetti-ivy0:10353] Signal: Segmentation fault (11)
>>>>> [drossetti-ivy0:10353] Signal code: Address not mapped (1)
>>>>> [drossetti-ivy0:10353] Failing at address: 0x7fd31e5f208d
>>>>> [drossetti-ivy0:10353] End of signal information - not sleeping
>>>>> gmake[1]: *** [MPI_Waitsome_p_c] Segmentation fault (core dumped)
>>>>> gmake[1]: Leaving directory `/geppetto/home/rvandevaart/public/ompi-tests/trunk/intel_tests'
>>>>>
>>>>> (gdb) where
>>>>> #0  0x00007fd31f620807 in ?? () from /lib64/libgcc_s.so.1
>>>>> #1  0x00007fd31f6210b9 in _Unwind_Backtrace () from /lib64/libgcc_s.so.1
>>>>> #2  0x00007fd31fb2893e in backtrace () from /lib64/libc.so.6
>>>>> #3  0x00007fd320b0d622 in opal_backtrace_buffer (message_out=0x7fd31e5e33a0, len_out=0x7fd31e5e33ac)
>>>>>     at ../../../../../opal/mca/backtrace/execinfo/backtrace_execinfo.c:57
>>>>> #4  0x00007fd320b0a794 in show_stackframe (signo=11, info=0x7fd31e5e3930, p=0x7fd31e5e3800)
>>>>>     at ../../../opal/util/stacktrace.c:354
>>>>> #5  <signal handler called>
>>>>> #6  0x00007fd31e5f208d in ?? ()
>>>>> #7  0x00007fd31e5e46d8 in ?? ()
>>>>> #8  0x000000000000c2a8 in ?? ()
>>>>> #9  0x0000000000000000 in ?? ()
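
[Editor's note] For readers skimming the archive, here is a minimal standalone sketch of the shutdown pattern Rolf's patch applies: clear a stop flag and then join the listener thread before tearing down the state it uses. It uses plain pthreads rather than OMPI's opal_thread API, and every name in it (listener_main, listener_active, and so on) is illustrative, not taken from the OMPI source. Compile with: cc -std=c11 -pthread listener_join_sketch.c

/*
 * Standalone illustration of the stop-flag-plus-join shutdown pattern.
 * Not Open MPI code; all names are made up for this sketch.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

static atomic_bool listener_active = true;
static pthread_t   listener_thread;

/* Listener loop: keeps polling for work until the active flag is cleared. */
static void *listener_main(void *arg)
{
    (void)arg;
    while (atomic_load(&listener_active)) {
        /* In mpirun this would block in accept()/select() on the listen
         * socket; a short sleep keeps the sketch self-contained. */
        usleep(10 * 1000);
    }
    return NULL;
}

int main(void)
{
    if (pthread_create(&listener_thread, NULL, listener_main, NULL) != 0) {
        perror("pthread_create");
        return 1;
    }

    sleep(1);   /* stand-in for the normal lifetime of the job */

    /* Shutdown: tell the listener to stop, then wait for it to actually
     * exit before releasing any state it might still be touching.  Without
     * the join, teardown can race with the still-running thread, which is
     * the kind of intermittent crash reported in this thread. */
    atomic_store(&listener_active, false);
    pthread_join(listener_thread, NULL);

    puts("listener joined; safe to release listener state");
    return 0;
}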