Sure! CMR it across to v1.7.4 and we'll add it before we release.

Thanks!
Ralph

On Jan 30, 2014, at 11:53 AM, Rolf vandeVaart <rvandeva...@nvidia.com> wrote:

> I ran mpirun through valgrind and got some strange complaints about an 
> issue with thread 2.  I hunted around the mpirun code and see that we start a 
> listener thread but never have it finish (join) during shutdown.  Therefore, I 
> added this snippet of code (probably in the wrong place) and I no longer see 
> my intermittent crashes.
> 
> Ralph, what do you think?  Does this seem reasonable?
> 
> Rolf
> 
> [rvandevaart@drossetti-ivy0 ompi-v1.7]$ svn diff
> Index: orte/mca/oob/tcp/oob_tcp_component.c
> ===================================================================
> --- orte/mca/oob/tcp/oob_tcp_component.c      (revision 30500)
> +++ orte/mca/oob/tcp/oob_tcp_component.c      (working copy)
> @@ -631,6 +631,10 @@
>     opal_output_verbose(2, orte_oob_base_framework.framework_output,
>                         "%s TCP SHUTDOWN",
>                         ORTE_NAME_PRINT(ORTE_PROC_MY_NAME));
> +    if (ORTE_PROC_IS_HNP) {
> +        mca_oob_tcp_component.listen_thread_active = 0;
> +        opal_thread_join(&mca_oob_tcp_component.listen_thread, NULL);
> +    }
> 
>     while (NULL != (item = opal_list_remove_first(&mca_oob_tcp_component.listeners))) {
>         OBJ_RELEASE(item);
> 
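> For reference, the ordering this relies on is just the usual stop-flag plus
> join shutdown pattern: clear the flag the thread is polling, then join it
> before tearing down anything it might still touch.  A minimal stand-alone
> sketch in plain pthreads (illustrative names only, not the actual oob_tcp
> listener):
> 
>     #include <pthread.h>
>     #include <unistd.h>
> 
>     static volatile int listen_thread_active = 1;
> 
>     static void *listen_loop(void *arg)
>     {
>         (void)arg;
>         while (listen_thread_active) {
>             /* the real listener would block in select()/accept() with a
>              * timeout so the flag gets re-checked periodically */
>             usleep(100000);
>         }
>         return NULL;   /* exits cleanly once asked to stop */
>     }
> 
>     int main(void)
>     {
>         pthread_t listener;
>         pthread_create(&listener, NULL, listen_loop, NULL);
>         sleep(1);                        /* ... normal operation ... */
> 
>         /* shutdown: clear the flag, then join BEFORE freeing any state the
>          * thread could still be using -- the same ordering as the snippet */
>         listen_thread_active = 0;
>         pthread_join(listener, NULL);
>         return 0;
>     }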
> 
>> -----Original Message-----
>> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Ralph
>> Castain
>> Sent: Thursday, January 30, 2014 12:35 PM
>> To: Open MPI Developers
>> Subject: Re: [OMPI devel] Intermittent mpirun crash?
>> 
>> That option might explain why your test process is failing (it segfaulted as
>> well), but it obviously wouldn't have anything to do with mpirun.
>> 
>> On Jan 30, 2014, at 9:29 AM, Rolf vandeVaart <rvandeva...@nvidia.com>
>> wrote:
>> 
>>> I just retested with --mca mpi_leave_pinned 0 and that made no difference.
>>> I still see the mpirun crash.
>>> 
>>>> -----Original Message-----
>>>> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of George
>>>> Bosilca
>>>> Sent: Thursday, January 30, 2014 11:59 AM
>>>> To: Open MPI Developers
>>>> Subject: Re: [OMPI devel] Intermittent mpirun crash?
>>>> 
>>>> I got something similar 2 days ago, with a large software package that
>>>> makes heavy use of MPI_Waitany/MPI_Waitsome (and that was working
>>>> seamlessly a month ago).  I needed a quick fix, so once I figured out that
>>>> turning leave_pinned off makes the problem go away, I did not investigate
>>>> any further.
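>>>> 
>>>> For what it is worth, that is just the mpi_leave_pinned MCA parameter, so
>>>> something along these lines should turn it off (adjust the rest of the
>>>> command line to your own setup):
>>>> 
>>>>     # on the command line
>>>>     mpirun --mca mpi_leave_pinned 0 -np 2 ./my_app
>>>>     # or, equivalently, via the environment
>>>>     export OMPI_MCA_mpi_leave_pinned=0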
>>>> 
>>>> Do you see a similar behavior?
>>>> 
>>>> George.
>>>> 
>>>> On Jan 30, 2014, at 17:26, Rolf vandeVaart <rvandeva...@nvidia.com> wrote:
>>>> 
>>>>> I am seeing this happen very intermittently.  It looks like mpirun is
>>>>> getting a SEGV.  Is anyone else seeing this?
>>>>> This is 1.7.4 built yesterday.  (Note that I added some things to what is
>>>>> being printed, so the message is slightly different from the 1.7.4 output.)
>>>>> 
>>>>> mpirun -np 6 -host drossetti-ivy0,drossetti-ivy1,drossetti-ivy2,drossetti-ivy3 --mca btl_openib_warn_default_gid_prefix 0 -- `pwd`/src/MPI_Waitsome_p_c
>>>>> MPITEST info  (0): Starting:  MPI_Waitsome_p:  Persistent Waitsome using two nodes
>>>>> MPITEST_results: MPI_Waitsome_p:  Persistent Waitsome using two nodes all tests PASSED (742)
>>>>> [drossetti-ivy0:10353] *** Process (mpirun) received signal ***
>>>>> [drossetti-ivy0:10353] Signal: Segmentation fault (11)
>>>>> [drossetti-ivy0:10353] Signal code: Address not mapped (1)
>>>>> [drossetti-ivy0:10353] Failing at address: 0x7fd31e5f208d
>>>>> [drossetti-ivy0:10353] End of signal information - not sleeping
>>>>> gmake[1]: *** [MPI_Waitsome_p_c] Segmentation fault (core dumped)
>>>>> gmake[1]: Leaving directory `/geppetto/home/rvandevaart/public/ompi-tests/trunk/intel_tests'
>>>>> 
>>>>> (gdb) where
>>>>> #0  0x00007fd31f620807 in ?? () from /lib64/libgcc_s.so.1
>>>>> #1  0x00007fd31f6210b9 in _Unwind_Backtrace () from /lib64/libgcc_s.so.1
>>>>> #2  0x00007fd31fb2893e in backtrace () from /lib64/libc.so.6
>>>>> #3  0x00007fd320b0d622 in opal_backtrace_buffer (message_out=0x7fd31e5e33a0, len_out=0x7fd31e5e33ac) at ../../../../../opal/mca/backtrace/execinfo/backtrace_execinfo.c:57
>>>>> #4  0x00007fd320b0a794 in show_stackframe (signo=11, info=0x7fd31e5e3930, p=0x7fd31e5e3800) at ../../../opal/util/stacktrace.c:354
>>>>> #5  <signal handler called>
>>>>> #6  0x00007fd31e5f208d in ?? ()
>>>>> #7  0x00007fd31e5e46d8 in ?? ()
>>>>> #8  0x000000000000c2a8 in ?? ()
>>>>> #9  0x0000000000000000 in ?? ()
>>>>> 
>>>>> 
>>>> 
>> 
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
