Yes - this is the fix for that issue

On Thu, May 14, 2015 at 8:54 PM, Howard Pritchard <hpprit...@gmail.com>
wrote:

> Is this by any chance associated with issue 579?
>
>
> 2015-05-14 20:49 GMT-06:00 Ralph Castain <r...@open-mpi.org>:
>
>> I'll look at the lines you cite, but that clearly isn't the problem we
>> are seeing here. I can verify that because the test case:
>>
>> mpirun -n 1 sleep 1000
>>
>> does not open up any connections at all. Thus, the use-case you describe
>> never occurs - yet we still blow up in memory. If I simply tell the OOB not
>> to set keep alive, the problem goes away.
>>
>> It only happens on the Mac, and we never see Mac-based clusters, so turning
>> off keepalive on the Mac seems a pretty simple solution.
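
For readers unfamiliar with the option being discussed, here is a minimal sketch of enabling TCP keepalive on a socket while compiling the feature out on Apple platforms. It is not the Open MPI code: the function name `enable_keepalive` and its return convention are hypothetical, and it tests the compiler's `__APPLE__` macro where the commit quoted further down uses a configure-time `OPAL_HAVE_MAC` define.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/*
 * Hypothetical helper, not part of Open MPI.
 * Returns 1 if keepalive was enabled, 0 if it was skipped
 * (disabled by the caller or compiled out), and -1 on error.
 */
static int enable_keepalive(int sd, int idle_seconds)
{
#if defined(SO_KEEPALIVE) && !defined(__APPLE__)
    int on = 1;

    /* Same convention as the keepalive_time parameter: <= 0 disables. */
    if (idle_seconds <= 0) {
        return 0;
    }
    if (setsockopt(sd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0) {
        return -1;
    }
#if defined(TCP_KEEPIDLE)
    /* Idle time before the first probe is sent, where the OS supports it. */
    if (setsockopt(sd, IPPROTO_TCP, TCP_KEEPIDLE,
                   &idle_seconds, sizeof(idle_seconds)) < 0) {
        return -1;
    }
#endif
    return 1;
#else
    /* Keepalive is unavailable or deliberately compiled out here. */
    (void)sd;
    (void)idle_seconds;
    return 0;
#endif
}
```

A caller would invoke `enable_keepalive(sd, 10)` right after creating or accepting a socket and treat a return of 0 as "feature not in use" rather than as a failure.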
>>
>>
>> On Thu, May 14, 2015 at 8:43 PM, George Bosilca <bosi...@icl.utk.edu>
>> wrote:
>>
>>> Ralph,
>>>
>>> The code pushed in g8e30579 is clearly not the right solution.
>>>
>>> The problem starts in oob_tcp_listener.c line 742. A new
>>> mca_oob_tcp_pending_connection_t object is allocated to store the incoming
>>> connection. The accept() a few lines below fails with an error code of 0x23,
>>> which means "resource temporarily unavailable" on OS X (i.e. EAGAIN). Thus,
>>> the if at line 750 is skipped, and we reach line 763 (a "continue") with 1)
>>> a connection not accepted, and 2) an allocated object not released. Voila!
>>>
>>> Freeing the pending_connection object is not the right approach either,
>>> as it will only remove the memory leak but the process will become a CPU
>>> hog.
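
The pattern described above can be illustrated with a small, self-contained sketch. This is not the Open MPI listener and not a proposed patch: `pending_connection_t`, `drain_accepts`, `discard_connection`, and `open_nonblocking_listener` are hypothetical names. The sketch assumes a non-blocking listening socket, allocates the per-connection object only after accept() succeeds (so a failed accept() cannot leak it), and returns to the caller on EAGAIN instead of retrying in a tight loop.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for mca_oob_tcp_pending_connection_t. */
typedef struct {
    int fd;
    struct sockaddr_storage addr;
} pending_connection_t;

/*
 * Accept every connection currently queued on a non-blocking listener.
 * Returns the number accepted, or -1 on a hard error. Each accepted
 * connection is handed to on_accept(), which takes ownership of it.
 */
static int drain_accepts(int listen_fd,
                         void (*on_accept)(pending_connection_t *))
{
    int accepted = 0;

    assert(NULL != on_accept);
    for (;;) {
        struct sockaddr_storage addr;
        socklen_t len = sizeof(addr);
        int fd = accept(listen_fd, (struct sockaddr *)&addr, &len);

        if (fd < 0) {
            if (EINTR == errno) {
                continue;               /* interrupted: retry */
            }
            if (EAGAIN == errno || EWOULDBLOCK == errno) {
                return accepted;        /* queue empty: wait for readiness */
            }
            return -1;                  /* hard error */
        }

        /* Allocate only once there is a connection to store. */
        pending_connection_t *pending = malloc(sizeof(*pending));
        if (NULL == pending) {
            close(fd);
            return -1;
        }
        pending->fd = fd;
        pending->addr = addr;
        on_accept(pending);
        accepted++;
    }
}

/* Example consumer: closes the socket and frees the object. */
static void discard_connection(pending_connection_t *pending)
{
    close(pending->fd);
    free(pending);
}

/* Opens a non-blocking listener on an ephemeral loopback port. */
static int open_nonblocking_listener(void)
{
    struct sockaddr_in sin;
    int flags;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0) {
        return -1;
    }
    memset(&sin, 0, sizeof(sin));
    sin.sin_family = AF_INET;
    sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    sin.sin_port = 0;                   /* let the kernel pick a port */
    flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0 ||
        fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ||
        bind(fd, (struct sockaddr *)&sin, sizeof(sin)) < 0 ||
        listen(fd, 8) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

In an event-driven design the caller would invoke `drain_accepts()` only when select/poll or the event library reports the listening socket as readable, so an EAGAIN return simply hands control back to the event loop and neither the leak nor the busy loop can occur.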
>>>
>>>   Thanks,
>>>     George.
>>>
>>>
>>>
>>>
>>> On Thu, May 14, 2015 at 8:10 PM, <git...@crest.iu.edu> wrote:
>>>
>>>> This is an automated email from the git hooks/post-receive script. It was
>>>> generated because a ref change was pushed to the repository containing
>>>> the project "open-mpi/ompi".
>>>>
>>>> The branch, master has been updated
>>>>        via  8e30579e6efab580cf9cf1bec8f8df1376b7e9ef (commit)
>>>>       from  1488e82efd1d09c30ba46dfa00b89e623623272f (commit)
>>>>
>>>> Those revisions listed above that are new to this repository have
>>>> not appeared on any other notification email; so we list those
>>>> revisions in full, below.
>>>>
>>>> - Log -----------------------------------------------------------------
>>>>
>>>> https://github.com/open-mpi/ompi/commit/8e30579e6efab580cf9cf1bec8f8df1376b7e9ef
>>>>
>>>> commit 8e30579e6efab580cf9cf1bec8f8df1376b7e9ef
>>>> Author: Ralph Castain <r...@open-mpi.org>
>>>> Date:   Thu May 14 18:09:13 2015 -0600
>>>>
>>>>     The Mac appears to have problems with the keepalive support - once
>>>>     keepalive starts, the memory footprint soars. So disable keepalive on the Mac
>>>>
>>>> diff --git a/config/opal_check_os_flavors.m4 b/config/opal_check_os_flavors.m4
>>>> index d1d124d..4939560 100644
>>>> --- a/config/opal_check_os_flavors.m4
>>>> +++ b/config/opal_check_os_flavors.m4
>>>> @@ -57,6 +57,12 @@ AC_DEFUN([OPAL_CHECK_OS_FLAVORS],
>>>>                         [$opal_have_solaris],
>>>>                         [Whether or not we have solaris])
>>>>
>>>> +    AS_IF([test "$opal_found_apple" = "yes"],
>>>> +          [opal_have_mac=1], [opal_have_mac=0])
>>>> +    AC_DEFINE_UNQUOTED([OPAL_HAVE_MAC],
>>>> +                       [$opal_have_mac],
>>>> +                       [Whether or not we are on a Mac])
>>>> +
>>>>      # check for sockaddr_in (a good sign we have TCP)
>>>>      AC_CHECK_HEADERS([netdb.h netinet/in.h netinet/tcp.h])
>>>>      AC_CHECK_TYPES([struct sockaddr_in],
>>>> diff --git a/orte/mca/oob/tcp/oob_tcp_common.c b/orte/mca/oob/tcp/oob_tcp_common.c
>>>> index a768472..e3decf2 100644
>>>> --- a/orte/mca/oob/tcp/oob_tcp_common.c
>>>> +++ b/orte/mca/oob/tcp/oob_tcp_common.c
>>>> @@ -72,7 +72,7 @@
>>>>  /**
>>>>   * Set socket buffering
>>>>   */
>>>> -
>>>> +#if defined(SO_KEEPALIVE) && !OPAL_HAVE_MAC
>>>>  static void set_keepalive(int sd)
>>>>  {
>>>>      int option;
>>>> @@ -146,6 +146,7 @@ static void set_keepalive(int sd)
>>>>      }
>>>>  #endif  // TCP_KEEPCNT
>>>>  }
>>>> +#endif //SO_KEEPALIVE
>>>>
>>>>  void orte_oob_tcp_set_socket_options(int sd)
>>>>  {
>>>> @@ -181,7 +182,7 @@ void orte_oob_tcp_set_socket_options(int sd)
>>>>                              opal_socket_errno);
>>>>      }
>>>>  #endif
>>>> -#if defined(SO_KEEPALIVE)
>>>> +#if defined(SO_KEEPALIVE) && !OPAL_HAVE_MAC
>>>>      if (0 < mca_oob_tcp_component.keepalive_time) {
>>>>          set_keepalive(sd);
>>>>      }
>>>> diff --git a/orte/mca/oob/tcp/oob_tcp_component.c b/orte/mca/oob/tcp/oob_tcp_component.c
>>>> index dd1af2a..372ed4c 100644
>>>> --- a/orte/mca/oob/tcp/oob_tcp_component.c
>>>> +++ b/orte/mca/oob/tcp/oob_tcp_component.c
>>>> @@ -404,7 +404,7 @@ static int tcp_component_register(void)
>>>>
>>>>  &mca_oob_tcp_component.disable_ipv6_family);
>>>>  #endif
>>>>
>>>> -
>>>> +#if !OPAL_HAVE_MAC
>>>>      mca_oob_tcp_component.keepalive_time = 10;
>>>>      (void)mca_base_component_var_register(component, "keepalive_time",
>>>>                                            "Idle time in seconds before
>>>> starting to send keepalives (num <= 0 ----> disable keepalive)",
>>>> @@ -427,7 +427,8 @@ static int tcp_component_register(void)
>>>>                                            OPAL_INFO_LVL_9,
>>>>                                            MCA_BASE_VAR_SCOPE_READONLY,
>>>>
>>>>  &mca_oob_tcp_component.keepalive_probes);
>>>> -
>>>> +#endif
>>>> +
>>>>      mca_oob_tcp_component.retry_delay = 0;
>>>>      (void)mca_base_component_var_register(component, "retry_delay",
>>>>                                            "Time (in sec) to wait
>>>> before trying to connect to peer again",
>>>>
>>>>
>>>> -----------------------------------------------------------------------
>>>>
>>>> Summary of changes:
>>>>  config/opal_check_os_flavors.m4      | 6 ++++++
>>>>  orte/mca/oob/tcp/oob_tcp_common.c    | 5 +++--
>>>>  orte/mca/oob/tcp/oob_tcp_component.c | 5 +++--
>>>>  3 files changed, 12 insertions(+), 4 deletions(-)
>>>>
>>>>
>>>> hooks/post-receive
>>>> --
>>>> open-mpi/ompi
>>>> _______________________________________________
>>>> ompi-commits mailing list
>>>> ompi-comm...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/ompi-commits
>>>>
>>>
>>>
>>> _______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/devel/2015/05/17401.php
>>>
>>
>>
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2015/05/17402.php
>>
>
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2015/05/17403.php
>
