Interesting - as I said, I'll take a look. In either case, the keep alive
on the Mac is unnecessary as it is always a standalone scenario - no value
in running it. So the "fix" does no harm and just saves some useless
overhead.


On Thu, May 14, 2015 at 9:00 PM, George Bosilca <bosi...@icl.utk.edu> wrote:

> I'm sorry, Ralph, but what you proposed is not really a fix. My comment is
> based on a real execution of exactly the command you provided, with lldb
> attached to the process. What I see is millions of
> OBJ_NEW(mca_oob_tcp_pending_connection_t) calls, because the EAGAIN is not
> handled correctly.
>
>   George.
>
>
> On Thu, May 14, 2015 at 10:56 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> Yes - this is the fix for that issue
>>
>>
>> On Thu, May 14, 2015 at 8:54 PM, Howard Pritchard <hpprit...@gmail.com>
>> wrote:
>>
>>> Is this by any chance associated with issue 579?
>>>
>>>
>>> 2015-05-14 20:49 GMT-06:00 Ralph Castain <r...@open-mpi.org>:
>>>
>>>> I'll look at the lines you cite, but that clearly isn't the problem we
>>>> are seeing here. I can verify that because the test case:
>>>>
>>>> mpirun -n 1 sleep 1000
>>>>
>>>> does not open up any connections at all. Thus, the use-case you
>>>> describe never occurs - yet we still blow up in memory. If I simply tell
>>>> the OOB not to set keep alive, the problem goes away.
>>>>
>>>> It only happens on Mac, and we never see Mac based clusters, so turning
>>>> off keep alive on the Mac seems a pretty simple solution.
>>>>
>>>>
>>>> On Thu, May 14, 2015 at 8:43 PM, George Bosilca <bosi...@icl.utk.edu>
>>>> wrote:
>>>>
>>>>> Ralph,
>>>>>
>>>>> The code pushed in g8e30579 is clearly not the right solution.
>>>>>
>>>>> The problem starts in oob_tcp_listener.c line 742. A new
>>>>> mca_oob_tcp_pending_connection_t object is allocated to store the incoming
>>>>> connection. The accept a few lines below fails with an error code of 0x23,
>>>>> which means "resource temporarily unavailable" on OS X (i.e. EAGAIN). Thus,
>>>>> the if at line 750 is skipped, and we reach line 763 (a "continue") with
>>>>> 1) a connection not accepted, and 2) an allocated object not released.
>>>>> Voila!
>>>>>
>>>>> Freeing the pending_connection object is not the right approach
>>>>> either: it would remove the memory leak, but the process would still
>>>>> spin on the failing accept and become a CPU hog.
>>>>>
>>>>>   Thanks,
>>>>>     George.
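A minimal sketch of the pattern described above, not the actual Open MPI code. It assumes the listening socket is non-blocking and that an event loop calls the function only when the socket is reported readable. The names pending_conn_t and drain_accept_queue are hypothetical stand-ins for mca_oob_tcp_pending_connection_t and the listener loop. The record is allocated only after accept() has returned a descriptor, so an EAGAIN cannot leak an object, and EAGAIN returns control to the caller instead of retrying in a tight loop.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical stand-in for mca_oob_tcp_pending_connection_t. */
typedef struct pending_conn {
    int fd;                     /* accepted socket descriptor */
    struct pending_conn *next;  /* singly linked list of pending connections */
} pending_conn_t;

/*
 * Drain the accept queue of a non-blocking listening socket.
 * Accepted connections are pushed onto *head.
 * Returns the number of connections accepted before the queue ran dry,
 * or -1 on a hard error (connections already accepted stay on *head).
 */
int drain_accept_queue(int listen_fd, pending_conn_t **head)
{
    int accepted = 0;

    assert(listen_fd >= 0 && head != NULL);

    for (;;) {
        int fd = accept(listen_fd, NULL, NULL);
        if (fd < 0) {
            if (errno == EINTR) {
                continue;           /* interrupted by a signal: retry */
            }
            if (errno == EAGAIN || errno == EWOULDBLOCK) {
                return accepted;    /* nothing left to accept: back to the event loop */
            }
            return -1;              /* hard error: let the caller decide */
        }

        /* Allocate only once there is a connection to store. */
        pending_conn_t *pc = malloc(sizeof(*pc));
        if (pc == NULL) {
            close(fd);
            return -1;
        }
        pc->fd = fd;
        pc->next = *head;
        *head = pc;
        accepted++;
    }
}
```

Because the function returns as soon as the queue is empty, the caller is expected to wait for the next readable event (select, poll, or an event library) before calling it again. That wait is what avoids the CPU spin.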
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Thu, May 14, 2015 at 8:10 PM, <git...@crest.iu.edu> wrote:
>>>>>
>>>>>> This is an automated email from the git hooks/post-receive script. It was
>>>>>> generated because a ref change was pushed to the repository containing
>>>>>> the project "open-mpi/ompi".
>>>>>>
>>>>>> The branch, master has been updated
>>>>>>        via  8e30579e6efab580cf9cf1bec8f8df1376b7e9ef (commit)
>>>>>>       from  1488e82efd1d09c30ba46dfa00b89e623623272f (commit)
>>>>>>
>>>>>> Those revisions listed above that are new to this repository have
>>>>>> not appeared on any other notification email; so we list those
>>>>>> revisions in full, below.
>>>>>>
>>>>>> - Log
>>>>>> -----------------------------------------------------------------
>>>>>>
>>>>>> https://github.com/open-mpi/ompi/commit/8e30579e6efab580cf9cf1bec8f8df1376b7e9ef
>>>>>>
>>>>>> commit 8e30579e6efab580cf9cf1bec8f8df1376b7e9ef
>>>>>> Author: Ralph Castain <r...@open-mpi.org>
>>>>>> Date:   Thu May 14 18:09:13 2015 -0600
>>>>>>
>>>>>>     The Mac appears to have problems with the keepalive support -
>>>>>> once keepalive starts, the memory footprint soars. So disable keepalive
>>>>>> on the Mac
>>>>>>
>>>>>> diff --git a/config/opal_check_os_flavors.m4 b/config/opal_check_os_flavors.m4
>>>>>> index d1d124d..4939560 100644
>>>>>> --- a/config/opal_check_os_flavors.m4
>>>>>> +++ b/config/opal_check_os_flavors.m4
>>>>>> @@ -57,6 +57,12 @@ AC_DEFUN([OPAL_CHECK_OS_FLAVORS],
>>>>>>                         [$opal_have_solaris],
>>>>>>                         [Whether or not we have solaris])
>>>>>>
>>>>>> +    AS_IF([test "$opal_found_apple" = "yes"],
>>>>>> +          [opal_have_mac=1], [opal_have_mac=0])
>>>>>> +    AC_DEFINE_UNQUOTED([OPAL_HAVE_MAC],
>>>>>> +                       [$opal_have_mac],
>>>>>> +                       [Whether or not we are on a Mac])
>>>>>> +
>>>>>>      # check for sockaddr_in (a good sign we have TCP)
>>>>>>      AC_CHECK_HEADERS([netdb.h netinet/in.h netinet/tcp.h])
>>>>>>      AC_CHECK_TYPES([struct sockaddr_in],
>>>>>> diff --git a/orte/mca/oob/tcp/oob_tcp_common.c b/orte/mca/oob/tcp/oob_tcp_common.c
>>>>>> index a768472..e3decf2 100644
>>>>>> --- a/orte/mca/oob/tcp/oob_tcp_common.c
>>>>>> +++ b/orte/mca/oob/tcp/oob_tcp_common.c
>>>>>> @@ -72,7 +72,7 @@
>>>>>>  /**
>>>>>>   * Set socket buffering
>>>>>>   */
>>>>>> -
>>>>>> +#if defined(SO_KEEPALIVE) && !OPAL_HAVE_MAC
>>>>>>  static void set_keepalive(int sd)
>>>>>>  {
>>>>>>      int option;
>>>>>> @@ -146,6 +146,7 @@ static void set_keepalive(int sd)
>>>>>>      }
>>>>>>  #endif  // TCP_KEEPCNT
>>>>>>  }
>>>>>> +#endif //SO_KEEPALIVE
>>>>>>
>>>>>>  void orte_oob_tcp_set_socket_options(int sd)
>>>>>>  {
>>>>>> @@ -181,7 +182,7 @@ void orte_oob_tcp_set_socket_options(int sd)
>>>>>>                              opal_socket_errno);
>>>>>>      }
>>>>>>  #endif
>>>>>> -#if defined(SO_KEEPALIVE)
>>>>>> +#if defined(SO_KEEPALIVE) && !OPAL_HAVE_MAC
>>>>>>      if (0 < mca_oob_tcp_component.keepalive_time) {
>>>>>>          set_keepalive(sd);
>>>>>>      }
>>>>>> diff --git a/orte/mca/oob/tcp/oob_tcp_component.c b/orte/mca/oob/tcp/oob_tcp_component.c
>>>>>> index dd1af2a..372ed4c 100644
>>>>>> --- a/orte/mca/oob/tcp/oob_tcp_component.c
>>>>>> +++ b/orte/mca/oob/tcp/oob_tcp_component.c
>>>>>> @@ -404,7 +404,7 @@ static int tcp_component_register(void)
>>>>>>
>>>>>>  &mca_oob_tcp_component.disable_ipv6_family);
>>>>>>  #endif
>>>>>>
>>>>>> -
>>>>>> +#if !OPAL_HAVE_MAC
>>>>>>      mca_oob_tcp_component.keepalive_time = 10;
>>>>>>      (void)mca_base_component_var_register(component, "keepalive_time",
>>>>>>                                            "Idle time in seconds before starting to send keepalives (num <= 0 ----> disable keepalive)",
>>>>>> @@ -427,7 +427,8 @@ static int tcp_component_register(void)
>>>>>>                                            OPAL_INFO_LVL_9,
>>>>>>
>>>>>>  MCA_BASE_VAR_SCOPE_READONLY,
>>>>>>
>>>>>>  &mca_oob_tcp_component.keepalive_probes);
>>>>>> -
>>>>>> +#endif
>>>>>> +
>>>>>>      mca_oob_tcp_component.retry_delay = 0;
>>>>>>      (void)mca_base_component_var_register(component, "retry_delay",
>>>>>>                                            "Time (in sec) to wait before trying to connect to peer again",
>>>>>>
>>>>>>
>>>>>>
>>>>>> -----------------------------------------------------------------------
>>>>>>
>>>>>> Summary of changes:
>>>>>>  config/opal_check_os_flavors.m4      | 6 ++++++
>>>>>>  orte/mca/oob/tcp/oob_tcp_common.c    | 5 +++--
>>>>>>  orte/mca/oob/tcp/oob_tcp_component.c | 5 +++--
>>>>>>  3 files changed, 12 insertions(+), 4 deletions(-)
>>>>>>
>>>>>>
>>>>>> hooks/post-receive
>>>>>> --
>>>>>> open-mpi/ompi
>>>>>> _______________________________________________
>>>>>> ompi-commits mailing list
>>>>>> ompi-comm...@open-mpi.org
>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/ompi-commits
>>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> devel mailing list
>>>>> de...@open-mpi.org
>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>> Link to this post:
>>>>> http://www.open-mpi.org/community/lists/devel/2015/05/17401.php
>>>>>