Did some more digging, and it turns out that Linux specifies the keep alive
time interval in seconds - and Mac (for some strange reason) uses
milliseconds. Hence the difference in behavior.

So I could replace the current commit with one that multiplies the keep
alive interval by 1000 if we are on a Mac. However, we don't really need
keep alive at all on the Mac, so I'm wondering if we shouldn't just leave
it turned off?
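
A minimal sketch of the scaling option described above, assuming (as this
message claims, not verified here) that the Mac expects the keepalive idle
time in milliseconds while Linux expects seconds. It keys off the
compiler-defined __APPLE__ macro. The names KEEPALIVE_UNITS_PER_SEC and
keepalive_idle_units are hypothetical and do not come from the Open MPI tree.

```c
#include <assert.h>
#include <limits.h>

/* ASSUMPTION (from the message above, not verified here): the Mac wants the
 * keepalive idle time in milliseconds, every other platform in seconds. */
#if defined(__APPLE__)
#define KEEPALIVE_UNITS_PER_SEC 1000
#else
#define KEEPALIVE_UNITS_PER_SEC 1
#endif

/* Hypothetical helper: convert an idle time given in seconds into the
 * platform's units. Values <= 0 mean "keepalive disabled" and map to 0;
 * values that would overflow an int are clamped to INT_MAX. */
static int keepalive_idle_units(int seconds)
{
    int scaled;

    if (seconds <= 0) {
        return 0;
    }
    if (seconds > INT_MAX / KEEPALIVE_UNITS_PER_SEC) {
        return INT_MAX;
    }
    scaled = seconds * KEEPALIVE_UNITS_PER_SEC;
    assert(scaled >= seconds);
    return scaled;
}
```

The result would be used as the socket option value in place of the raw
number of seconds, for example keepalive_idle_units(10) for a 10 second idle
time. The units should be checked against each platform's tcp man page before
relying on this.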

I confess I don't care either way.
Ralph


On Thu, May 14, 2015 at 10:46 PM, George Bosilca <bosi...@icl.utk.edu>
wrote:

> In the worst case, i.e. no other solution is possible, OS X can be
> identified by the existence of the macro __APPLE__. There is no need to
> have OPAL_HAVE_MAC.
>
>   George.
>
> On Thu, May 14, 2015 at 11:12 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> Interesting - as I said, I'll take a look. In either case, the keep alive
>> on the Mac is unnecessary as it is always a standalone scenario - no value
>> in running it. So the "fix" does no harm and just saves some useless
>> overhead.
>>
>>
>> On Thu, May 14, 2015 at 9:00 PM, George Bosilca <bosi...@icl.utk.edu>
>> wrote:
>>
>>> I'm sorry, Ralph, but what you proposed is not really a fix. My comment is
>>> based on a real execution of exactly the command you provided, with lldb
>>> attached to the process. What I see is millions of
>>> OBJ_NEW(mca_oob_tcp_pending_connection_t) calls because the EAGAIN is not
>>> correctly handled.
>>>
>>>   George.
>>>
>>>
>>> On Thu, May 14, 2015 at 10:56 PM, Ralph Castain <r...@open-mpi.org>
>>> wrote:
>>>
>>>> Yes - this is the fix for that issue
>>>>
>>>>
>>>> On Thu, May 14, 2015 at 8:54 PM, Howard Pritchard <hpprit...@gmail.com>
>>>> wrote:
>>>>
>>>>> Is this by any chance associated with issue 579?
>>>>>
>>>>>
>>>>> 2015-05-14 20:49 GMT-06:00 Ralph Castain <r...@open-mpi.org>:
>>>>>
>>>>>> I'll look at the lines you cite, but that clearly isn't the problem
>>>>>> we are seeing here. I can verify that because the test case:
>>>>>>
>>>>>> mpirun -n 1 sleep 1000
>>>>>>
>>>>>> does not open up any connections at all. Thus, the use-case you
>>>>>> describe never occurs - yet we still blow up in memory. If I simply tell
>>>>>> the OOB not to set keep alive, the problem goes away.
>>>>>>
>>>>>> It only happens on the Mac, and we never see Mac-based clusters, so
>>>>>> turning off keep alive on the Mac seems a pretty simple solution.
>>>>>>
>>>>>>
>>>>>> On Thu, May 14, 2015 at 8:43 PM, George Bosilca <bosi...@icl.utk.edu>
>>>>>> wrote:
>>>>>>
>>>>>>> Ralph,
>>>>>>>
>>>>>>> The code pushed in g8e30579 is clearly not the right solution.
>>>>>>>
>>>>>>> The problem starts in oob_tcp_listener.c line 742. A new
>>>>>>> mca_oob_tcp_pending_connection_t object is allocated to store the
>>>>>>> incoming connection. The accept() a few lines below fails with error
>>>>>>> code 0x23 (35), which is EAGAIN ("Resource temporarily unavailable")
>>>>>>> on OS X. Thus, the if at line 750 is skipped, and we reach line 763
>>>>>>> (a "continue") with 1) a connection not accepted, and 2) an allocated
>>>>>>> object not released. Voila!
>>>>>>>
>>>>>>> Freeing the pending_connection object is not the right approach
>>>>>>> either, as it will only remove the memory leak, but the process will
>>>>>>> become a CPU hog.
>>>>>>>
>>>>>>>   Thanks,
>>>>>>>     George.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, May 14, 2015 at 8:10 PM, <git...@crest.iu.edu> wrote:
>>>>>>>
>>>>>>>> This is an automated email from the git hooks/post-receive script.
>>>>>>>> It was generated because a ref change was pushed to the repository
>>>>>>>> containing the project "open-mpi/ompi".
>>>>>>>>
>>>>>>>> The branch, master has been updated
>>>>>>>>        via  8e30579e6efab580cf9cf1bec8f8df1376b7e9ef (commit)
>>>>>>>>       from  1488e82efd1d09c30ba46dfa00b89e623623272f (commit)
>>>>>>>>
>>>>>>>> Those revisions listed above that are new to this repository have
>>>>>>>> not appeared on any other notification email; so we list those
>>>>>>>> revisions in full, below.
>>>>>>>>
>>>>>>>> - Log
>>>>>>>> -----------------------------------------------------------------
>>>>>>>>
>>>>>>>> https://github.com/open-mpi/ompi/commit/8e30579e6efab580cf9cf1bec8f8df1376b7e9ef
>>>>>>>>
>>>>>>>> commit 8e30579e6efab580cf9cf1bec8f8df1376b7e9ef
>>>>>>>> Author: Ralph Castain <r...@open-mpi.org>
>>>>>>>> Date:   Thu May 14 18:09:13 2015 -0600
>>>>>>>>
>>>>>>>>     The Mac appears to have problems with the keepalive support -
>>>>>>>>     once keepalive starts, the memory footprint soars. So disable
>>>>>>>>     keepalive on the Mac
>>>>>>>>
>>>>>>>> diff --git a/config/opal_check_os_flavors.m4 b/config/opal_check_os_flavors.m4
>>>>>>>> index d1d124d..4939560 100644
>>>>>>>> --- a/config/opal_check_os_flavors.m4
>>>>>>>> +++ b/config/opal_check_os_flavors.m4
>>>>>>>> @@ -57,6 +57,12 @@ AC_DEFUN([OPAL_CHECK_OS_FLAVORS],
>>>>>>>>                         [$opal_have_solaris],
>>>>>>>>                         [Whether or not we have solaris])
>>>>>>>>
>>>>>>>> +    AS_IF([test "$opal_found_apple" = "yes"],
>>>>>>>> +          [opal_have_mac=1], [opal_have_mac=0])
>>>>>>>> +    AC_DEFINE_UNQUOTED([OPAL_HAVE_MAC],
>>>>>>>> +                       [$opal_have_mac],
>>>>>>>> +                       [Whether or not we are on a Mac])
>>>>>>>> +
>>>>>>>>      # check for sockaddr_in (a good sign we have TCP)
>>>>>>>>      AC_CHECK_HEADERS([netdb.h netinet/in.h netinet/tcp.h])
>>>>>>>>      AC_CHECK_TYPES([struct sockaddr_in],
>>>>>>>> diff --git a/orte/mca/oob/tcp/oob_tcp_common.c b/orte/mca/oob/tcp/oob_tcp_common.c
>>>>>>>> index a768472..e3decf2 100644
>>>>>>>> --- a/orte/mca/oob/tcp/oob_tcp_common.c
>>>>>>>> +++ b/orte/mca/oob/tcp/oob_tcp_common.c
>>>>>>>> @@ -72,7 +72,7 @@
>>>>>>>>  /**
>>>>>>>>   * Set socket buffering
>>>>>>>>   */
>>>>>>>> -
>>>>>>>> +#if defined(SO_KEEPALIVE) && !OPAL_HAVE_MAC
>>>>>>>>  static void set_keepalive(int sd)
>>>>>>>>  {
>>>>>>>>      int option;
>>>>>>>> @@ -146,6 +146,7 @@ static void set_keepalive(int sd)
>>>>>>>>      }
>>>>>>>>  #endif  // TCP_KEEPCNT
>>>>>>>>  }
>>>>>>>> +#endif //SO_KEEPALIVE
>>>>>>>>
>>>>>>>>  void orte_oob_tcp_set_socket_options(int sd)
>>>>>>>>  {
>>>>>>>> @@ -181,7 +182,7 @@ void orte_oob_tcp_set_socket_options(int sd)
>>>>>>>>                              opal_socket_errno);
>>>>>>>>      }
>>>>>>>>  #endif
>>>>>>>> -#if defined(SO_KEEPALIVE)
>>>>>>>> +#if defined(SO_KEEPALIVE) && !OPAL_HAVE_MAC
>>>>>>>>      if (0 < mca_oob_tcp_component.keepalive_time) {
>>>>>>>>          set_keepalive(sd);
>>>>>>>>      }
>>>>>>>> diff --git a/orte/mca/oob/tcp/oob_tcp_component.c b/orte/mca/oob/tcp/oob_tcp_component.c
>>>>>>>> index dd1af2a..372ed4c 100644
>>>>>>>> --- a/orte/mca/oob/tcp/oob_tcp_component.c
>>>>>>>> +++ b/orte/mca/oob/tcp/oob_tcp_component.c
>>>>>>>> @@ -404,7 +404,7 @@ static int tcp_component_register(void)
>>>>>>>>                                            &mca_oob_tcp_component.disable_ipv6_family);
>>>>>>>>  #endif
>>>>>>>>
>>>>>>>> -
>>>>>>>> +#if !OPAL_HAVE_MAC
>>>>>>>>      mca_oob_tcp_component.keepalive_time = 10;
>>>>>>>>      (void)mca_base_component_var_register(component, "keepalive_time",
>>>>>>>>                                            "Idle time in seconds before starting to send keepalives (num <= 0 ----> disable keepalive)",
>>>>>>>> @@ -427,7 +427,8 @@ static int tcp_component_register(void)
>>>>>>>>                                            OPAL_INFO_LVL_9,
>>>>>>>>                                            MCA_BASE_VAR_SCOPE_READONLY,
>>>>>>>>                                            &mca_oob_tcp_component.keepalive_probes);
>>>>>>>> -
>>>>>>>> +#endif
>>>>>>>> +
>>>>>>>>      mca_oob_tcp_component.retry_delay = 0;
>>>>>>>>      (void)mca_base_component_var_register(component, "retry_delay",
>>>>>>>>                                            "Time (in sec) to wait before trying to connect to peer again",
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> -----------------------------------------------------------------------
>>>>>>>>
>>>>>>>> Summary of changes:
>>>>>>>>  config/opal_check_os_flavors.m4      | 6 ++++++
>>>>>>>>  orte/mca/oob/tcp/oob_tcp_common.c    | 5 +++--
>>>>>>>>  orte/mca/oob/tcp/oob_tcp_component.c | 5 +++--
>>>>>>>>  3 files changed, 12 insertions(+), 4 deletions(-)
>>>>>>>>
>>>>>>>>
>>>>>>>> hooks/post-receive
>>>>>>>> --
>>>>>>>> open-mpi/ompi
>>>>>>>> _______________________________________________
>>>>>>>> ompi-commits mailing list
>>>>>>>> ompi-comm...@open-mpi.org
>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/ompi-commits
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> devel mailing list
>>>>>>> de...@open-mpi.org
>>>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>>> Link to this post:
>>>>>>> http://www.open-mpi.org/community/lists/devel/2015/05/17401.php

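George's diagnosis in the quoted message (object allocated before accept(),
accept() fails with EAGAIN, the loop continues without releasing it) suggests
two properties an accept loop should have: allocate the tracking object only
after accept() succeeds, and treat EAGAIN as "nothing left to accept" so that
control returns to the event loop instead of spinning. The following is a
sketch of that pattern only. It is not the code in oob_tcp_listener.c and not
necessarily the fix that was committed. pending_connection_t, accept_fn_t,
drain_listener, and fake_accept are hypothetical names; the accept call is
passed in as a function pointer so the loop can be exercised without a real
socket.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical stand-in for mca_oob_tcp_pending_connection_t. */
typedef struct pending_connection {
    int fd;
    struct pending_connection *next;
} pending_connection_t;

/* Same shape as accept(2), so accept itself can be passed in. */
typedef int (*accept_fn_t)(int sd, struct sockaddr *addr, socklen_t *len);

/* Accept everything currently queued on a non-blocking listener.
 * Each accepted descriptor is pushed onto *pending. Returns the number
 * accepted, or -1 on a hard error (connections accepted before the
 * error stay on the list). */
static int drain_listener(int listen_sd, accept_fn_t do_accept,
                          pending_connection_t **pending)
{
    int accepted = 0;

    assert(NULL != do_accept && NULL != pending);
    for (;;) {
        struct sockaddr_storage addr;
        socklen_t len = sizeof(addr);
        pending_connection_t *pc;
        int fd = do_accept(listen_sd, (struct sockaddr *)&addr, &len);

        if (fd < 0) {
            if (EINTR == errno) {
                continue;           /* interrupted: retry the accept */
            }
            if (EAGAIN == errno || EWOULDBLOCK == errno) {
                return accepted;    /* queue empty: back to the event loop */
            }
            return -1;              /* hard error: caller decides */
        }
        /* Allocate only after accept succeeded, so a failed accept
         * can never leave an orphaned object behind. */
        pc = malloc(sizeof(*pc));
        if (NULL == pc) {
            close(fd);
            return -1;
        }
        pc->fd = fd;
        pc->next = *pending;
        *pending = pc;
        accepted++;
    }
}

/* Test double: hands out two fake descriptors, then reports EAGAIN. */
static int fake_calls = 0;
static int fake_accept(int sd, struct sockaddr *addr, socklen_t *len)
{
    (void)sd; (void)addr; (void)len;
    if (fake_calls++ < 2) {
        return 100 + fake_calls;
    }
    errno = EAGAIN;
    return -1;
}
```

With a real non-blocking listener the call would be
drain_listener(listen_sd, accept, &pending), made once the socket has been
reported readable.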