In the worst case, i.e. if no other solution is possible, OS X can be
identified by the existence of the macro __APPLE__. There is no need to
have OPAL_HAVE_MAC.
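
As a minimal sketch of that idea (not the Open MPI code), the compiler's predefined __APPLE__ macro can gate the keepalive decision without any configure-time define. The helper names keepalive_enabled_for_platform and should_set_keepalive are hypothetical and exist only for illustration; the "<= 0 disables" rule follows the keepalive_time help text in the quoted diff.

```c
#include <stdbool.h>

/* Hypothetical helper: reports whether keepalive should be used at all
 * on the platform this file is compiled for. Relies only on the
 * compiler-provided __APPLE__ macro, not on a configure-time define. */
bool keepalive_enabled_for_platform(void)
{
#if defined(__APPLE__)
    return false;   /* OS X / macOS: skip keepalive */
#else
    return true;    /* everything else: keepalive allowed */
#endif
}

/* Hypothetical helper: a keepalive_time <= 0 disables keepalive. */
bool should_set_keepalive(int keepalive_time)
{
    return keepalive_enabled_for_platform() && keepalive_time > 0;
}
```

A caller would check should_set_keepalive(time) before configuring the socket, so the platform test lives in one place instead of in several #if blocks.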

  George.

On Thu, May 14, 2015 at 11:12 PM, Ralph Castain <r...@open-mpi.org> wrote:

> Interesting - as I said, I'll take a look. In either case, the keep alive
> on the Mac is unnecessary as it is always a standalone scenario - no value
> in running it. So the "fix" does no harm and just saves some useless
> overhead.
>
>
> On Thu, May 14, 2015 at 9:00 PM, George Bosilca <bosi...@icl.utk.edu>
> wrote:
>
>> I'm sorry, Ralph, but what you proposed is not really a fix. My comment is
>> based on a real execution of exactly the command you provided, with lldb
>> attached to the process. What I see is millions of
>> OBJ_NEW(mca_oob_tcp_pending_connection_t) calls because the EAGAIN is not
>> correctly handled.
>>
>>   George.
>>
>>
>> On Thu, May 14, 2015 at 10:56 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>>> Yes - this is the fix for that issue
>>>
>>>
>>> On Thu, May 14, 2015 at 8:54 PM, Howard Pritchard <hpprit...@gmail.com>
>>> wrote:
>>>
>>>> Is this by any chance associated with issue 579?
>>>>
>>>>
>>>> 2015-05-14 20:49 GMT-06:00 Ralph Castain <r...@open-mpi.org>:
>>>>
>>>>> I'll look at the lines you cite, but that clearly isn't the problem we
>>>>> are seeing here. I can verify that because the test case:
>>>>>
>>>>> mpirun -n 1 sleep 1000
>>>>>
>>>>> does not open up any connections at all. Thus, the use-case you
>>>>> describe never occurs - yet we still blow up in memory. If I simply tell
>>>>> the OOB not to set keep alive, the problem goes away.
>>>>>
>>>>> It only happens on the Mac, and we never see Mac-based clusters, so
>>>>> turning off keep alive on the Mac seems a pretty simple solution.
>>>>>
>>>>>
>>>>> On Thu, May 14, 2015 at 8:43 PM, George Bosilca <bosi...@icl.utk.edu>
>>>>> wrote:
>>>>>
>>>>>> Ralph,
>>>>>>
>>>>>> The code pushed in g8e30579 is clearly not the right solution.
>>>>>>
>>>>>> The problem starts in oob_tcp_listener.c line 742. A new
>>>>>> mca_oob_tcp_pending_connection_t object is allocated to store the
>>>>>> incoming connection. The accept a few lines below fails with an error
>>>>>> code of 0x23, which means "Resource temporarily unavailable" on OS X
>>>>>> (i.e. EAGAIN). Thus, the if at line 750 is skipped, and we reach line 763
>>>>>> (a "continue") with 1) a connection not accepted, and 2) an allocated
>>>>>> object not released. Voila!
>>>>>>
>>>>>> Freeing the pending_connection object is not the right approach
>>>>>> either, as it will only remove the memory leak but the process will
>>>>>> become a CPU hog.
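
The failure mode described in the quoted message can be illustrated with a generic sketch of a non-blocking accept loop. This is not the Open MPI listener code. It assumes the usual pattern: allocate the tracking object only after accept() succeeds, and on EAGAIN stop and wait for the next readiness event instead of looping, which would avoid both the leak and the busy spin. The names pending_conn_t, drain_listener, fake_accept, and count_and_free are hypothetical; fake_accept stands in for accept(2) so the sketch runs without a network.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical stand-in for mca_oob_tcp_pending_connection_t. */
typedef struct {
    int fd;
} pending_conn_t;

/* Same shape as accept(2), so a fake can be substituted. */
typedef int (*accept_fn_t)(int, struct sockaddr *, socklen_t *);
typedef void (*handler_fn_t)(pending_conn_t *);

/* Accept every connection that is ready on a non-blocking listening
 * socket. Returns the number accepted, or -1 on a hard error. */
int drain_listener(int listen_sd, accept_fn_t do_accept, handler_fn_t handle)
{
    int accepted = 0;

    assert(NULL != do_accept && NULL != handle);
    for (;;) {
        int fd = do_accept(listen_sd, NULL, NULL);
        if (fd < 0) {
            if (EINTR == errno) {
                continue;        /* interrupted: retry at once */
            }
            if (EAGAIN == errno || EWOULDBLOCK == errno) {
                return accepted; /* queue empty: wait for the next event */
            }
            return -1;           /* hard error: let the caller decide */
        }
        /* Allocate only after accept() succeeded, so a failed accept
         * leaves nothing to release. */
        pending_conn_t *pc = malloc(sizeof(*pc));
        if (NULL == pc) {
            close(fd);
            return -1;
        }
        pc->fd = fd;
        handle(pc);              /* ownership passes to the handler */
        accepted++;
    }
}

/* Fake accept for illustration: two connections, then EAGAIN. */
static int fake_calls = 0;
int fake_accept(int sd, struct sockaddr *addr, socklen_t *len)
{
    (void)sd; (void)addr; (void)len;
    if (fake_calls++ < 2) {
        return 100 + fake_calls; /* pretend descriptor, never used as a real fd */
    }
    errno = EAGAIN;
    return -1;
}

/* Handler for illustration: counts the object and releases it. */
static int handled = 0;
void count_and_free(pending_conn_t *pc)
{
    handled++;
    free(pc);
}
```

With a real socket, accept would be passed as do_accept, and drain_listener would be called each time the event loop reports the listening descriptor as readable.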
>>>>>>
>>>>>>   Thanks,
>>>>>>     George.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, May 14, 2015 at 8:10 PM, <git...@crest.iu.edu> wrote:
>>>>>>
>>>>>>> This is an automated email from the git hooks/post-receive script. It was
>>>>>>> generated because a ref change was pushed to the repository containing
>>>>>>> the project "open-mpi/ompi".
>>>>>>>
>>>>>>> The branch, master has been updated
>>>>>>>        via  8e30579e6efab580cf9cf1bec8f8df1376b7e9ef (commit)
>>>>>>>       from  1488e82efd1d09c30ba46dfa00b89e623623272f (commit)
>>>>>>>
>>>>>>> Those revisions listed above that are new to this repository have
>>>>>>> not appeared on any other notification email; so we list those
>>>>>>> revisions in full, below.
>>>>>>>
>>>>>>> - Log
>>>>>>> -----------------------------------------------------------------
>>>>>>>
>>>>>>> https://github.com/open-mpi/ompi/commit/8e30579e6efab580cf9cf1bec8f8df1376b7e9ef
>>>>>>>
>>>>>>> commit 8e30579e6efab580cf9cf1bec8f8df1376b7e9ef
>>>>>>> Author: Ralph Castain <r...@open-mpi.org>
>>>>>>> Date:   Thu May 14 18:09:13 2015 -0600
>>>>>>>
>>>>>>>     The Mac appears to have problems with the keepalive support -
>>>>>>> once keepalive starts, the memory footprint soars. So disable keepalive
>>>>>>> on the Mac
>>>>>>>
>>>>>>> diff --git a/config/opal_check_os_flavors.m4 b/config/opal_check_os_flavors.m4
>>>>>>> index d1d124d..4939560 100644
>>>>>>> --- a/config/opal_check_os_flavors.m4
>>>>>>> +++ b/config/opal_check_os_flavors.m4
>>>>>>> @@ -57,6 +57,12 @@ AC_DEFUN([OPAL_CHECK_OS_FLAVORS],
>>>>>>>                         [$opal_have_solaris],
>>>>>>>                         [Whether or not we have solaris])
>>>>>>>
>>>>>>> +    AS_IF([test "$opal_found_apple" = "yes"],
>>>>>>> +          [opal_have_mac=1], [opal_have_mac=0])
>>>>>>> +    AC_DEFINE_UNQUOTED([OPAL_HAVE_MAC],
>>>>>>> +                       [$opal_have_mac],
>>>>>>> +                       [Whether or not we are on a Mac])
>>>>>>> +
>>>>>>>      # check for sockaddr_in (a good sign we have TCP)
>>>>>>>      AC_CHECK_HEADERS([netdb.h netinet/in.h netinet/tcp.h])
>>>>>>>      AC_CHECK_TYPES([struct sockaddr_in],
>>>>>>> diff --git a/orte/mca/oob/tcp/oob_tcp_common.c b/orte/mca/oob/tcp/oob_tcp_common.c
>>>>>>> index a768472..e3decf2 100644
>>>>>>> --- a/orte/mca/oob/tcp/oob_tcp_common.c
>>>>>>> +++ b/orte/mca/oob/tcp/oob_tcp_common.c
>>>>>>> @@ -72,7 +72,7 @@
>>>>>>>  /**
>>>>>>>   * Set socket buffering
>>>>>>>   */
>>>>>>> -
>>>>>>> +#if defined(SO_KEEPALIVE) && !OPAL_HAVE_MAC
>>>>>>>  static void set_keepalive(int sd)
>>>>>>>  {
>>>>>>>      int option;
>>>>>>> @@ -146,6 +146,7 @@ static void set_keepalive(int sd)
>>>>>>>      }
>>>>>>>  #endif  // TCP_KEEPCNT
>>>>>>>  }
>>>>>>> +#endif //SO_KEEPALIVE
>>>>>>>
>>>>>>>  void orte_oob_tcp_set_socket_options(int sd)
>>>>>>>  {
>>>>>>> @@ -181,7 +182,7 @@ void orte_oob_tcp_set_socket_options(int sd)
>>>>>>>                              opal_socket_errno);
>>>>>>>      }
>>>>>>>  #endif
>>>>>>> -#if defined(SO_KEEPALIVE)
>>>>>>> +#if defined(SO_KEEPALIVE) && !OPAL_HAVE_MAC
>>>>>>>      if (0 < mca_oob_tcp_component.keepalive_time) {
>>>>>>>          set_keepalive(sd);
>>>>>>>      }
>>>>>>> diff --git a/orte/mca/oob/tcp/oob_tcp_component.c b/orte/mca/oob/tcp/oob_tcp_component.c
>>>>>>> index dd1af2a..372ed4c 100644
>>>>>>> --- a/orte/mca/oob/tcp/oob_tcp_component.c
>>>>>>> +++ b/orte/mca/oob/tcp/oob_tcp_component.c
>>>>>>> @@ -404,7 +404,7 @@ static int tcp_component_register(void)
>>>>>>>                                            &mca_oob_tcp_component.disable_ipv6_family);
>>>>>>>  #endif
>>>>>>>
>>>>>>> -
>>>>>>> +#if !OPAL_HAVE_MAC
>>>>>>>      mca_oob_tcp_component.keepalive_time = 10;
>>>>>>>      (void)mca_base_component_var_register(component, "keepalive_time",
>>>>>>>                                            "Idle time in seconds before starting to send keepalives (num <= 0 ----> disable keepalive)",
>>>>>>> @@ -427,7 +427,8 @@ static int tcp_component_register(void)
>>>>>>>                                            OPAL_INFO_LVL_9,
>>>>>>>                                            MCA_BASE_VAR_SCOPE_READONLY,
>>>>>>>                                            &mca_oob_tcp_component.keepalive_probes);
>>>>>>> -
>>>>>>> +#endif
>>>>>>> +
>>>>>>>      mca_oob_tcp_component.retry_delay = 0;
>>>>>>>      (void)mca_base_component_var_register(component, "retry_delay",
>>>>>>>                                            "Time (in sec) to wait before trying to connect to peer again",
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> -----------------------------------------------------------------------
>>>>>>>
>>>>>>> Summary of changes:
>>>>>>>  config/opal_check_os_flavors.m4      | 6 ++++++
>>>>>>>  orte/mca/oob/tcp/oob_tcp_common.c    | 5 +++--
>>>>>>>  orte/mca/oob/tcp/oob_tcp_component.c | 5 +++--
>>>>>>>  3 files changed, 12 insertions(+), 4 deletions(-)
>>>>>>>
>>>>>>>
>>>>>>> hooks/post-receive
>>>>>>> --
>>>>>>> open-mpi/ompi
>>>>>>> _______________________________________________
>>>>>>> ompi-commits mailing list
>>>>>>> ompi-comm...@open-mpi.org
>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/ompi-commits
>>>>>>>
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> devel mailing list
>>>>>> de...@open-mpi.org
>>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>> Link to this post:
>>>>>> http://www.open-mpi.org/community/lists/devel/2015/05/17401.php
