I'm sorry, Ralph, but what you proposed is not really a fix. My comment is based
on an actual run of exactly the command you provided, with lldb attached
to the process. What I see is millions of
OBJ_NEW(mca_oob_tcp_pending_connection_t)
calls, because the EAGAIN is not correctly handled.

  George.
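
For illustration, here is a minimal sketch of the accept-loop pattern discussed
in this thread. It is not the Open MPI code: pending_conn_t,
make_nonblocking_listener and accept_ready_connections are hypothetical names.
The approach shown (call accept() first, allocate the tracking object only once
it succeeds, and return to the event loop on EAGAIN instead of retrying) is one
common way to avoid both the leak and the busy loop. It assumes the listening
socket is non-blocking and that the caller waits for readability before calling
again.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Hypothetical stand-in for mca_oob_tcp_pending_connection_t. */
typedef struct pending_conn {
    int fd;
    struct sockaddr_storage addr;
    struct pending_conn *next;
} pending_conn_t;

/* Hypothetical helper: non-blocking listener on an ephemeral loopback port. */
static int make_nonblocking_listener(void)
{
    struct sockaddr_in sin;
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        return -1;
    }
    memset(&sin, 0, sizeof(sin));
    sin.sin_family = AF_INET;
    sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    sin.sin_port = 0; /* let the kernel pick a free port */
    if (bind(fd, (struct sockaddr *)&sin, sizeof(sin)) < 0 ||
        listen(fd, 16) < 0 ||
        fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/*
 * Accept every connection currently queued on listen_fd and push a record
 * for each onto *head. Returns the number accepted, or -1 on a hard error.
 * Nothing is allocated unless accept() has already succeeded, so EAGAIN
 * cannot leak an object; and EAGAIN ends the loop, so it cannot spin.
 */
static int accept_ready_connections(int listen_fd, pending_conn_t **head)
{
    int accepted = 0;
    assert(listen_fd >= 0 && head != NULL);
    for (;;) {
        struct sockaddr_storage addr;
        socklen_t len = sizeof(addr);
        int fd = accept(listen_fd, (struct sockaddr *)&addr, &len);
        if (fd < 0) {
            if (errno == EINTR) {
                continue; /* interrupted by a signal: retry */
            }
            if (errno == EAGAIN || errno == EWOULDBLOCK) {
                return accepted; /* queue drained: back to the event loop */
            }
            return -1; /* real failure: let the caller decide */
        }
        pending_conn_t *pc = calloc(1, sizeof(*pc));
        if (pc == NULL) {
            close(fd);
            return -1;
        }
        pc->fd = fd;
        pc->addr = addr;
        pc->next = *head;
        *head = pc;
        accepted++;
    }
}
```

With no client queued, accept_ready_connections(fd, &head) returns 0 right away
and leaves head untouched; the caller would then go back to waiting on the
socket (select, poll, or an event library) rather than looping.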


On Thu, May 14, 2015 at 10:56 PM, Ralph Castain <r...@open-mpi.org> wrote:

> Yes - this is the fix for that issue
>
>
> On Thu, May 14, 2015 at 8:54 PM, Howard Pritchard <hpprit...@gmail.com>
> wrote:
>
>> Is this by any chance associated with issue 579?
>>
>>
>> 2015-05-14 20:49 GMT-06:00 Ralph Castain <r...@open-mpi.org>:
>>
>>> I'll look at the lines you cite, but that clearly isn't the problem we
>>> are seeing here. I can verify that because the test case:
>>>
>>> mpirun -n 1 sleep 1000
>>>
>>> does not open up any connections at all. Thus, the use case you describe
>>> never occurs - yet we still blow up in memory. If I simply tell the OOB not
>>> to set keepalive, the problem goes away.
>>>
>>> It only happens on Mac, and we never see Mac-based clusters, so turning
>>> off keepalive on the Mac seems a pretty simple solution.
>>>
>>>
>>> On Thu, May 14, 2015 at 8:43 PM, George Bosilca <bosi...@icl.utk.edu>
>>> wrote:
>>>
>>>> Ralph,
>>>>
>>>> The code pushed in g8e30579 is clearly not the right solution.
>>>>
>>>> The problem starts in oob_tcp_listener.c line 742. A new
>>>> mca_oob_tcp_pending_connection_t object is allocated to store the incoming
>>>> connection. The accept a few lines below fails with an error code of 0x23,
>>>> which means "resource temporarily unavailable" on OS X (i.e. EAGAIN). Thus,
>>>> the if at line 750 is skipped, and we reach line 763 (a "continue") with 1)
>>>> a connection not accepted, and 2) an allocated object not released. Voila!
>>>>
>>>> Freeing the pending_connection object is not the right approach either:
>>>> it would only remove the memory leak, and the process would still become
>>>> a CPU hog.
>>>>
>>>>   Thanks,
>>>>     George.
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, May 14, 2015 at 8:10 PM, <git...@crest.iu.edu> wrote:
>>>>
>>>>> This is an automated email from the git hooks/post-receive script. It was
>>>>> generated because a ref change was pushed to the repository containing
>>>>> the project "open-mpi/ompi".
>>>>>
>>>>> The branch, master has been updated
>>>>>        via  8e30579e6efab580cf9cf1bec8f8df1376b7e9ef (commit)
>>>>>       from  1488e82efd1d09c30ba46dfa00b89e623623272f (commit)
>>>>>
>>>>> Those revisions listed above that are new to this repository have
>>>>> not appeared on any other notification email; so we list those
>>>>> revisions in full, below.
>>>>>
>>>>> - Log -----------------------------------------------------------------
>>>>>
>>>>> https://github.com/open-mpi/ompi/commit/8e30579e6efab580cf9cf1bec8f8df1376b7e9ef
>>>>>
>>>>> commit 8e30579e6efab580cf9cf1bec8f8df1376b7e9ef
>>>>> Author: Ralph Castain <r...@open-mpi.org>
>>>>> Date:   Thu May 14 18:09:13 2015 -0600
>>>>>
>>>>>     The Mac appears to have problems with the keepalive support - once
>>>>>     keepalive starts, the memory footprint soars. So disable keepalive on the Mac
>>>>>
>>>>> diff --git a/config/opal_check_os_flavors.m4 b/config/opal_check_os_flavors.m4
>>>>> index d1d124d..4939560 100644
>>>>> --- a/config/opal_check_os_flavors.m4
>>>>> +++ b/config/opal_check_os_flavors.m4
>>>>> @@ -57,6 +57,12 @@ AC_DEFUN([OPAL_CHECK_OS_FLAVORS],
>>>>>                         [$opal_have_solaris],
>>>>>                         [Whether or not we have solaris])
>>>>>
>>>>> +    AS_IF([test "$opal_found_apple" = "yes"],
>>>>> +          [opal_have_mac=1], [opal_have_mac=0])
>>>>> +    AC_DEFINE_UNQUOTED([OPAL_HAVE_MAC],
>>>>> +                       [$opal_have_mac],
>>>>> +                       [Whether or not we are on a Mac])
>>>>> +
>>>>>      # check for sockaddr_in (a good sign we have TCP)
>>>>>      AC_CHECK_HEADERS([netdb.h netinet/in.h netinet/tcp.h])
>>>>>      AC_CHECK_TYPES([struct sockaddr_in],
>>>>> diff --git a/orte/mca/oob/tcp/oob_tcp_common.c b/orte/mca/oob/tcp/oob_tcp_common.c
>>>>> index a768472..e3decf2 100644
>>>>> --- a/orte/mca/oob/tcp/oob_tcp_common.c
>>>>> +++ b/orte/mca/oob/tcp/oob_tcp_common.c
>>>>> @@ -72,7 +72,7 @@
>>>>>  /**
>>>>>   * Set socket buffering
>>>>>   */
>>>>> -
>>>>> +#if defined(SO_KEEPALIVE) && !OPAL_HAVE_MAC
>>>>>  static void set_keepalive(int sd)
>>>>>  {
>>>>>      int option;
>>>>> @@ -146,6 +146,7 @@ static void set_keepalive(int sd)
>>>>>      }
>>>>>  #endif  // TCP_KEEPCNT
>>>>>  }
>>>>> +#endif //SO_KEEPALIVE
>>>>>
>>>>>  void orte_oob_tcp_set_socket_options(int sd)
>>>>>  {
>>>>> @@ -181,7 +182,7 @@ void orte_oob_tcp_set_socket_options(int sd)
>>>>>                              opal_socket_errno);
>>>>>      }
>>>>>  #endif
>>>>> -#if defined(SO_KEEPALIVE)
>>>>> +#if defined(SO_KEEPALIVE) && !OPAL_HAVE_MAC
>>>>>      if (0 < mca_oob_tcp_component.keepalive_time) {
>>>>>          set_keepalive(sd);
>>>>>      }
>>>>> diff --git a/orte/mca/oob/tcp/oob_tcp_component.c b/orte/mca/oob/tcp/oob_tcp_component.c
>>>>> index dd1af2a..372ed4c 100644
>>>>> --- a/orte/mca/oob/tcp/oob_tcp_component.c
>>>>> +++ b/orte/mca/oob/tcp/oob_tcp_component.c
>>>>> @@ -404,7 +404,7 @@ static int tcp_component_register(void)
>>>>>                                            &mca_oob_tcp_component.disable_ipv6_family);
>>>>>  #endif
>>>>>
>>>>> -
>>>>> +#if !OPAL_HAVE_MAC
>>>>>      mca_oob_tcp_component.keepalive_time = 10;
>>>>>      (void)mca_base_component_var_register(component, "keepalive_time",
>>>>>                                            "Idle time in seconds before starting to send keepalives (num <= 0 ----> disable keepalive)",
>>>>> @@ -427,7 +427,8 @@ static int tcp_component_register(void)
>>>>>                                            OPAL_INFO_LVL_9,
>>>>>                                            MCA_BASE_VAR_SCOPE_READONLY,
>>>>>                                            &mca_oob_tcp_component.keepalive_probes);
>>>>> -
>>>>> +#endif
>>>>> +
>>>>>      mca_oob_tcp_component.retry_delay = 0;
>>>>>      (void)mca_base_component_var_register(component, "retry_delay",
>>>>>                                            "Time (in sec) to wait before trying to connect to peer again",
>>>>>
>>>>>
>>>>> -----------------------------------------------------------------------
>>>>>
>>>>> Summary of changes:
>>>>>  config/opal_check_os_flavors.m4      | 6 ++++++
>>>>>  orte/mca/oob/tcp/oob_tcp_common.c    | 5 +++--
>>>>>  orte/mca/oob/tcp/oob_tcp_component.c | 5 +++--
>>>>>  3 files changed, 12 insertions(+), 4 deletions(-)
>>>>>
>>>>>
>>>>> hooks/post-receive
>>>>> --
>>>>> open-mpi/ompi
>>>>> _______________________________________________
>>>>> ompi-commits mailing list
>>>>> ompi-comm...@open-mpi.org
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/ompi-commits
>>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> devel mailing list
>>>> de...@open-mpi.org
>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>> Link to this post:
>>>> http://www.open-mpi.org/community/lists/devel/2015/05/17401.php
>>>>
>>>
>>>
>>>
>>
>>
>>
>
>
>