I'm sorry Ralph what you proposed is not really a fix. My comment is based on a real execution of exactly the command you provided with lldb attached to the process. What I see is millions of OBJ_NEW(mca_oob_tcp_pending_connection_t) because the EAGAIN is not correctly handled.
George. On Thu, May 14, 2015 at 10:56 PM, Ralph Castain <r...@open-mpi.org> wrote: > Yes - this is the fix for that issue > > > On Thu, May 14, 2015 at 8:54 PM, Howard Pritchard <hpprit...@gmail.com> > wrote: > >> Is this by any chance associated with issue 579? >> >> >> 2015-05-14 20:49 GMT-06:00 Ralph Castain <r...@open-mpi.org>: >> >>> I'll look at the lines you cite, but that clearly isn't the problem we >>> are seeing here. I can verify that because the test case: >>> >>> mpirun -n 1 sleep 1000 >>> >>> does not open up any connections at all. Thus, the use-case you describe >>> never occurs - yet we still blow up in memory. If I simply tell the OOB not >>> to set keep alive, the problem goes away. >>> >>> It only happens on Mac, and we never see Mac based clusters, so turning >>> off keep alive on the Mac seems a pretty simple solution. >>> >>> >>> On Thu, May 14, 2015 at 8:43 PM, George Bosilca <bosi...@icl.utk.edu> >>> wrote: >>> >>>> Ralph, >>>> >>>> The code pushed in g8e30579 is clearly not the right solution. >>>> >>>> The problem starts in oob_tcp_listener.c line 742. A new >>>> mca_oob_tcp_pending_connection_t object is allocated to store the incoming >>>> connection. The accept few lines below fails with an error code of 0x23 >>>> which means "resource temporary unavailable" on OS X (i.e. EAGAIN). Thus, >>>> the if at line 750 is skipped, and we reach line 763 (a "continue") with 1) >>>> a connection not accepted, and 2) an allocated object not release. Voila! >>>> >>>> Freeing the pending_connection object is not the right approach either, >>>> as it will only remove the memory leak but the process will become a CPU >>>> hog. >>>> >>>> Thanks, >>>> George. >>>> >>>> >>>> >>>> >>>> On Thu, May 14, 2015 at 8:10 PM, <git...@crest.iu.edu> wrote: >>>> >>>>> This is an automated email from the git hooks/post-receive script. It >>>>> was >>>>> generated because a ref change was pushed to the repository containing >>>>> the project "open-mpi/ompi". >>>>> >>>>> The branch, master has been updated >>>>> via 8e30579e6efab580cf9cf1bec8f8df1376b7e9ef (commit) >>>>> from 1488e82efd1d09c30ba46dfa00b89e623623272f (commit) >>>>> >>>>> Those revisions listed above that are new to this repository have >>>>> not appeared on any other notification email; so we list those >>>>> revisions in full, below. >>>>> >>>>> - Log ----------------------------------------------------------------- >>>>> >>>>> https://github.com/open-mpi/ompi/commit/8e30579e6efab580cf9cf1bec8f8df1376b7e9ef >>>>> >>>>> commit 8e30579e6efab580cf9cf1bec8f8df1376b7e9ef >>>>> Author: Ralph Castain <r...@open-mpi.org> >>>>> Date: Thu May 14 18:09:13 2015 -0600 >>>>> >>>>> The Mac appears to have problems with the keepalive support - once >>>>> keepalive starts, the memory footprint soars. So disable keepalive on the >>>>> Mac >>>>> >>>>> diff --git a/config/opal_check_os_flavors.m4 >>>>> b/config/opal_check_os_flavors.m4 >>>>> index d1d124d..4939560 100644 >>>>> --- a/config/opal_check_os_flavors.m4 >>>>> +++ b/config/opal_check_os_flavors.m4 >>>>> @@ -57,6 +57,12 @@ AC_DEFUN([OPAL_CHECK_OS_FLAVORS], >>>>> [$opal_have_solaris], >>>>> [Whether or not we have solaris]) >>>>> >>>>> + AS_IF([test "$opal_found_apple" = "yes"], >>>>> + [opal_have_mac=1], [opal_have_mac=0]) >>>>> + AC_DEFINE_UNQUOTED([OPAL_HAVE_MAC], >>>>> + [$opal_have_mac], >>>>> + [Whether or not we are on a Mac]) >>>>> + >>>>> # check for sockaddr_in (a good sign we have TCP) >>>>> AC_CHECK_HEADERS([netdb.h netinet/in.h netinet/tcp.h]) >>>>> AC_CHECK_TYPES([struct sockaddr_in], >>>>> diff --git a/orte/mca/oob/tcp/oob_tcp_common.c >>>>> b/orte/mca/oob/tcp/oob_tcp_common.c >>>>> index a768472..e3decf2 100644 >>>>> --- a/orte/mca/oob/tcp/oob_tcp_common.c >>>>> +++ b/orte/mca/oob/tcp/oob_tcp_common.c >>>>> @@ -72,7 +72,7 @@ >>>>> /** >>>>> * Set socket buffering >>>>> */ >>>>> - >>>>> +#if defined(SO_KEEPALIVE) && !OPAL_HAVE_MAC >>>>> static void set_keepalive(int sd) >>>>> { >>>>> int option; >>>>> @@ -146,6 +146,7 @@ static void set_keepalive(int sd) >>>>> } >>>>> #endif // TCP_KEEPCNT >>>>> } >>>>> +#endif //SO_KEEPALIVE >>>>> >>>>> void orte_oob_tcp_set_socket_options(int sd) >>>>> { >>>>> @@ -181,7 +182,7 @@ void orte_oob_tcp_set_socket_options(int sd) >>>>> opal_socket_errno); >>>>> } >>>>> #endif >>>>> -#if defined(SO_KEEPALIVE) >>>>> +#if defined(SO_KEEPALIVE) && !OPAL_HAVE_MAC >>>>> if (0 < mca_oob_tcp_component.keepalive_time) { >>>>> set_keepalive(sd); >>>>> } >>>>> diff --git a/orte/mca/oob/tcp/oob_tcp_component.c >>>>> b/orte/mca/oob/tcp/oob_tcp_component.c >>>>> index dd1af2a..372ed4c 100644 >>>>> --- a/orte/mca/oob/tcp/oob_tcp_component.c >>>>> +++ b/orte/mca/oob/tcp/oob_tcp_component.c >>>>> @@ -404,7 +404,7 @@ static int tcp_component_register(void) >>>>> >>>>> &mca_oob_tcp_component.disable_ipv6_family); >>>>> #endif >>>>> >>>>> - >>>>> +#if !OPAL_HAVE_MAC >>>>> mca_oob_tcp_component.keepalive_time = 10; >>>>> (void)mca_base_component_var_register(component, "keepalive_time", >>>>> "Idle time in seconds >>>>> before starting to send keepalives (num <= 0 ----> disable keepalive)", >>>>> @@ -427,7 +427,8 @@ static int tcp_component_register(void) >>>>> OPAL_INFO_LVL_9, >>>>> MCA_BASE_VAR_SCOPE_READONLY, >>>>> >>>>> &mca_oob_tcp_component.keepalive_probes); >>>>> - >>>>> +#endif >>>>> + >>>>> mca_oob_tcp_component.retry_delay = 0; >>>>> (void)mca_base_component_var_register(component, "retry_delay", >>>>> "Time (in sec) to wait >>>>> before trying to connect to peer again", >>>>> >>>>> >>>>> ----------------------------------------------------------------------- >>>>> >>>>> Summary of changes: >>>>> config/opal_check_os_flavors.m4 | 6 ++++++ >>>>> orte/mca/oob/tcp/oob_tcp_common.c | 5 +++-- >>>>> orte/mca/oob/tcp/oob_tcp_component.c | 5 +++-- >>>>> 3 files changed, 12 insertions(+), 4 deletions(-) >>>>> >>>>> >>>>> hooks/post-receive >>>>> -- >>>>> open-mpi/ompi >>>>> _______________________________________________ >>>>> ompi-commits mailing list >>>>> ompi-comm...@open-mpi.org >>>>> http://www.open-mpi.org/mailman/listinfo.cgi/ompi-commits >>>>> >>>> >>>> >>>> _______________________________________________ >>>> devel mailing list >>>> de...@open-mpi.org >>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel >>>> Link to this post: >>>> http://www.open-mpi.org/community/lists/devel/2015/05/17401.php >>>> >>> >>> >>> _______________________________________________ >>> devel mailing list >>> de...@open-mpi.org >>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel >>> Link to this post: >>> http://www.open-mpi.org/community/lists/devel/2015/05/17402.php >>> >> >> >> _______________________________________________ >> devel mailing list >> de...@open-mpi.org >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel >> Link to this post: >> http://www.open-mpi.org/community/lists/devel/2015/05/17403.php >> > > > _______________________________________________ > devel mailing list > de...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > Link to this post: > http://www.open-mpi.org/community/lists/devel/2015/05/17404.php >