I'll look at the lines you cite, but that clearly isn't the problem we are seeing here. I can verify that because the test case:
mpirun -n 1 sleep 1000 does not open up any connections at all. Thus, the use-case you describe never occurs - yet we still blow up in memory. If I simply tell the OOB not to set keep alive, the problem goes away. It only happens on Mac, and we never see Mac based clusters, so turning off keep alive on the Mac seems a pretty simple solution. On Thu, May 14, 2015 at 8:43 PM, George Bosilca <bosi...@icl.utk.edu> wrote: > Ralph, > > The code pushed in g8e30579 is clearly not the right solution. > > The problem starts in oob_tcp_listener.c line 742. A new > mca_oob_tcp_pending_connection_t object is allocated to store the incoming > connection. The accept few lines below fails with an error code of 0x23 > which means "resource temporary unavailable" on OS X (i.e. EAGAIN). Thus, > the if at line 750 is skipped, and we reach line 763 (a "continue") with 1) > a connection not accepted, and 2) an allocated object not release. Voila! > > Freeing the pending_connection object is not the right approach either, as > it will only remove the memory leak but the process will become a CPU hog. > > Thanks, > George. > > > > > On Thu, May 14, 2015 at 8:10 PM, <git...@crest.iu.edu> wrote: > >> This is an automated email from the git hooks/post-receive script. It was >> generated because a ref change was pushed to the repository containing >> the project "open-mpi/ompi". >> >> The branch, master has been updated >> via 8e30579e6efab580cf9cf1bec8f8df1376b7e9ef (commit) >> from 1488e82efd1d09c30ba46dfa00b89e623623272f (commit) >> >> Those revisions listed above that are new to this repository have >> not appeared on any other notification email; so we list those >> revisions in full, below. >> >> - Log ----------------------------------------------------------------- >> >> https://github.com/open-mpi/ompi/commit/8e30579e6efab580cf9cf1bec8f8df1376b7e9ef >> >> commit 8e30579e6efab580cf9cf1bec8f8df1376b7e9ef >> Author: Ralph Castain <r...@open-mpi.org> >> Date: Thu May 14 18:09:13 2015 -0600 >> >> The Mac appears to have problems with the keepalive support - once >> keepalive starts, the memory footprint soars. So disable keepalive on the >> Mac >> >> diff --git a/config/opal_check_os_flavors.m4 >> b/config/opal_check_os_flavors.m4 >> index d1d124d..4939560 100644 >> --- a/config/opal_check_os_flavors.m4 >> +++ b/config/opal_check_os_flavors.m4 >> @@ -57,6 +57,12 @@ AC_DEFUN([OPAL_CHECK_OS_FLAVORS], >> [$opal_have_solaris], >> [Whether or not we have solaris]) >> >> + AS_IF([test "$opal_found_apple" = "yes"], >> + [opal_have_mac=1], [opal_have_mac=0]) >> + AC_DEFINE_UNQUOTED([OPAL_HAVE_MAC], >> + [$opal_have_mac], >> + [Whether or not we are on a Mac]) >> + >> # check for sockaddr_in (a good sign we have TCP) >> AC_CHECK_HEADERS([netdb.h netinet/in.h netinet/tcp.h]) >> AC_CHECK_TYPES([struct sockaddr_in], >> diff --git a/orte/mca/oob/tcp/oob_tcp_common.c >> b/orte/mca/oob/tcp/oob_tcp_common.c >> index a768472..e3decf2 100644 >> --- a/orte/mca/oob/tcp/oob_tcp_common.c >> +++ b/orte/mca/oob/tcp/oob_tcp_common.c >> @@ -72,7 +72,7 @@ >> /** >> * Set socket buffering >> */ >> - >> +#if defined(SO_KEEPALIVE) && !OPAL_HAVE_MAC >> static void set_keepalive(int sd) >> { >> int option; >> @@ -146,6 +146,7 @@ static void set_keepalive(int sd) >> } >> #endif // TCP_KEEPCNT >> } >> +#endif //SO_KEEPALIVE >> >> void orte_oob_tcp_set_socket_options(int sd) >> { >> @@ -181,7 +182,7 @@ void orte_oob_tcp_set_socket_options(int sd) >> opal_socket_errno); >> } >> #endif >> -#if defined(SO_KEEPALIVE) >> +#if defined(SO_KEEPALIVE) && !OPAL_HAVE_MAC >> if (0 < mca_oob_tcp_component.keepalive_time) { >> set_keepalive(sd); >> } >> diff --git a/orte/mca/oob/tcp/oob_tcp_component.c >> b/orte/mca/oob/tcp/oob_tcp_component.c >> index dd1af2a..372ed4c 100644 >> --- a/orte/mca/oob/tcp/oob_tcp_component.c >> +++ b/orte/mca/oob/tcp/oob_tcp_component.c >> @@ -404,7 +404,7 @@ static int tcp_component_register(void) >> >> &mca_oob_tcp_component.disable_ipv6_family); >> #endif >> >> - >> +#if !OPAL_HAVE_MAC >> mca_oob_tcp_component.keepalive_time = 10; >> (void)mca_base_component_var_register(component, "keepalive_time", >> "Idle time in seconds before >> starting to send keepalives (num <= 0 ----> disable keepalive)", >> @@ -427,7 +427,8 @@ static int tcp_component_register(void) >> OPAL_INFO_LVL_9, >> MCA_BASE_VAR_SCOPE_READONLY, >> >> &mca_oob_tcp_component.keepalive_probes); >> - >> +#endif >> + >> mca_oob_tcp_component.retry_delay = 0; >> (void)mca_base_component_var_register(component, "retry_delay", >> "Time (in sec) to wait before >> trying to connect to peer again", >> >> >> ----------------------------------------------------------------------- >> >> Summary of changes: >> config/opal_check_os_flavors.m4 | 6 ++++++ >> orte/mca/oob/tcp/oob_tcp_common.c | 5 +++-- >> orte/mca/oob/tcp/oob_tcp_component.c | 5 +++-- >> 3 files changed, 12 insertions(+), 4 deletions(-) >> >> >> hooks/post-receive >> -- >> open-mpi/ompi >> _______________________________________________ >> ompi-commits mailing list >> ompi-comm...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/ompi-commits >> > > > _______________________________________________ > devel mailing list > de...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > Link to this post: > http://www.open-mpi.org/community/lists/devel/2015/05/17401.php >