Interesting - as I said, I'll take a look. In either case, the keep alive on the Mac is unnecessary as it is always a standalone scenario - no value in running it. So the "fix" does no harm and just saves some useless overhead.
On Thu, May 14, 2015 at 9:00 PM, George Bosilca <bosi...@icl.utk.edu> wrote:

> I'm sorry Ralph, but what you proposed is not really a fix. My comment is
> based on a real execution of exactly the command you provided, with lldb
> attached to the process. What I see is millions of
> OBJ_NEW(mca_oob_tcp_pending_connection_t)
> because the EAGAIN is not correctly handled.
>
> George.
>
>
> On Thu, May 14, 2015 at 10:56 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> Yes - this is the fix for that issue
>>
>>
>> On Thu, May 14, 2015 at 8:54 PM, Howard Pritchard <hpprit...@gmail.com>
>> wrote:
>>
>>> Is this by any chance associated with issue 579?
>>>
>>>
>>> 2015-05-14 20:49 GMT-06:00 Ralph Castain <r...@open-mpi.org>:
>>>
>>>> I'll look at the lines you cite, but that clearly isn't the problem we
>>>> are seeing here. I can verify that because the test case:
>>>>
>>>> mpirun -n 1 sleep 1000
>>>>
>>>> does not open up any connections at all. Thus, the use-case you
>>>> describe never occurs - yet we still blow up in memory. If I simply tell
>>>> the OOB not to set keep alive, the problem goes away.
>>>>
>>>> It only happens on Mac, and we never see Mac based clusters, so turning
>>>> off keep alive on the Mac seems a pretty simple solution.
>>>>
>>>>
>>>> On Thu, May 14, 2015 at 8:43 PM, George Bosilca <bosi...@icl.utk.edu>
>>>> wrote:
>>>>
>>>>> Ralph,
>>>>>
>>>>> The code pushed in g8e30579 is clearly not the right solution.
>>>>>
>>>>> The problem starts in oob_tcp_listener.c line 742. A new
>>>>> mca_oob_tcp_pending_connection_t object is allocated to store the
>>>>> incoming connection. The accept a few lines below fails with an error
>>>>> code of 0x23, which means "Resource temporarily unavailable" on OS X
>>>>> (i.e. EAGAIN). Thus, the if at line 750 is skipped, and we reach line
>>>>> 763 (a "continue") with 1) a connection not accepted, and 2) an
>>>>> allocated object not released. Voila!
>>>>>
>>>>> Freeing the pending_connection object is not the right approach
>>>>> either, as it will only remove the memory leak but the process will become
>>>>> a CPU hog.
>>>>>
>>>>> Thanks,
>>>>> George.
>>>>>
>>>>>
>>>>> On Thu, May 14, 2015 at 8:10 PM, <git...@crest.iu.edu> wrote:
>>>>>
>>>>>> This is an automated email from the git hooks/post-receive script. It was
>>>>>> generated because a ref change was pushed to the repository containing
>>>>>> the project "open-mpi/ompi".
>>>>>>
>>>>>> The branch, master has been updated
>>>>>>        via  8e30579e6efab580cf9cf1bec8f8df1376b7e9ef (commit)
>>>>>>       from  1488e82efd1d09c30ba46dfa00b89e623623272f (commit)
>>>>>>
>>>>>> Those revisions listed above that are new to this repository have
>>>>>> not appeared on any other notification email; so we list those
>>>>>> revisions in full, below.
>>>>>>
>>>>>> - Log -----------------------------------------------------------------
>>>>>>
>>>>>> https://github.com/open-mpi/ompi/commit/8e30579e6efab580cf9cf1bec8f8df1376b7e9ef
>>>>>>
>>>>>> commit 8e30579e6efab580cf9cf1bec8f8df1376b7e9ef
>>>>>> Author: Ralph Castain <r...@open-mpi.org>
>>>>>> Date:   Thu May 14 18:09:13 2015 -0600
>>>>>>
>>>>>>     The Mac appears to have problems with the keepalive support -
>>>>>>     once keepalive starts, the memory footprint soars.
>>>>>>     So disable keepalive on the Mac
>>>>>>
>>>>>> diff --git a/config/opal_check_os_flavors.m4 b/config/opal_check_os_flavors.m4
>>>>>> index d1d124d..4939560 100644
>>>>>> --- a/config/opal_check_os_flavors.m4
>>>>>> +++ b/config/opal_check_os_flavors.m4
>>>>>> @@ -57,6 +57,12 @@ AC_DEFUN([OPAL_CHECK_OS_FLAVORS],
>>>>>>                         [$opal_have_solaris],
>>>>>>                         [Whether or not we have solaris])
>>>>>>
>>>>>> +    AS_IF([test "$opal_found_apple" = "yes"],
>>>>>> +          [opal_have_mac=1], [opal_have_mac=0])
>>>>>> +    AC_DEFINE_UNQUOTED([OPAL_HAVE_MAC],
>>>>>> +                       [$opal_have_mac],
>>>>>> +                       [Whether or not we are on a Mac])
>>>>>> +
>>>>>>      # check for sockaddr_in (a good sign we have TCP)
>>>>>>      AC_CHECK_HEADERS([netdb.h netinet/in.h netinet/tcp.h])
>>>>>>      AC_CHECK_TYPES([struct sockaddr_in],
>>>>>> diff --git a/orte/mca/oob/tcp/oob_tcp_common.c b/orte/mca/oob/tcp/oob_tcp_common.c
>>>>>> index a768472..e3decf2 100644
>>>>>> --- a/orte/mca/oob/tcp/oob_tcp_common.c
>>>>>> +++ b/orte/mca/oob/tcp/oob_tcp_common.c
>>>>>> @@ -72,7 +72,7 @@
>>>>>>  /**
>>>>>>   * Set socket buffering
>>>>>>   */
>>>>>> -
>>>>>> +#if defined(SO_KEEPALIVE) && !OPAL_HAVE_MAC
>>>>>>  static void set_keepalive(int sd)
>>>>>>  {
>>>>>>      int option;
>>>>>> @@ -146,6 +146,7 @@ static void set_keepalive(int sd)
>>>>>>      }
>>>>>>  #endif // TCP_KEEPCNT
>>>>>>  }
>>>>>> +#endif //SO_KEEPALIVE
>>>>>>
>>>>>>  void orte_oob_tcp_set_socket_options(int sd)
>>>>>>  {
>>>>>> @@ -181,7 +182,7 @@ void orte_oob_tcp_set_socket_options(int sd)
>>>>>>                  opal_socket_errno);
>>>>>>      }
>>>>>>  #endif
>>>>>> -#if defined(SO_KEEPALIVE)
>>>>>> +#if defined(SO_KEEPALIVE) && !OPAL_HAVE_MAC
>>>>>>      if (0 < mca_oob_tcp_component.keepalive_time) {
>>>>>>          set_keepalive(sd);
>>>>>>      }
>>>>>> diff --git a/orte/mca/oob/tcp/oob_tcp_component.c b/orte/mca/oob/tcp/oob_tcp_component.c
>>>>>> index dd1af2a..372ed4c 100644
>>>>>> --- a/orte/mca/oob/tcp/oob_tcp_component.c
>>>>>> +++ b/orte/mca/oob/tcp/oob_tcp_component.c
>>>>>> @@ -404,7 +404,7 @@ static int tcp_component_register(void)
>>>>>>                                            &mca_oob_tcp_component.disable_ipv6_family);
>>>>>>  #endif
>>>>>>
>>>>>> -
>>>>>> +#if !OPAL_HAVE_MAC
>>>>>>      mca_oob_tcp_component.keepalive_time = 10;
>>>>>>      (void)mca_base_component_var_register(component, "keepalive_time",
>>>>>>                                            "Idle time in seconds before starting to send keepalives (num <= 0 ----> disable keepalive)",
>>>>>> @@ -427,7 +427,8 @@ static int tcp_component_register(void)
>>>>>>                                            OPAL_INFO_LVL_9,
>>>>>>                                            MCA_BASE_VAR_SCOPE_READONLY,
>>>>>>                                            &mca_oob_tcp_component.keepalive_probes);
>>>>>> -
>>>>>> +#endif
>>>>>> +
>>>>>>      mca_oob_tcp_component.retry_delay = 0;
>>>>>>      (void)mca_base_component_var_register(component, "retry_delay",
>>>>>>                                            "Time (in sec) to wait before trying to connect to peer again",
>>>>>>
>>>>>> -----------------------------------------------------------------------
>>>>>>
>>>>>> Summary of changes:
>>>>>>  config/opal_check_os_flavors.m4      | 6 ++++++
>>>>>>  orte/mca/oob/tcp/oob_tcp_common.c    | 5 +++--
>>>>>>  orte/mca/oob/tcp/oob_tcp_component.c | 5 +++--
>>>>>>  3 files changed, 12 insertions(+), 4 deletions(-)
>>>>>>
>>>>>>
>>>>>> hooks/post-receive
>>>>>> --
>>>>>> open-mpi/ompi
>>>>>> _______________________________________________
>>>>>> ompi-commits mailing list
>>>>>> ompi-comm...@open-mpi.org
>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/ompi-commits
>>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> devel mailing list
>>>>> de...@open-mpi.org
>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>> Link to this post:
>>>>> http://www.open-mpi.org/community/lists/devel/2015/05/17401.php
>>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> devel mailing list
>>>> de...@open-mpi.org
>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>> Link to this post:
>>>> http://www.open-mpi.org/community/lists/devel/2015/05/17402.php
>>>>
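
For reference, the EAGAIN handling George describes in the quoted thread could be sketched roughly as below. This is only a generic non-blocking accept pattern, not the actual Open MPI code or the eventual fix. The names pending_conn_t, conn_handler_t, release_conn, open_listener and drain_accept_queue are all made up for illustration. The idea is to allocate the pending-connection object only after accept() succeeds, so nothing can leak, and to return to the event loop on EAGAIN instead of looping again, so the process does not spin.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical stand-in for mca_oob_tcp_pending_connection_t. */
typedef struct {
    int fd;
    struct sockaddr_storage addr;
} pending_conn_t;

/* Hypothetical callback type: the handler takes ownership of conn. */
typedef void (*conn_handler_t)(pending_conn_t *conn);

/* Example handler that just closes the socket and frees the object. */
void release_conn(pending_conn_t *conn)
{
    close(conn->fd);
    free(conn);
}

/* Open a non-blocking listener on an ephemeral loopback port; -1 on error. */
int open_listener(void)
{
    struct sockaddr_in sa;
    int flags;
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        return -1;
    }
    memset(&sa, 0, sizeof(sa));
    sa.sin_family = AF_INET;
    sa.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    sa.sin_port = 0; /* let the kernel pick a port */
    flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0 ||
        bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0 ||
        listen(fd, 8) < 0 ||
        fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/*
 * Accept every connection currently queued on a non-blocking listening
 * socket. Returns the number accepted, or -1 on a hard error.
 */
int drain_accept_queue(int listen_fd, conn_handler_t handler)
{
    int accepted = 0;

    assert(listen_fd >= 0);
    for (;;) {
        struct sockaddr_storage addr;
        socklen_t addrlen = sizeof(addr);
        pending_conn_t *conn;
        int sd = accept(listen_fd, (struct sockaddr *)&addr, &addrlen);

        if (sd < 0) {
            if (errno == EINTR) {
                continue; /* interrupted by a signal: retry */
            }
            if (errno == EAGAIN || errno == EWOULDBLOCK) {
                return accepted; /* queue is empty: back to the event loop */
            }
            return -1; /* hard error */
        }

        /* Allocate only after accept() succeeded, so nothing leaks. */
        conn = malloc(sizeof(*conn));
        if (conn == NULL) {
            close(sd);
            return -1;
        }
        conn->fd = sd;
        conn->addr = addr;
        handler(conn);
        accepted++;
    }
}
```

A caller would invoke drain_accept_queue(fd, release_conn) from its read-event callback each time the listening socket is reported readable. Returning on EAGAIN rather than continuing is what avoids the CPU-hog behaviour mentioned in the thread.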