Good catch. I'd vote for the same behavior on OS X even if it's somewhat unnecessary. I.e., use keep alive, but with 1000x the value.
Sent from my phone. No type good.

On May 15, 2015, at 5:42 AM, Ralph Castain <r...@open-mpi.org> wrote:

Did some more digging, and it turns out that Linux specifies the keep alive time interval in seconds - and Mac (for some strange reason) uses milliseconds. Hence the difference in behavior.

So I could replace the current commit with one that multiplies the keep alive interval by 1000x if we are on a Mac. However, we don't really need keep alive at all on the Mac, so I'm wondering if we shouldn't just leave it turned off?

I confess I don't care either way

Ralph

On Thu, May 14, 2015 at 10:46 PM, George Bosilca <bosi...@icl.utk.edu> wrote:

In the worst case, i.e. no other solution is possible, OS X can be identified by the existence of the macro __APPLE__. There is no need to have OPAL_HAVE_MAC.

George.

On Thu, May 14, 2015 at 11:12 PM, Ralph Castain <r...@open-mpi.org> wrote:

Interesting - as I said, I'll take a look. In either case, the keep alive on the Mac is unnecessary as it is always a standalone scenario - no value in running it. So the "fix" does no harm and just saves some useless overhead.

On Thu, May 14, 2015 at 9:00 PM, George Bosilca <bosi...@icl.utk.edu> wrote:

I'm sorry Ralph what you proposed is not really a fix. My comment is based on a real execution of exactly the command you provided with lldb attached to the process. What I see is millions of OBJ_NEW(mca_oob_tcp_pending_connection_t) because the EAGAIN is not correctly handled.

George.

On Thu, May 14, 2015 at 10:56 PM, Ralph Castain <r...@open-mpi.org> wrote:

Yes - this is the fix for that issue

On Thu, May 14, 2015 at 8:54 PM, Howard Pritchard <hpprit...@gmail.com> wrote:

Is this by any chance associated with issue 579?
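George's suggestion to key off the compiler-predefined `__APPLE__` macro, rather than adding a configure-time `OPAL_HAVE_MAC`, could look roughly like the sketch below. It is only an illustration: `oob_keepalive_wanted` and `maybe_enable_keepalive` are hypothetical names, not functions in the Open MPI tree, and the sketch only switches `SO_KEEPALIVE` on or off. It deliberately leaves the idle-time units alone, since those should be checked against each platform's documentation.

```c
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical helper (not part of Open MPI): returns 1 when keepalive
 * should be configured on this platform, 0 otherwise. __APPLE__ is
 * predefined by compilers targeting OS X, so no configure check is needed. */
static int oob_keepalive_wanted(void)
{
#if defined(SO_KEEPALIVE) && !defined(__APPLE__)
    return 1;
#else
    return 0;
#endif
}

/* Hypothetical helper: enable SO_KEEPALIVE on sd when the platform wants it.
 * Returns 0 on success or when keepalive is skipped, -1 if setsockopt fails. */
static int maybe_enable_keepalive(int sd)
{
#if defined(SO_KEEPALIVE)
    if (oob_keepalive_wanted()) {
        int on = 1;
        return setsockopt(sd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on));
    }
#endif
    (void)sd;
    return 0;
}
```

A caller would invoke `maybe_enable_keepalive(sd)` wherever the socket options are set; on OS X the call becomes a no-op without any new autoconf macro.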
2015-05-14 20:49 GMT-06:00 Ralph Castain <r...@open-mpi.org>:

I'll look at the lines you cite, but that clearly isn't the problem we are seeing here. I can verify that because the test case:

mpirun -n 1 sleep 1000

does not open up any connections at all. Thus, the use-case you describe never occurs - yet we still blow up in memory.

If I simply tell the OOB not to set keep alive, the problem goes away. It only happens on Mac, and we never see Mac based clusters, so turning off keep alive on the Mac seems a pretty simple solution.

On Thu, May 14, 2015 at 8:43 PM, George Bosilca <bosi...@icl.utk.edu> wrote:

Ralph,

The code pushed in g8e30579 is clearly not the right solution. The problem starts in oob_tcp_listener.c line 742. A new mca_oob_tcp_pending_connection_t object is allocated to store the incoming connection. The accept a few lines below fails with an error code of 0x23, which means "resource temporarily unavailable" on OS X (i.e. EAGAIN). Thus, the if at line 750 is skipped, and we reach line 763 (a "continue") with 1) a connection not accepted, and 2) an allocated object not released. Voila!

Freeing the pending_connection object is not the right approach either, as it will only remove the memory leak but the process will become a CPU hog.

Thanks,
George.

On Thu, May 14, 2015 at 8:10 PM, <git...@crest.iu.edu> wrote:

This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "open-mpi/ompi".

The branch, master has been updated
       via  8e30579e6efab580cf9cf1bec8f8df1376b7e9ef (commit)
      from  1488e82efd1d09c30ba46dfa00b89e623623272f (commit)

Those revisions listed above that are new to this repository have not appeared on any other notification email; so we list those revisions in full, below.

- Log -----------------------------------------------------------------
https://github.com/open-mpi/ompi/commit/8e30579e6efab580cf9cf1bec8f8df1376b7e9ef

commit 8e30579e6efab580cf9cf1bec8f8df1376b7e9ef
Author: Ralph Castain <r...@open-mpi.org>
Date:   Thu May 14 18:09:13 2015 -0600

    The Mac appears to have problems with the keepalive support - once keepalive starts, the memory footprint soars. So disable keepalive on the Mac

diff --git a/config/opal_check_os_flavors.m4 b/config/opal_check_os_flavors.m4
index d1d124d..4939560 100644
--- a/config/opal_check_os_flavors.m4
+++ b/config/opal_check_os_flavors.m4
@@ -57,6 +57,12 @@ AC_DEFUN([OPAL_CHECK_OS_FLAVORS],
                        [$opal_have_solaris],
                        [Whether or not we have solaris])
 
+    AS_IF([test "$opal_found_apple" = "yes"],
+          [opal_have_mac=1], [opal_have_mac=0])
+    AC_DEFINE_UNQUOTED([OPAL_HAVE_MAC],
+                       [$opal_have_mac],
+                       [Whether or not we are on a Mac])
+
     # check for sockaddr_in (a good sign we have TCP)
     AC_CHECK_HEADERS([netdb.h netinet/in.h netinet/tcp.h])
     AC_CHECK_TYPES([struct sockaddr_in],
diff --git a/orte/mca/oob/tcp/oob_tcp_common.c b/orte/mca/oob/tcp/oob_tcp_common.c
index a768472..e3decf2 100644
--- a/orte/mca/oob/tcp/oob_tcp_common.c
+++ b/orte/mca/oob/tcp/oob_tcp_common.c
@@ -72,7 +72,7 @@
 /**
  * Set socket buffering
  */
-
+#if defined(SO_KEEPALIVE) && !OPAL_HAVE_MAC
 static void set_keepalive(int sd)
 {
     int option;
@@ -146,6 +146,7 @@ static void set_keepalive(int sd)
     }
 #endif  // TCP_KEEPCNT
 }
+#endif //SO_KEEPALIVE
 
 void orte_oob_tcp_set_socket_options(int sd)
 {
@@ -181,7 +182,7 @@ void orte_oob_tcp_set_socket_options(int sd)
                     opal_socket_errno);
     }
 #endif
-#if defined(SO_KEEPALIVE)
+#if defined(SO_KEEPALIVE) && !OPAL_HAVE_MAC
     if (0 < mca_oob_tcp_component.keepalive_time) {
         set_keepalive(sd);
     }
diff --git a/orte/mca/oob/tcp/oob_tcp_component.c b/orte/mca/oob/tcp/oob_tcp_component.c
index dd1af2a..372ed4c 100644
--- a/orte/mca/oob/tcp/oob_tcp_component.c
+++ b/orte/mca/oob/tcp/oob_tcp_component.c
@@ -404,7 +404,7 @@ static int tcp_component_register(void)
                                           &mca_oob_tcp_component.disable_ipv6_family);
 #endif
 
-
+#if !OPAL_HAVE_MAC
     mca_oob_tcp_component.keepalive_time = 10;
     (void)mca_base_component_var_register(component, "keepalive_time",
                                           "Idle time in seconds before starting to send keepalives (num <= 0 ----> disable keepalive)",
@@ -427,7 +427,8 @@ static int tcp_component_register(void)
                                           OPAL_INFO_LVL_9,
                                           MCA_BASE_VAR_SCOPE_READONLY,
                                           &mca_oob_tcp_component.keepalive_probes);
-
+#endif
+
     mca_oob_tcp_component.retry_delay = 0;
     (void)mca_base_component_var_register(component, "retry_delay",
                                           "Time (in sec) to wait before trying to connect to peer again",

-----------------------------------------------------------------------

Summary of changes:
 config/opal_check_os_flavors.m4      | 6 ++++++
 orte/mca/oob/tcp/oob_tcp_common.c    | 5 +++--
 orte/mca/oob/tcp/oob_tcp_component.c | 5 +++--
 3 files changed, 12 insertions(+), 4 deletions(-)

hooks/post-receive
--
open-mpi/ompi
_______________________________________________
ompi-commits mailing list
ompi-comm...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/ompi-commits

_______________________________________________
devel mailing list
de...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
Link to this post: http://www.open-mpi.org/community/lists/devel/2015/05/17401.php
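
For readers following the accept-path discussion earlier in the thread, here is a minimal sketch of one way to handle EAGAIN without leaking the pending object and without spinning. It is not the code in oob_tcp_listener.c and not necessarily the fix that was adopted: `pending_connection_t`, `pending_handler_t`, and `drain_accept_queue` are hypothetical stand-ins, and the sketch assumes a non-blocking listening socket driven by some outer event loop.

```c
#include <errno.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical stand-in for mca_oob_tcp_pending_connection_t. */
typedef struct pending_connection {
    int fd;
    struct sockaddr_storage addr;
} pending_connection_t;

/* Hypothetical callback type: the handler takes ownership of the object. */
typedef void (*pending_handler_t)(pending_connection_t *pending);

/*
 * Accept whatever is queued on a non-blocking listening socket.
 * Returns the number of connections accepted, or -1 on a hard error.
 */
static int drain_accept_queue(int listen_sd, pending_handler_t handler)
{
    int accepted = 0;

    for (;;) {
        pending_connection_t *pending = calloc(1, sizeof(*pending));
        if (NULL == pending) {
            return -1;
        }

        socklen_t len = sizeof(pending->addr);
        pending->fd = accept(listen_sd, (struct sockaddr *)&pending->addr, &len);
        if (pending->fd < 0) {
            int err = errno;
            free(pending);              /* released on every failure path */
            if (EINTR == err) {
                continue;               /* interrupted: retry right away */
            }
            if (EAGAIN == err || EWOULDBLOCK == err) {
                return accepted;        /* nothing queued: leave the loop */
            }
            errno = err;
            return -1;
        }

        if (NULL != handler) {
            handler(pending);           /* ownership moves to the handler */
        } else {
            close(pending->fd);         /* no handler: drop the connection */
            free(pending);
        }
        ++accepted;
    }
}
```

The intent is that the caller goes back to waiting in select/poll (or the event library) after `drain_accept_queue` returns, and calls it again only when the listening socket is reported readable. That is what keeps the EAGAIN case from turning into either a leak or a busy loop, provided the outer loop really does block.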