I looked at this in a bit more detail this morning. SHORT VERSION -------------
I think that the real issue is that we shouldn't be setting KEEPALIVE on the listening sockets (we should only be setting these values on accepted/connected sockets). I submitted a PR for this: https://github.com/open-mpi/ompi/pull/588; there are several commits on it, the key commit is https://github.com/jsquyres/ompi/commit/ece484c. The OS X kernel reports its default keep alive values in milliseconds, and the Linux kernel reports its default values in seconds. But I can't find definitive references to the units for setsoctopt(TCP_KEEPALIVE) (and friends) -- empirical testing on OS X shows that TCP_KEEPALIVE is definitely specified in *seconds*. MORE DETAIL ----------- We only need to set KEEPALIVE on the accepted/connected sockets; there's no point in setting KEEPALIVE -- or any of our usual socket options (e.g., RCVBUF, NODELAY, etc.) -- on the listening sockets because the listening sockets are not used to send/receive anything. Specifically: not setting these values for the listening sockets avoids the memory leak that started this whole kerfuffle. As for whether the units are in seconds vs. milliseconds: I found references to how the Linux and OS X kernels report the values (Linux: seconds, OS X: milliseconds), but I can't find any references to the units used when settings the values via setsockopt(). Most references I found imply *seconds*. I think either OS X may be buggy, or the behavior of keepalives on listening sockets is just different vs. keep alive behavior on connected sockets. On OS X, when I setsockopt(sd, SOL_SOCKET, SO_KEEPALIVE, ...) to a value of 10, I see that the listener socket doesn't generate its first EAGAIN for 10 seconds. But then it generates EAGAINs constantly after that, regardless of the TCP_KEEPINTVL value that is set. Specifically, I tried setting TCP_KEEPINVTL to 1 and 1000 -- there was no change in frequency of the EAGAINs after the first one was generated. --> But that may be moot because KEEPALIVE behavior on an unconnected socket may be ... weird/undefined/whatever. However, the OS X kernel clearly reports its default keepavlie values in milliseconds, and the Linux kernel clearly reports its default values in seconds. I ran the following commands on my systems, based on information I found on http://www.gnugk.org/keepalive.html: Linux / RHEL 6.5 / 2.6.32 kernel (this is clearly in seconds): $ sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_time = 1800 $ sysctl net.ipv4.tcp_keepalive_intvl net.ipv4.tcp_keepalive_intvl = 75 Linux / Ubuntu 14.04.2 / 3.16.0 kernel (this is clearly in seconds): $ sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_time = 7200 $ sysctl net.ipv4.tcp_keepalive_intvl net.ipv4.tcp_keepalive_intvl = 75 OS X 10.10 / Yosemite (this is clearly in microseconds): $ sysctl net.inet.tcp.keepidle net.inet.tcp.keepidle: 7200000 $ sysctl net.inet.tcp.keepintvl net.inet.tcp.keepintvl: 75000 > On May 15, 2015, at 8:42 AM, Ralph Castain <r...@open-mpi.org> wrote: > > Did some more digging, and it turns out that Linux specifies the keep alive > time interval in seconds - and Mac (for some strange reason) uses > milliseconds. Hence the difference in behavior. > > So I could replace the current commit with one that multiplies the keep alive > interval by 1000x if we are on a Mac. However, we don't really need keep > alive at all on the Mac, so I'm wondering if we shouldn't just leave it > turned off? > > I confess I don't care either way > Ralph > > > On Thu, May 14, 2015 at 10:46 PM, George Bosilca <bosi...@icl.utk.edu> wrote: > In the worst case, i.e. no other solution is possible, OS X can be identified > by the existence of the macro __APPLE__. There is no need to have > OPAL_HAVE_MAC. > > George. > > On Thu, May 14, 2015 at 11:12 PM, Ralph Castain <r...@open-mpi.org> wrote: > Interesting - as I said, I'll take a look. In either case, the keep alive on > the Mac is unnecessary as it is always a standalone scenario - no value in > running it. So the "fix" does no harm and just saves some useless overhead. > > > On Thu, May 14, 2015 at 9:00 PM, George Bosilca <bosi...@icl.utk.edu> wrote: > I'm sorry Ralph what you proposed is not really a fix. My comment is based on > a real execution of exactly the command you provided with lldb attached to > the process. What I see is millions of > OBJ_NEW(mca_oob_tcp_pending_connection_t) because the EAGAIN is not correctly > handled. > > George. > > > On Thu, May 14, 2015 at 10:56 PM, Ralph Castain <r...@open-mpi.org> wrote: > Yes - this is the fix for that issue > > > On Thu, May 14, 2015 at 8:54 PM, Howard Pritchard <hpprit...@gmail.com> wrote: > Is this by any chance associated with issue 579? > > > 2015-05-14 20:49 GMT-06:00 Ralph Castain <r...@open-mpi.org>: > I'll look at the lines you cite, but that clearly isn't the problem we are > seeing here. I can verify that because the test case: > > mpirun -n 1 sleep 1000 > > does not open up any connections at all. Thus, the use-case you describe > never occurs - yet we still blow up in memory. If I simply tell the OOB not > to set keep alive, the problem goes away. > > It only happens on Mac, and we never see Mac based clusters, so turning off > keep alive on the Mac seems a pretty simple solution. > > > On Thu, May 14, 2015 at 8:43 PM, George Bosilca <bosi...@icl.utk.edu> wrote: > Ralph, > > The code pushed in g8e30579 is clearly not the right solution. > > The problem starts in oob_tcp_listener.c line 742. A new > mca_oob_tcp_pending_connection_t object is allocated to store the incoming > connection. The accept few lines below fails with an error code of 0x23 which > means "resource temporary unavailable" on OS X (i.e. EAGAIN). Thus, the if at > line 750 is skipped, and we reach line 763 (a "continue") with 1) a > connection not accepted, and 2) an allocated object not release. Voila! > > Freeing the pending_connection object is not the right approach either, as it > will only remove the memory leak but the process will become a CPU hog. > > Thanks, > George. > > > > > On Thu, May 14, 2015 at 8:10 PM, <git...@crest.iu.edu> wrote: > This is an automated email from the git hooks/post-receive script. It was > generated because a ref change was pushed to the repository containing > the project "open-mpi/ompi". > > The branch, master has been updated > via 8e30579e6efab580cf9cf1bec8f8df1376b7e9ef (commit) > from 1488e82efd1d09c30ba46dfa00b89e623623272f (commit) > > Those revisions listed above that are new to this repository have > not appeared on any other notification email; so we list those > revisions in full, below. > > - Log ----------------------------------------------------------------- > https://github.com/open-mpi/ompi/commit/8e30579e6efab580cf9cf1bec8f8df1376b7e9ef > > commit 8e30579e6efab580cf9cf1bec8f8df1376b7e9ef > Author: Ralph Castain <r...@open-mpi.org> > Date: Thu May 14 18:09:13 2015 -0600 > > The Mac appears to have problems with the keepalive support - once > keepalive starts, the memory footprint soars. So disable keepalive on the Mac > > diff --git a/config/opal_check_os_flavors.m4 b/config/opal_check_os_flavors.m4 > index d1d124d..4939560 100644 > --- a/config/opal_check_os_flavors.m4 > +++ b/config/opal_check_os_flavors.m4 > @@ -57,6 +57,12 @@ AC_DEFUN([OPAL_CHECK_OS_FLAVORS], > [$opal_have_solaris], > [Whether or not we have solaris]) > > + AS_IF([test "$opal_found_apple" = "yes"], > + [opal_have_mac=1], [opal_have_mac=0]) > + AC_DEFINE_UNQUOTED([OPAL_HAVE_MAC], > + [$opal_have_mac], > + [Whether or not we are on a Mac]) > + > # check for sockaddr_in (a good sign we have TCP) > AC_CHECK_HEADERS([netdb.h netinet/in.h netinet/tcp.h]) > AC_CHECK_TYPES([struct sockaddr_in], > diff --git a/orte/mca/oob/tcp/oob_tcp_common.c > b/orte/mca/oob/tcp/oob_tcp_common.c > index a768472..e3decf2 100644 > --- a/orte/mca/oob/tcp/oob_tcp_common.c > +++ b/orte/mca/oob/tcp/oob_tcp_common.c > @@ -72,7 +72,7 @@ > /** > * Set socket buffering > */ > - > +#if defined(SO_KEEPALIVE) && !OPAL_HAVE_MAC > static void set_keepalive(int sd) > { > int option; > @@ -146,6 +146,7 @@ static void set_keepalive(int sd) > } > #endif // TCP_KEEPCNT > } > +#endif //SO_KEEPALIVE > > void orte_oob_tcp_set_socket_options(int sd) > { > @@ -181,7 +182,7 @@ void orte_oob_tcp_set_socket_options(int sd) > opal_socket_errno); > } > #endif > -#if defined(SO_KEEPALIVE) > +#if defined(SO_KEEPALIVE) && !OPAL_HAVE_MAC > if (0 < mca_oob_tcp_component.keepalive_time) { > set_keepalive(sd); > } > diff --git a/orte/mca/oob/tcp/oob_tcp_component.c > b/orte/mca/oob/tcp/oob_tcp_component.c > index dd1af2a..372ed4c 100644 > --- a/orte/mca/oob/tcp/oob_tcp_component.c > +++ b/orte/mca/oob/tcp/oob_tcp_component.c > @@ -404,7 +404,7 @@ static int tcp_component_register(void) > > &mca_oob_tcp_component.disable_ipv6_family); > #endif > > - > +#if !OPAL_HAVE_MAC > mca_oob_tcp_component.keepalive_time = 10; > (void)mca_base_component_var_register(component, "keepalive_time", > "Idle time in seconds before > starting to send keepalives (num <= 0 ----> disable keepalive)", > @@ -427,7 +427,8 @@ static int tcp_component_register(void) > OPAL_INFO_LVL_9, > MCA_BASE_VAR_SCOPE_READONLY, > > &mca_oob_tcp_component.keepalive_probes); > - > +#endif > + > mca_oob_tcp_component.retry_delay = 0; > (void)mca_base_component_var_register(component, "retry_delay", > "Time (in sec) to wait before > trying to connect to peer again", > > > ----------------------------------------------------------------------- > > Summary of changes: > config/opal_check_os_flavors.m4 | 6 ++++++ > orte/mca/oob/tcp/oob_tcp_common.c | 5 +++-- > orte/mca/oob/tcp/oob_tcp_component.c | 5 +++-- > 3 files changed, 12 insertions(+), 4 deletions(-) > > > hooks/post-receive > -- > open-mpi/ompi > _______________________________________________ > ompi-commits mailing list > ompi-comm...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/ompi-commits > > > _______________________________________________ > devel mailing list > de...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > Link to this post: > http://www.open-mpi.org/community/lists/devel/2015/05/17401.php > > > _______________________________________________ > devel mailing list > de...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > Link to this post: > http://www.open-mpi.org/community/lists/devel/2015/05/17402.php > > > _______________________________________________ > devel mailing list > de...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > Link to this post: > http://www.open-mpi.org/community/lists/devel/2015/05/17403.php > > > _______________________________________________ > devel mailing list > de...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > Link to this post: > http://www.open-mpi.org/community/lists/devel/2015/05/17404.php > > > _______________________________________________ > devel mailing list > de...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > Link to this post: > http://www.open-mpi.org/community/lists/devel/2015/05/17405.php > > > _______________________________________________ > devel mailing list > de...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > Link to this post: > http://www.open-mpi.org/community/lists/devel/2015/05/17406.php > > > _______________________________________________ > devel mailing list > de...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > Link to this post: > http://www.open-mpi.org/community/lists/devel/2015/05/17407.php > > _______________________________________________ > devel mailing list > de...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > Link to this post: > http://www.open-mpi.org/community/lists/devel/2015/05/17408.php -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/