I looked at this in a bit more detail this morning.

SHORT VERSION
-------------

I think that the real issue is that we shouldn't be setting KEEPALIVE on the 
listening sockets (we should only be setting these values on accepted/connected 
sockets).

I submitted a PR for this: https://github.com/open-mpi/ompi/pull/588; there are 
several commits on it, the key commit is 
https://github.com/jsquyres/ompi/commit/ece484c.

The OS X kernel reports its default keep alive values in milliseconds, and the 
Linux kernel reports its default values in seconds.  But I can't find 
definitive references to the units for setsoctopt(TCP_KEEPALIVE) (and friends) 
-- empirical testing on OS X shows that TCP_KEEPALIVE is definitely specified 
in *seconds*.

MORE DETAIL
-----------

We only need to set KEEPALIVE on the accepted/connected sockets; there's no 
point in setting KEEPALIVE -- or any of our usual socket options (e.g., RCVBUF, 
NODELAY, etc.) -- on the listening sockets because the listening sockets are 
not used to send/receive anything.

Specifically: not setting these values for the listening sockets avoids the 
memory leak that started this whole kerfuffle.

As for whether the units are in seconds vs. milliseconds: I found references to 
how the Linux and OS X kernels report the values (Linux: seconds, OS X: 
milliseconds), but I can't find any references to the units used when settings 
the values via setsockopt().  Most references I found imply *seconds*.

I think either OS X may be buggy, or the behavior of keepalives on listening 
sockets is just different vs. keep alive behavior on connected sockets.  On OS 
X, when I setsockopt(sd, SOL_SOCKET, SO_KEEPALIVE, ...) to a value of 10, I see 
that the listener socket doesn't generate its first EAGAIN for 10 seconds.  But 
then it generates EAGAINs constantly after that, regardless of the 
TCP_KEEPINTVL value that is set.  Specifically, I tried setting TCP_KEEPINVTL 
to 1 and 1000 -- there was no change in frequency of the EAGAINs after the 
first one was generated.

--> But that may be moot because KEEPALIVE behavior on an unconnected socket 
may be ... weird/undefined/whatever.

However, the OS X kernel clearly reports its default keepavlie values in 
milliseconds, and the Linux kernel clearly reports its default values in 
seconds.  I ran the following commands on my systems, based on information I 
found on http://www.gnugk.org/keepalive.html:

Linux / RHEL 6.5 / 2.6.32 kernel (this is clearly in seconds):

$ sysctl net.ipv4.tcp_keepalive_time
net.ipv4.tcp_keepalive_time = 1800
$ sysctl net.ipv4.tcp_keepalive_intvl
net.ipv4.tcp_keepalive_intvl = 75

Linux / Ubuntu 14.04.2 / 3.16.0 kernel (this is clearly in seconds):

$ sysctl net.ipv4.tcp_keepalive_time
net.ipv4.tcp_keepalive_time = 7200
$ sysctl net.ipv4.tcp_keepalive_intvl
net.ipv4.tcp_keepalive_intvl = 75

OS X 10.10 / Yosemite (this is clearly in microseconds):

$ sysctl net.inet.tcp.keepidle
net.inet.tcp.keepidle: 7200000
$ sysctl net.inet.tcp.keepintvl
net.inet.tcp.keepintvl: 75000




> On May 15, 2015, at 8:42 AM, Ralph Castain <r...@open-mpi.org> wrote:
> 
> Did some more digging, and it turns out that Linux specifies the keep alive 
> time interval in seconds - and Mac (for some strange reason) uses 
> milliseconds. Hence the difference in behavior.
> 
> So I could replace the current commit with one that multiplies the keep alive 
> interval by 1000x if we are on a Mac. However, we don't really need keep 
> alive at all on the Mac, so I'm wondering if we shouldn't just leave it 
> turned off?
> 
> I confess I don't care either way
> Ralph
> 
> 
> On Thu, May 14, 2015 at 10:46 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
> In the worst case, i.e. no other solution is possible, OS X can be identified 
> by the existence of the macro __APPLE__. There is no need to have 
> OPAL_HAVE_MAC.
> 
>   George.
> 
> On Thu, May 14, 2015 at 11:12 PM, Ralph Castain <r...@open-mpi.org> wrote:
> Interesting - as I said, I'll take a look. In either case, the keep alive on 
> the Mac is unnecessary as it is always a standalone scenario - no value in 
> running it. So the "fix" does no harm and just saves some useless overhead.
> 
> 
> On Thu, May 14, 2015 at 9:00 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
> I'm sorry Ralph what you proposed is not really a fix. My comment is based on 
> a real execution of exactly the command you provided with lldb attached to 
> the process. What I see is millions of 
> OBJ_NEW(mca_oob_tcp_pending_connection_t) because the EAGAIN is not correctly 
> handled.
> 
>   George.
> 
> 
> On Thu, May 14, 2015 at 10:56 PM, Ralph Castain <r...@open-mpi.org> wrote:
> Yes - this is the fix for that issue
> 
> 
> On Thu, May 14, 2015 at 8:54 PM, Howard Pritchard <hpprit...@gmail.com> wrote:
> Is this by any chance associated with issue 579?
> 
> 
> 2015-05-14 20:49 GMT-06:00 Ralph Castain <r...@open-mpi.org>:
> I'll look at the lines you cite, but that clearly isn't the problem we are 
> seeing here. I can verify that because the test case:
> 
> mpirun -n 1 sleep 1000
> 
> does not open up any connections at all. Thus, the use-case you describe 
> never occurs - yet we still blow up in memory. If I simply tell the OOB not 
> to set keep alive, the problem goes away.
> 
> It only happens on Mac, and we never see Mac based clusters, so turning off 
> keep alive on the Mac seems a pretty simple solution.
> 
> 
> On Thu, May 14, 2015 at 8:43 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
> Ralph,
> 
> The code pushed in g8e30579 is clearly not the right solution.
> 
> The problem starts in oob_tcp_listener.c line 742. A new 
> mca_oob_tcp_pending_connection_t object is allocated to store the incoming 
> connection. The accept few lines below fails with an error code of 0x23 which 
> means "resource temporary unavailable" on OS X (i.e. EAGAIN). Thus, the if at 
> line 750 is skipped, and we reach line 763 (a "continue") with 1) a 
> connection not accepted, and 2) an allocated object not release. Voila!
> 
> Freeing the pending_connection object is not the right approach either, as it 
> will only remove the memory leak but the process will become a CPU hog.
> 
>   Thanks,
>     George.
> 
> 
> 
> 
> On Thu, May 14, 2015 at 8:10 PM, <git...@crest.iu.edu> wrote:
> This is an automated email from the git hooks/post-receive script. It was
> generated because a ref change was pushed to the repository containing
> the project "open-mpi/ompi".
> 
> The branch, master has been updated
>        via  8e30579e6efab580cf9cf1bec8f8df1376b7e9ef (commit)
>       from  1488e82efd1d09c30ba46dfa00b89e623623272f (commit)
> 
> Those revisions listed above that are new to this repository have
> not appeared on any other notification email; so we list those
> revisions in full, below.
> 
> - Log -----------------------------------------------------------------
> https://github.com/open-mpi/ompi/commit/8e30579e6efab580cf9cf1bec8f8df1376b7e9ef
> 
> commit 8e30579e6efab580cf9cf1bec8f8df1376b7e9ef
> Author: Ralph Castain <r...@open-mpi.org>
> Date:   Thu May 14 18:09:13 2015 -0600
> 
>     The Mac appears to have problems with the keepalive support - once 
> keepalive starts, the memory footprint soars. So disable keepalive on the Mac
> 
> diff --git a/config/opal_check_os_flavors.m4 b/config/opal_check_os_flavors.m4
> index d1d124d..4939560 100644
> --- a/config/opal_check_os_flavors.m4
> +++ b/config/opal_check_os_flavors.m4
> @@ -57,6 +57,12 @@ AC_DEFUN([OPAL_CHECK_OS_FLAVORS],
>                         [$opal_have_solaris],
>                         [Whether or not we have solaris])
> 
> +    AS_IF([test "$opal_found_apple" = "yes"],
> +          [opal_have_mac=1], [opal_have_mac=0])
> +    AC_DEFINE_UNQUOTED([OPAL_HAVE_MAC],
> +                       [$opal_have_mac],
> +                       [Whether or not we are on a Mac])
> +
>      # check for sockaddr_in (a good sign we have TCP)
>      AC_CHECK_HEADERS([netdb.h netinet/in.h netinet/tcp.h])
>      AC_CHECK_TYPES([struct sockaddr_in],
> diff --git a/orte/mca/oob/tcp/oob_tcp_common.c 
> b/orte/mca/oob/tcp/oob_tcp_common.c
> index a768472..e3decf2 100644
> --- a/orte/mca/oob/tcp/oob_tcp_common.c
> +++ b/orte/mca/oob/tcp/oob_tcp_common.c
> @@ -72,7 +72,7 @@
>  /**
>   * Set socket buffering
>   */
> -
> +#if defined(SO_KEEPALIVE) && !OPAL_HAVE_MAC
>  static void set_keepalive(int sd)
>  {
>      int option;
> @@ -146,6 +146,7 @@ static void set_keepalive(int sd)
>      }
>  #endif  // TCP_KEEPCNT
>  }
> +#endif //SO_KEEPALIVE
> 
>  void orte_oob_tcp_set_socket_options(int sd)
>  {
> @@ -181,7 +182,7 @@ void orte_oob_tcp_set_socket_options(int sd)
>                              opal_socket_errno);
>      }
>  #endif
> -#if defined(SO_KEEPALIVE)
> +#if defined(SO_KEEPALIVE) && !OPAL_HAVE_MAC
>      if (0 < mca_oob_tcp_component.keepalive_time) {
>          set_keepalive(sd);
>      }
> diff --git a/orte/mca/oob/tcp/oob_tcp_component.c 
> b/orte/mca/oob/tcp/oob_tcp_component.c
> index dd1af2a..372ed4c 100644
> --- a/orte/mca/oob/tcp/oob_tcp_component.c
> +++ b/orte/mca/oob/tcp/oob_tcp_component.c
> @@ -404,7 +404,7 @@ static int tcp_component_register(void)
>                                            
> &mca_oob_tcp_component.disable_ipv6_family);
>  #endif
> 
> -
> +#if !OPAL_HAVE_MAC
>      mca_oob_tcp_component.keepalive_time = 10;
>      (void)mca_base_component_var_register(component, "keepalive_time",
>                                            "Idle time in seconds before 
> starting to send keepalives (num <= 0 ----> disable keepalive)",
> @@ -427,7 +427,8 @@ static int tcp_component_register(void)
>                                            OPAL_INFO_LVL_9,
>                                            MCA_BASE_VAR_SCOPE_READONLY,
>                                            
> &mca_oob_tcp_component.keepalive_probes);
> -
> +#endif
> +
>      mca_oob_tcp_component.retry_delay = 0;
>      (void)mca_base_component_var_register(component, "retry_delay",
>                                            "Time (in sec) to wait before 
> trying to connect to peer again",
> 
> 
> -----------------------------------------------------------------------
> 
> Summary of changes:
>  config/opal_check_os_flavors.m4      | 6 ++++++
>  orte/mca/oob/tcp/oob_tcp_common.c    | 5 +++--
>  orte/mca/oob/tcp/oob_tcp_component.c | 5 +++--
>  3 files changed, 12 insertions(+), 4 deletions(-)
> 
> 
> hooks/post-receive
> --
> open-mpi/ompi
> _______________________________________________
> ompi-commits mailing list
> ompi-comm...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/ompi-commits
> 
> 
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2015/05/17401.php
> 
> 
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2015/05/17402.php
> 
> 
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2015/05/17403.php
> 
> 
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2015/05/17404.php
> 
> 
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2015/05/17405.php
> 
> 
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2015/05/17406.php
> 
> 
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2015/05/17407.php
> 
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2015/05/17408.php


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/

Reply via email to