Good catch.

I'd vote for the same behavior on OS X, even if it's somewhat unnecessary. I.e., 
use keep alive, but with 1000x the value.

Sent from my phone. No type good.

On May 15, 2015, at 5:42 AM, Ralph Castain 
<r...@open-mpi.org<mailto:r...@open-mpi.org>> wrote:

Did some more digging, and it turns out that Linux specifies the keep alive 
time interval in seconds - and Mac (for some strange reason) uses milliseconds. 
Hence the difference in behavior.

So I could replace the current commit with one that multiplies the keep alive 
interval by 1000 if we are on a Mac. However, we don't really need keep alive 
at all on the Mac, so I'm wondering if we shouldn't just leave it turned off?
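
A minimal sketch of the scaling option, assuming (as stated in this thread, not verified here) that OS X expects the keep alive interval in milliseconds while Linux expects seconds. The helper name scaled_keepalive_interval is hypothetical and not part of Open MPI.

```c
/* Hypothetical helper, not part of Open MPI: convert a keep alive
 * interval given in seconds into the unit the platform is assumed
 * to expect. The "milliseconds on OS X" assumption comes from this
 * thread and is not verified here. */
static int scaled_keepalive_interval(int seconds)
{
#if defined(__APPLE__)
    return seconds * 1000;  /* assumed: OS X wants milliseconds */
#else
    return seconds;         /* Linux takes seconds */
#endif
}
```

The result would then be passed to whichever socket option carries the idle time, instead of the raw value in seconds.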

I confess I don't care either way
Ralph


On Thu, May 14, 2015 at 10:46 PM, George Bosilca 
<bosi...@icl.utk.edu<mailto:bosi...@icl.utk.edu>> wrote:
In the worst case, i.e. no other solution is possible, OS X can be identified 
by the existence of the macro __APPLE__. There is no need to have OPAL_HAVE_MAC.
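
As a rough illustration of this suggestion, the guard from the commit could be written against the compiler-provided __APPLE__ macro rather than a configure-time OPAL_HAVE_MAC define. The function name keepalive_compiled_in is hypothetical and only exists to make the guard observable.

```c
#include <sys/socket.h>

/* Hypothetical probe, not part of Open MPI: reports whether the
 * keep alive code path would be compiled in on this platform when
 * the guard uses __APPLE__ instead of OPAL_HAVE_MAC. */
static int keepalive_compiled_in(void)
{
#if defined(SO_KEEPALIVE) && !defined(__APPLE__)
    return 1;   /* keep alive support is built */
#else
    return 0;   /* OS X, or no SO_KEEPALIVE: support is skipped */
#endif
}
```

Because __APPLE__ is predefined by the compilers used on OS X, no extra m4 check is needed for this form of the test.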

  George.

On Thu, May 14, 2015 at 11:12 PM, Ralph Castain 
<r...@open-mpi.org<mailto:r...@open-mpi.org>> wrote:
Interesting - as I said, I'll take a look. In either case, the keep alive on 
the Mac is unnecessary as it is always a standalone scenario - no value in 
running it. So the "fix" does no harm and just saves some useless overhead.


On Thu, May 14, 2015 at 9:00 PM, George Bosilca 
<bosi...@icl.utk.edu<mailto:bosi...@icl.utk.edu>> wrote:
I'm sorry Ralph, but what you proposed is not really a fix. My comment is based on a 
real execution of exactly the command you provided with lldb attached to the 
process. What I see is millions of OBJ_NEW(mca_oob_tcp_pending_connection_t) 
because the EAGAIN is not correctly handled.

  George.


On Thu, May 14, 2015 at 10:56 PM, Ralph Castain 
<r...@open-mpi.org<mailto:r...@open-mpi.org>> wrote:
Yes - this is the fix for that issue


On Thu, May 14, 2015 at 8:54 PM, Howard Pritchard 
<hpprit...@gmail.com<mailto:hpprit...@gmail.com>> wrote:
Is this by any chance associated with issue 579?


2015-05-14 20:49 GMT-06:00 Ralph Castain 
<r...@open-mpi.org<mailto:r...@open-mpi.org>>:
I'll look at the lines you cite, but that clearly isn't the problem we are 
seeing here. I can verify that because the test case:

mpirun -n 1 sleep 1000

does not open up any connections at all. Thus, the use-case you describe never 
occurs - yet we still blow up in memory. If I simply tell the OOB not to set 
keep alive, the problem goes away.
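
A minimal sketch of what "not setting keep alive" amounts to at the socket level, assuming the same convention as the keepalive_time parameter in the commit (a value <= 0 disables keep alive). The helper maybe_enable_keepalive is hypothetical and is not the OOB code itself.

```c
#include <sys/socket.h>

/* Hypothetical helper: turn on SO_KEEPALIVE only when the configured
 * idle time is positive. Returns 1 if keep alive was enabled, 0 if it
 * was left off, and -1 if setsockopt() failed. */
static int maybe_enable_keepalive(int sd, int keepalive_time)
{
    int option = 1;

    if (keepalive_time <= 0) {
        return 0;  /* disabled by configuration: leave the socket alone */
    }
    if (setsockopt(sd, SOL_SOCKET, SO_KEEPALIVE, &option, sizeof(option)) < 0) {
        return -1;
    }
    return 1;
}
```

Calling it with 0 leaves the socket untouched, which is the behavior the commit hard-codes for the Mac.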

It only happens on Mac, and we never see Mac-based clusters, so turning off 
keep alive on the Mac seems a pretty simple solution.


On Thu, May 14, 2015 at 8:43 PM, George Bosilca 
<bosi...@icl.utk.edu<mailto:bosi...@icl.utk.edu>> wrote:
Ralph,

The code pushed in g8e30579 is clearly not the right solution.

The problem starts in oob_tcp_listener.c line 742. A new 
mca_oob_tcp_pending_connection_t object is allocated to store the incoming 
connection. The accept a few lines below fails with error code 0x23, which 
means "resource temporarily unavailable" on OS X (i.e. EAGAIN). Thus, the if at 
line 750 is skipped, and we reach line 763 (a "continue") with 1) a connection 
not accepted, and 2) an allocated object not released. Voila!

Freeing the pending_connection object is not the right approach either, as it 
will only remove the memory leak but the process will become a CPU hog.
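
The failure mode described here can be sketched outside Open MPI as follows. This is an illustration under stated assumptions, not the fix that was adopted: pending_conn_t is a hypothetical stand-in for mca_oob_tcp_pending_connection_t, and leaving the loop on EAGAIN (so the caller goes back to waiting for the socket to become readable) is one assumed way to avoid both the leak and the busy loop.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical stand-in for mca_oob_tcp_pending_connection_t. */
typedef struct {
    int fd;
    struct sockaddr_storage addr;
} pending_conn_t;

/* Accept everything currently queued on a non-blocking listening socket.
 * Returns the number of connections accepted, or -1 on an unexpected
 * error. Accepted sockets are closed at once; real code would hand
 * them to a handler instead. */
static int drain_accept_queue(int listen_fd)
{
    int accepted = 0;

    for (;;) {
        pending_conn_t *pending = calloc(1, sizeof(*pending));
        if (NULL == pending) {
            return -1;
        }
        socklen_t len = sizeof(pending->addr);
        pending->fd = accept(listen_fd, (struct sockaddr *)&pending->addr, &len);
        if (pending->fd < 0) {
            int err = errno;
            free(pending);               /* release on every failure path */
            if (EAGAIN == err || EWOULDBLOCK == err) {
                return accepted;         /* queue empty: stop, do not spin */
            }
            if (EINTR == err) {
                continue;                /* interrupted: retry the accept */
            }
            return -1;
        }
        close(pending->fd);
        free(pending);
        accepted++;
    }
}

/* Helper for trying the sketch: a non-blocking TCP listener on an
 * ephemeral port. Returns the descriptor, or -1 on failure. */
static int make_nonblocking_listener(void)
{
    struct sockaddr_in sin;
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        return -1;
    }
    memset(&sin, 0, sizeof(sin));
    sin.sin_family = AF_INET;
    sin.sin_addr.s_addr = htonl(INADDR_ANY);
    sin.sin_port = 0;                    /* let the kernel pick a port */
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0 ||
        bind(fd, (struct sockaddr *)&sin, sizeof(sin)) < 0 ||
        listen(fd, 8) < 0 ||
        fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

With no client connected, drain_accept_queue() on such a listener sees EAGAIN on the first accept, frees the pending object, and returns 0 rather than looping.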

  Thanks,
    George.




On Thu, May 14, 2015 at 8:10 PM, 
<git...@crest.iu.edu<mailto:git...@crest.iu.edu>> wrote:
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "open-mpi/ompi".

The branch, master has been updated
       via  8e30579e6efab580cf9cf1bec8f8df1376b7e9ef (commit)
      from  1488e82efd1d09c30ba46dfa00b89e623623272f (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.

- Log -----------------------------------------------------------------
https://github.com/open-mpi/ompi/commit/8e30579e6efab580cf9cf1bec8f8df1376b7e9ef

commit 8e30579e6efab580cf9cf1bec8f8df1376b7e9ef
Author: Ralph Castain <r...@open-mpi.org<mailto:r...@open-mpi.org>>
Date:   Thu May 14 18:09:13 2015 -0600

    The Mac appears to have problems with the keepalive support - once keepalive starts, the memory footprint soars. So disable keepalive on the Mac

diff --git a/config/opal_check_os_flavors.m4 b/config/opal_check_os_flavors.m4
index d1d124d..4939560 100644
--- a/config/opal_check_os_flavors.m4
+++ b/config/opal_check_os_flavors.m4
@@ -57,6 +57,12 @@ AC_DEFUN([OPAL_CHECK_OS_FLAVORS],
                        [$opal_have_solaris],
                        [Whether or not we have solaris])

+    AS_IF([test "$opal_found_apple" = "yes"],
+          [opal_have_mac=1], [opal_have_mac=0])
+    AC_DEFINE_UNQUOTED([OPAL_HAVE_MAC],
+                       [$opal_have_mac],
+                       [Whether or not we are on a Mac])
+
     # check for sockaddr_in (a good sign we have TCP)
     AC_CHECK_HEADERS([netdb.h netinet/in.h netinet/tcp.h])
     AC_CHECK_TYPES([struct sockaddr_in],
diff --git a/orte/mca/oob/tcp/oob_tcp_common.c b/orte/mca/oob/tcp/oob_tcp_common.c
index a768472..e3decf2 100644
--- a/orte/mca/oob/tcp/oob_tcp_common.c
+++ b/orte/mca/oob/tcp/oob_tcp_common.c
@@ -72,7 +72,7 @@
 /**
  * Set socket buffering
  */
-
+#if defined(SO_KEEPALIVE) && !OPAL_HAVE_MAC
 static void set_keepalive(int sd)
 {
     int option;
@@ -146,6 +146,7 @@ static void set_keepalive(int sd)
     }
 #endif  // TCP_KEEPCNT
 }
+#endif //SO_KEEPALIVE

 void orte_oob_tcp_set_socket_options(int sd)
 {
@@ -181,7 +182,7 @@ void orte_oob_tcp_set_socket_options(int sd)
                             opal_socket_errno);
     }
 #endif
-#if defined(SO_KEEPALIVE)
+#if defined(SO_KEEPALIVE) && !OPAL_HAVE_MAC
     if (0 < mca_oob_tcp_component.keepalive_time) {
         set_keepalive(sd);
     }
diff --git a/orte/mca/oob/tcp/oob_tcp_component.c b/orte/mca/oob/tcp/oob_tcp_component.c
index dd1af2a..372ed4c 100644
--- a/orte/mca/oob/tcp/oob_tcp_component.c
+++ b/orte/mca/oob/tcp/oob_tcp_component.c
@@ -404,7 +404,7 @@ static int tcp_component_register(void)
                                           &mca_oob_tcp_component.disable_ipv6_family);
 #endif

-
+#if !OPAL_HAVE_MAC
     mca_oob_tcp_component.keepalive_time = 10;
     (void)mca_base_component_var_register(component, "keepalive_time",
                                           "Idle time in seconds before starting to send keepalives (num <= 0 ----> disable keepalive)",
@@ -427,7 +427,8 @@ static int tcp_component_register(void)
                                           OPAL_INFO_LVL_9,
                                           MCA_BASE_VAR_SCOPE_READONLY,
                                           &mca_oob_tcp_component.keepalive_probes);
-
+#endif
+
     mca_oob_tcp_component.retry_delay = 0;
     (void)mca_base_component_var_register(component, "retry_delay",
                                           "Time (in sec) to wait before trying to connect to peer again",


-----------------------------------------------------------------------

Summary of changes:
 config/opal_check_os_flavors.m4      | 6 ++++++
 orte/mca/oob/tcp/oob_tcp_common.c    | 5 +++--
 orte/mca/oob/tcp/oob_tcp_component.c | 5 +++--
 3 files changed, 12 insertions(+), 4 deletions(-)


hooks/post-receive
--
open-mpi/ompi
_______________________________________________
ompi-commits mailing list
ompi-comm...@open-mpi.org<mailto:ompi-comm...@open-mpi.org>
http://www.open-mpi.org/mailman/listinfo.cgi/ompi-commits


_______________________________________________
devel mailing list
de...@open-mpi.org<mailto:de...@open-mpi.org>
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
Link to this post: 
http://www.open-mpi.org/community/lists/devel/2015/05/17401.php


