Did some more digging, and it turns out that Linux specifies the keep alive time interval in seconds - and Mac (for some strange reason) uses milliseconds. Hence the difference in behavior.
So I could replace the current commit with one that multiplies the keep alive interval by 1000x if we are on a Mac. However, we don't really need keep alive at all on the Mac, so I'm wondering if we shouldn't just leave it turned off? I confess I don't care either way Ralph On Thu, May 14, 2015 at 10:46 PM, George Bosilca <bosi...@icl.utk.edu> wrote: > In the worst case, i.e. no other solution is possible, OS X can be > identified by the existence of the macro __APPLE__. There is no need to > have OPAL_HAVE_MAC. > > George. > > On Thu, May 14, 2015 at 11:12 PM, Ralph Castain <r...@open-mpi.org> wrote: > >> Interesting - as I said, I'll take a look. In either case, the keep alive >> on the Mac is unnecessary as it is always a standalone scenario - no value >> in running it. So the "fix" does no harm and just saves some useless >> overhead. >> >> >> On Thu, May 14, 2015 at 9:00 PM, George Bosilca <bosi...@icl.utk.edu> >> wrote: >> >>> I'm sorry Ralph what you proposed is not really a fix. My comment is >>> based on a real execution of exactly the command you provided with lldb >>> attached to the process. What I see is millions of OBJ_NEW( >>> mca_oob_tcp_pending_connection_t) because the EAGAIN is not correctly >>> handled. >>> >>> George. >>> >>> >>> On Thu, May 14, 2015 at 10:56 PM, Ralph Castain <r...@open-mpi.org> >>> wrote: >>> >>>> Yes - this is the fix for that issue >>>> >>>> >>>> On Thu, May 14, 2015 at 8:54 PM, Howard Pritchard <hpprit...@gmail.com> >>>> wrote: >>>> >>>>> Is this by any chance associated with issue 579? >>>>> >>>>> >>>>> 2015-05-14 20:49 GMT-06:00 Ralph Castain <r...@open-mpi.org>: >>>>> >>>>>> I'll look at the lines you cite, but that clearly isn't the problem >>>>>> we are seeing here. I can verify that because the test case: >>>>>> >>>>>> mpirun -n 1 sleep 1000 >>>>>> >>>>>> does not open up any connections at all. Thus, the use-case you >>>>>> describe never occurs - yet we still blow up in memory. If I simply tell >>>>>> the OOB not to set keep alive, the problem goes away. >>>>>> >>>>>> It only happens on Mac, and we never see Mac based clusters, so >>>>>> turning off keep alive on the Mac seems a pretty simple solution. >>>>>> >>>>>> >>>>>> On Thu, May 14, 2015 at 8:43 PM, George Bosilca <bosi...@icl.utk.edu> >>>>>> wrote: >>>>>> >>>>>>> Ralph, >>>>>>> >>>>>>> The code pushed in g8e30579 is clearly not the right solution. >>>>>>> >>>>>>> The problem starts in oob_tcp_listener.c line 742. A new >>>>>>> mca_oob_tcp_pending_connection_t object is allocated to store the >>>>>>> incoming >>>>>>> connection. The accept few lines below fails with an error code of 0x23 >>>>>>> which means "resource temporary unavailable" on OS X (i.e. EAGAIN). >>>>>>> Thus, >>>>>>> the if at line 750 is skipped, and we reach line 763 (a "continue") >>>>>>> with 1) >>>>>>> a connection not accepted, and 2) an allocated object not release. >>>>>>> Voila! >>>>>>> >>>>>>> Freeing the pending_connection object is not the right approach >>>>>>> either, as it will only remove the memory leak but the process will >>>>>>> become >>>>>>> a CPU hog. >>>>>>> >>>>>>> Thanks, >>>>>>> George. >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Thu, May 14, 2015 at 8:10 PM, <git...@crest.iu.edu> wrote: >>>>>>> >>>>>>>> This is an automated email from the git hooks/post-receive script. >>>>>>>> It was >>>>>>>> generated because a ref change was pushed to the repository >>>>>>>> containing >>>>>>>> the project "open-mpi/ompi". >>>>>>>> >>>>>>>> The branch, master has been updated >>>>>>>> via 8e30579e6efab580cf9cf1bec8f8df1376b7e9ef (commit) >>>>>>>> from 1488e82efd1d09c30ba46dfa00b89e623623272f (commit) >>>>>>>> >>>>>>>> Those revisions listed above that are new to this repository have >>>>>>>> not appeared on any other notification email; so we list those >>>>>>>> revisions in full, below. >>>>>>>> >>>>>>>> - Log >>>>>>>> ----------------------------------------------------------------- >>>>>>>> >>>>>>>> https://github.com/open-mpi/ompi/commit/8e30579e6efab580cf9cf1bec8f8df1376b7e9ef >>>>>>>> >>>>>>>> commit 8e30579e6efab580cf9cf1bec8f8df1376b7e9ef >>>>>>>> Author: Ralph Castain <r...@open-mpi.org> >>>>>>>> Date: Thu May 14 18:09:13 2015 -0600 >>>>>>>> >>>>>>>> The Mac appears to have problems with the keepalive support - >>>>>>>> once keepalive starts, the memory footprint soars. So disable >>>>>>>> keepalive on >>>>>>>> the Mac >>>>>>>> >>>>>>>> diff --git a/config/opal_check_os_flavors.m4 >>>>>>>> b/config/opal_check_os_flavors.m4 >>>>>>>> index d1d124d..4939560 100644 >>>>>>>> --- a/config/opal_check_os_flavors.m4 >>>>>>>> +++ b/config/opal_check_os_flavors.m4 >>>>>>>> @@ -57,6 +57,12 @@ AC_DEFUN([OPAL_CHECK_OS_FLAVORS], >>>>>>>> [$opal_have_solaris], >>>>>>>> [Whether or not we have solaris]) >>>>>>>> >>>>>>>> + AS_IF([test "$opal_found_apple" = "yes"], >>>>>>>> + [opal_have_mac=1], [opal_have_mac=0]) >>>>>>>> + AC_DEFINE_UNQUOTED([OPAL_HAVE_MAC], >>>>>>>> + [$opal_have_mac], >>>>>>>> + [Whether or not we are on a Mac]) >>>>>>>> + >>>>>>>> # check for sockaddr_in (a good sign we have TCP) >>>>>>>> AC_CHECK_HEADERS([netdb.h netinet/in.h netinet/tcp.h]) >>>>>>>> AC_CHECK_TYPES([struct sockaddr_in], >>>>>>>> diff --git a/orte/mca/oob/tcp/oob_tcp_common.c >>>>>>>> b/orte/mca/oob/tcp/oob_tcp_common.c >>>>>>>> index a768472..e3decf2 100644 >>>>>>>> --- a/orte/mca/oob/tcp/oob_tcp_common.c >>>>>>>> +++ b/orte/mca/oob/tcp/oob_tcp_common.c >>>>>>>> @@ -72,7 +72,7 @@ >>>>>>>> /** >>>>>>>> * Set socket buffering >>>>>>>> */ >>>>>>>> - >>>>>>>> +#if defined(SO_KEEPALIVE) && !OPAL_HAVE_MAC >>>>>>>> static void set_keepalive(int sd) >>>>>>>> { >>>>>>>> int option; >>>>>>>> @@ -146,6 +146,7 @@ static void set_keepalive(int sd) >>>>>>>> } >>>>>>>> #endif // TCP_KEEPCNT >>>>>>>> } >>>>>>>> +#endif //SO_KEEPALIVE >>>>>>>> >>>>>>>> void orte_oob_tcp_set_socket_options(int sd) >>>>>>>> { >>>>>>>> @@ -181,7 +182,7 @@ void orte_oob_tcp_set_socket_options(int sd) >>>>>>>> opal_socket_errno); >>>>>>>> } >>>>>>>> #endif >>>>>>>> -#if defined(SO_KEEPALIVE) >>>>>>>> +#if defined(SO_KEEPALIVE) && !OPAL_HAVE_MAC >>>>>>>> if (0 < mca_oob_tcp_component.keepalive_time) { >>>>>>>> set_keepalive(sd); >>>>>>>> } >>>>>>>> diff --git a/orte/mca/oob/tcp/oob_tcp_component.c >>>>>>>> b/orte/mca/oob/tcp/oob_tcp_component.c >>>>>>>> index dd1af2a..372ed4c 100644 >>>>>>>> --- a/orte/mca/oob/tcp/oob_tcp_component.c >>>>>>>> +++ b/orte/mca/oob/tcp/oob_tcp_component.c >>>>>>>> @@ -404,7 +404,7 @@ static int tcp_component_register(void) >>>>>>>> >>>>>>>> &mca_oob_tcp_component.disable_ipv6_family); >>>>>>>> #endif >>>>>>>> >>>>>>>> - >>>>>>>> +#if !OPAL_HAVE_MAC >>>>>>>> mca_oob_tcp_component.keepalive_time = 10; >>>>>>>> (void)mca_base_component_var_register(component, >>>>>>>> "keepalive_time", >>>>>>>> "Idle time in seconds >>>>>>>> before starting to send keepalives (num <= 0 ----> disable keepalive)", >>>>>>>> @@ -427,7 +427,8 @@ static int tcp_component_register(void) >>>>>>>> OPAL_INFO_LVL_9, >>>>>>>> >>>>>>>> MCA_BASE_VAR_SCOPE_READONLY, >>>>>>>> >>>>>>>> &mca_oob_tcp_component.keepalive_probes); >>>>>>>> - >>>>>>>> +#endif >>>>>>>> + >>>>>>>> mca_oob_tcp_component.retry_delay = 0; >>>>>>>> (void)mca_base_component_var_register(component, "retry_delay", >>>>>>>> "Time (in sec) to wait >>>>>>>> before trying to connect to peer again", >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> ----------------------------------------------------------------------- >>>>>>>> >>>>>>>> Summary of changes: >>>>>>>> config/opal_check_os_flavors.m4 | 6 ++++++ >>>>>>>> orte/mca/oob/tcp/oob_tcp_common.c | 5 +++-- >>>>>>>> orte/mca/oob/tcp/oob_tcp_component.c | 5 +++-- >>>>>>>> 3 files changed, 12 insertions(+), 4 deletions(-) >>>>>>>> >>>>>>>> >>>>>>>> hooks/post-receive >>>>>>>> -- >>>>>>>> open-mpi/ompi >>>>>>>> _______________________________________________ >>>>>>>> ompi-commits mailing list >>>>>>>> ompi-comm...@open-mpi.org >>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/ompi-commits >>>>>>>> >>>>>>> >>>>>>> >>>>>>> _______________________________________________ >>>>>>> devel mailing list >>>>>>> de...@open-mpi.org >>>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel >>>>>>> Link to this post: >>>>>>> http://www.open-mpi.org/community/lists/devel/2015/05/17401.php >>>>>>> >>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> devel mailing list >>>>>> de...@open-mpi.org >>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel >>>>>> Link to this post: >>>>>> http://www.open-mpi.org/community/lists/devel/2015/05/17402.php >>>>>> >>>>> >>>>> >>>>> _______________________________________________ >>>>> devel mailing list >>>>> de...@open-mpi.org >>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel >>>>> Link to this post: >>>>> http://www.open-mpi.org/community/lists/devel/2015/05/17403.php >>>>> >>>> >>>> >>>> _______________________________________________ >>>> devel mailing list >>>> de...@open-mpi.org >>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel >>>> Link to this post: >>>> http://www.open-mpi.org/community/lists/devel/2015/05/17404.php >>>> >>> >>> >>> _______________________________________________ >>> devel mailing list >>> de...@open-mpi.org >>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel >>> Link to this post: >>> http://www.open-mpi.org/community/lists/devel/2015/05/17405.php >>> >> >> >> _______________________________________________ >> devel mailing list >> de...@open-mpi.org >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel >> Link to this post: >> http://www.open-mpi.org/community/lists/devel/2015/05/17406.php >> > > > _______________________________________________ > devel mailing list > de...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > Link to this post: > http://www.open-mpi.org/community/lists/devel/2015/05/17407.php >