In the worst case, i.e. no other solution is possible, OS X can be identified by the existence of the macro __APPLE__. There is no need to have OPAL_HAVE_MAC.
George. On Thu, May 14, 2015 at 11:12 PM, Ralph Castain <r...@open-mpi.org> wrote: > Interesting - as I said, I'll take a look. In either case, the keep alive > on the Mac is unnecessary as it is always a standalone scenario - no value > in running it. So the "fix" does no harm and just saves some useless > overhead. > > > On Thu, May 14, 2015 at 9:00 PM, George Bosilca <bosi...@icl.utk.edu> > wrote: > >> I'm sorry Ralph what you proposed is not really a fix. My comment is >> based on a real execution of exactly the command you provided with lldb >> attached to the process. What I see is millions of OBJ_NEW( >> mca_oob_tcp_pending_connection_t) because the EAGAIN is not correctly >> handled. >> >> George. >> >> >> On Thu, May 14, 2015 at 10:56 PM, Ralph Castain <r...@open-mpi.org> wrote: >> >>> Yes - this is the fix for that issue >>> >>> >>> On Thu, May 14, 2015 at 8:54 PM, Howard Pritchard <hpprit...@gmail.com> >>> wrote: >>> >>>> Is this by any chance associated with issue 579? >>>> >>>> >>>> 2015-05-14 20:49 GMT-06:00 Ralph Castain <r...@open-mpi.org>: >>>> >>>>> I'll look at the lines you cite, but that clearly isn't the problem we >>>>> are seeing here. I can verify that because the test case: >>>>> >>>>> mpirun -n 1 sleep 1000 >>>>> >>>>> does not open up any connections at all. Thus, the use-case you >>>>> describe never occurs - yet we still blow up in memory. If I simply tell >>>>> the OOB not to set keep alive, the problem goes away. >>>>> >>>>> It only happens on Mac, and we never see Mac based clusters, so >>>>> turning off keep alive on the Mac seems a pretty simple solution. >>>>> >>>>> >>>>> On Thu, May 14, 2015 at 8:43 PM, George Bosilca <bosi...@icl.utk.edu> >>>>> wrote: >>>>> >>>>>> Ralph, >>>>>> >>>>>> The code pushed in g8e30579 is clearly not the right solution. >>>>>> >>>>>> The problem starts in oob_tcp_listener.c line 742. A new >>>>>> mca_oob_tcp_pending_connection_t object is allocated to store the >>>>>> incoming >>>>>> connection. The accept few lines below fails with an error code of 0x23 >>>>>> which means "resource temporary unavailable" on OS X (i.e. EAGAIN). Thus, >>>>>> the if at line 750 is skipped, and we reach line 763 (a "continue") with >>>>>> 1) >>>>>> a connection not accepted, and 2) an allocated object not release. Voila! >>>>>> >>>>>> Freeing the pending_connection object is not the right approach >>>>>> either, as it will only remove the memory leak but the process will >>>>>> become >>>>>> a CPU hog. >>>>>> >>>>>> Thanks, >>>>>> George. >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> On Thu, May 14, 2015 at 8:10 PM, <git...@crest.iu.edu> wrote: >>>>>> >>>>>>> This is an automated email from the git hooks/post-receive script. >>>>>>> It was >>>>>>> generated because a ref change was pushed to the repository >>>>>>> containing >>>>>>> the project "open-mpi/ompi". >>>>>>> >>>>>>> The branch, master has been updated >>>>>>> via 8e30579e6efab580cf9cf1bec8f8df1376b7e9ef (commit) >>>>>>> from 1488e82efd1d09c30ba46dfa00b89e623623272f (commit) >>>>>>> >>>>>>> Those revisions listed above that are new to this repository have >>>>>>> not appeared on any other notification email; so we list those >>>>>>> revisions in full, below. >>>>>>> >>>>>>> - Log >>>>>>> ----------------------------------------------------------------- >>>>>>> >>>>>>> https://github.com/open-mpi/ompi/commit/8e30579e6efab580cf9cf1bec8f8df1376b7e9ef >>>>>>> >>>>>>> commit 8e30579e6efab580cf9cf1bec8f8df1376b7e9ef >>>>>>> Author: Ralph Castain <r...@open-mpi.org> >>>>>>> Date: Thu May 14 18:09:13 2015 -0600 >>>>>>> >>>>>>> The Mac appears to have problems with the keepalive support - >>>>>>> once keepalive starts, the memory footprint soars. So disable keepalive >>>>>>> on >>>>>>> the Mac >>>>>>> >>>>>>> diff --git a/config/opal_check_os_flavors.m4 >>>>>>> b/config/opal_check_os_flavors.m4 >>>>>>> index d1d124d..4939560 100644 >>>>>>> --- a/config/opal_check_os_flavors.m4 >>>>>>> +++ b/config/opal_check_os_flavors.m4 >>>>>>> @@ -57,6 +57,12 @@ AC_DEFUN([OPAL_CHECK_OS_FLAVORS], >>>>>>> [$opal_have_solaris], >>>>>>> [Whether or not we have solaris]) >>>>>>> >>>>>>> + AS_IF([test "$opal_found_apple" = "yes"], >>>>>>> + [opal_have_mac=1], [opal_have_mac=0]) >>>>>>> + AC_DEFINE_UNQUOTED([OPAL_HAVE_MAC], >>>>>>> + [$opal_have_mac], >>>>>>> + [Whether or not we are on a Mac]) >>>>>>> + >>>>>>> # check for sockaddr_in (a good sign we have TCP) >>>>>>> AC_CHECK_HEADERS([netdb.h netinet/in.h netinet/tcp.h]) >>>>>>> AC_CHECK_TYPES([struct sockaddr_in], >>>>>>> diff --git a/orte/mca/oob/tcp/oob_tcp_common.c >>>>>>> b/orte/mca/oob/tcp/oob_tcp_common.c >>>>>>> index a768472..e3decf2 100644 >>>>>>> --- a/orte/mca/oob/tcp/oob_tcp_common.c >>>>>>> +++ b/orte/mca/oob/tcp/oob_tcp_common.c >>>>>>> @@ -72,7 +72,7 @@ >>>>>>> /** >>>>>>> * Set socket buffering >>>>>>> */ >>>>>>> - >>>>>>> +#if defined(SO_KEEPALIVE) && !OPAL_HAVE_MAC >>>>>>> static void set_keepalive(int sd) >>>>>>> { >>>>>>> int option; >>>>>>> @@ -146,6 +146,7 @@ static void set_keepalive(int sd) >>>>>>> } >>>>>>> #endif // TCP_KEEPCNT >>>>>>> } >>>>>>> +#endif //SO_KEEPALIVE >>>>>>> >>>>>>> void orte_oob_tcp_set_socket_options(int sd) >>>>>>> { >>>>>>> @@ -181,7 +182,7 @@ void orte_oob_tcp_set_socket_options(int sd) >>>>>>> opal_socket_errno); >>>>>>> } >>>>>>> #endif >>>>>>> -#if defined(SO_KEEPALIVE) >>>>>>> +#if defined(SO_KEEPALIVE) && !OPAL_HAVE_MAC >>>>>>> if (0 < mca_oob_tcp_component.keepalive_time) { >>>>>>> set_keepalive(sd); >>>>>>> } >>>>>>> diff --git a/orte/mca/oob/tcp/oob_tcp_component.c >>>>>>> b/orte/mca/oob/tcp/oob_tcp_component.c >>>>>>> index dd1af2a..372ed4c 100644 >>>>>>> --- a/orte/mca/oob/tcp/oob_tcp_component.c >>>>>>> +++ b/orte/mca/oob/tcp/oob_tcp_component.c >>>>>>> @@ -404,7 +404,7 @@ static int tcp_component_register(void) >>>>>>> >>>>>>> &mca_oob_tcp_component.disable_ipv6_family); >>>>>>> #endif >>>>>>> >>>>>>> - >>>>>>> +#if !OPAL_HAVE_MAC >>>>>>> mca_oob_tcp_component.keepalive_time = 10; >>>>>>> (void)mca_base_component_var_register(component, >>>>>>> "keepalive_time", >>>>>>> "Idle time in seconds >>>>>>> before starting to send keepalives (num <= 0 ----> disable keepalive)", >>>>>>> @@ -427,7 +427,8 @@ static int tcp_component_register(void) >>>>>>> OPAL_INFO_LVL_9, >>>>>>> >>>>>>> MCA_BASE_VAR_SCOPE_READONLY, >>>>>>> >>>>>>> &mca_oob_tcp_component.keepalive_probes); >>>>>>> - >>>>>>> +#endif >>>>>>> + >>>>>>> mca_oob_tcp_component.retry_delay = 0; >>>>>>> (void)mca_base_component_var_register(component, "retry_delay", >>>>>>> "Time (in sec) to wait >>>>>>> before trying to connect to peer again", >>>>>>> >>>>>>> >>>>>>> >>>>>>> ----------------------------------------------------------------------- >>>>>>> >>>>>>> Summary of changes: >>>>>>> config/opal_check_os_flavors.m4 | 6 ++++++ >>>>>>> orte/mca/oob/tcp/oob_tcp_common.c | 5 +++-- >>>>>>> orte/mca/oob/tcp/oob_tcp_component.c | 5 +++-- >>>>>>> 3 files changed, 12 insertions(+), 4 deletions(-) >>>>>>> >>>>>>> >>>>>>> hooks/post-receive >>>>>>> -- >>>>>>> open-mpi/ompi >>>>>>> _______________________________________________ >>>>>>> ompi-commits mailing list >>>>>>> ompi-comm...@open-mpi.org >>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/ompi-commits >>>>>>> >>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> devel mailing list >>>>>> de...@open-mpi.org >>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel >>>>>> Link to this post: >>>>>> http://www.open-mpi.org/community/lists/devel/2015/05/17401.php >>>>>> >>>>> >>>>> >>>>> _______________________________________________ >>>>> devel mailing list >>>>> de...@open-mpi.org >>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel >>>>> Link to this post: >>>>> http://www.open-mpi.org/community/lists/devel/2015/05/17402.php >>>>> >>>> >>>> >>>> _______________________________________________ >>>> devel mailing list >>>> de...@open-mpi.org >>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel >>>> Link to this post: >>>> http://www.open-mpi.org/community/lists/devel/2015/05/17403.php >>>> >>> >>> >>> _______________________________________________ >>> devel mailing list >>> de...@open-mpi.org >>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel >>> Link to this post: >>> http://www.open-mpi.org/community/lists/devel/2015/05/17404.php >>> >> >> >> _______________________________________________ >> devel mailing list >> de...@open-mpi.org >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel >> Link to this post: >> http://www.open-mpi.org/community/lists/devel/2015/05/17405.php >> > > > _______________________________________________ > devel mailing list > de...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > Link to this post: > http://www.open-mpi.org/community/lists/devel/2015/05/17406.php >