> On Jan. 28, 2017, 1:17 a.m., Vinod Kone wrote: > > src/checks/health_checker.cpp, line 480 > > <https://reviews.apache.org/r/55901/diff/1/?file=1613998#file1613998line480> > > > > launch nested container session returns a streaming response, how come > > you are calling `post()` helper here which expects a non-streaming response? > > > > probably one of the reasons why your test is hanging.
The `post()` helper delegates to `process::http::request()`, which takes boolean flag (`streamedResponse`). If this flag is set to `false`, libprocess will convert a PIPE (streaming) response into a BODY (non-streaming) response. This means that the `Future` returned by the helper will not be completed until the server closes the connection. Relevant links: - https://github.com/apache/mesos/blob/2d0195eed54feac41485fb1503dc4004e5500c81/3rdparty/libprocess/include/process/http.hpp#L953-L955 - https://github.com/apache/mesos/blob/2d0195eed54feac41485fb1503dc4004e5500c81/3rdparty/libprocess/src/http.cpp#L1191-L1197 - https://github.com/apache/mesos/blob/2d0195eed54feac41485fb1503dc4004e5500c81/3rdparty/libprocess/src/http.cpp#L1005-L1022 In these particular cases, it means that the `Future` will not be completed until the container exits, which is exactly what we need. Regarding the test hanging in Linux, some further debugging seem to indicate that test hangs on `process::Clock::settle()`, because there's a race + deadlock in `RateLimiter` that leaves a process stuck in `RUNNING`. I'll dig deeper on Monday, but here's some evidence: # Snippet from a successful test run that "leaks" a running process ``` [==========] Running 1 test from 1 test case. [----------] Global test environment set-up. [----------] 1 test from HealthCheckTest [ RUN ] HealthCheckTest.DefaultExecutorCmdHealthCheck [...] E0127 11:18:06.607446 8665 limiter.hpp:123] !!!! LIMITER _acquire E0127 11:18:06.607471 8665 limiter.hpp:129] !!!! LIMITER There are 1 promises E0127 11:18:06.608064 8665 limiter.hpp:186] !!!! LIMITER destroy **** DEADLOCK DETECTED! **** You are waiting on process __limiter__(1419)@192.99.40.208:41947 that it is currently executing. [...] [ OK ] HealthCheckTest.DefaultExecutorCmdHealthCheck (592 ms) [----------] 1 test from HealthCheckTest (593 ms total) [----------] Global test environment tear-down [==========] 1 test from 1 test case ran. (603 ms total) [ PASSED ] 1 test. Repeating all tests (iteration 474) . . . ``` # Snippet from a hung run ``` [...] E0127 11:18:06.878636 8640 cluster.cpp:358] !!! Settling the clock E0127 11:18:06.878646 8640 process.cpp:3491] !!! Attempting to acquire mutex E0127 11:18:06.878654 8640 process.cpp:3494] !!! !runq.empty()? E0127 11:18:06.878659 8640 process.cpp:3502] !!! running.load() > 0? E0127 11:18:06.878667 8640 process.cpp:3504] !!! 1 processes still running E0127 11:18:06.878676 8640 process.cpp:3509] !!!! Process: __authentication_router__(1) state: 3 E0127 11:18:06.878684 8640 process.cpp:3509] !!!! Process: __basic_authenticator__(1893) state: 3 E0127 11:18:06.878691 8640 process.cpp:3509] !!!! Process: __basic_authenticator__(1894) state: 3 E0127 11:18:06.878697 8640 process.cpp:3509] !!!! Process: __basic_authenticator__(1895) state: 3 E0127 11:18:06.878705 8640 process.cpp:3509] !!!! Process: __gc__ state: 3 E0127 11:18:06.878711 8640 process.cpp:3509] !!!! Process: __limiter__(1419) state: 2 E0127 11:18:06.878718 8640 process.cpp:3509] !!!! Process: __processes__ state: 3 E0127 11:18:06.878725 8640 process.cpp:3509] !!!! Process: __reaper__(1) state: 3 E0127 11:18:06.878731 8640 process.cpp:3509] !!!! Process: crammd5-authenticator(474) state: 3 E0127 11:18:06.878738 8640 process.cpp:3509] !!!! Process: files state: 3 E0127 11:18:06.878744 8640 process.cpp:3509] !!!! Process: help state: 3 E0127 11:18:06.878751 8640 process.cpp:3509] !!!! Process: hierarchical-allocator(474) state: 3 E0127 11:18:06.878757 8640 process.cpp:3509] !!!! Process: in-memory-storage(474) state: 3 E0127 11:18:06.878764 8640 process.cpp:3509] !!!! Process: local-authorizer(947) state: 3 E0127 11:18:06.878772 8640 process.cpp:3509] !!!! Process: logging state: 3 E0127 11:18:06.878777 8640 process.cpp:3509] !!!! Process: master state: 3 E0127 11:18:06.878792 8640 process.cpp:3509] !!!! Process: metrics state: 3 E0127 11:18:06.878800 8640 process.cpp:3509] !!!! Process: profiler state: 3 E0127 11:18:06.878806 8640 process.cpp:3509] !!!! Process: registrar(474) state: 3 E0127 11:18:06.878813 8640 process.cpp:3509] !!!! Process: standalone-master-detector(1420) state: 3 E0127 11:18:06.878820 8640 process.cpp:3509] !!!! Process: system state: 3 E0127 11:18:06.878828 8640 process.cpp:3509] !!!! Process: version state: 3 E0127 11:18:06.878834 8640 process.cpp:3509] !!!! Process: whitelist(474) state: 3 E0127 11:18:06.878840 8640 process.cpp:3491] !!! Attempting to acquire mutex E0127 11:18:06.878846 8640 process.cpp:3494] !!! !runq.empty()? E0127 11:18:06.878852 8640 process.cpp:3502] !!! running.load() > 0? E0127 11:18:06.878859 8640 process.cpp:3504] !!! 1 processes still running E0127 11:18:06.878867 8640 process.cpp:3509] !!!! Process: __authentication_router__(1) state: 3 E0127 11:18:06.878873 8640 process.cpp:3509] !!!! Process: __basic_authenticator__(1893) state: 3 E0127 11:18:06.878880 8640 process.cpp:3509] !!!! Process: __basic_authenticator__(1894) state: 3 E0127 11:18:06.878887 8640 process.cpp:3509] !!!! Process: __basic_authenticator__(1895) state: 3 E0127 11:18:06.878895 8640 process.cpp:3509] !!!! Process: __gc__ state: 3 E0127 11:18:06.878901 8640 process.cpp:3509] !!!! Process: __limiter__(1419) state: 2 E0127 11:18:06.878907 8640 process.cpp:3509] !!!! Process: __processes__ state: 3 E0127 11:18:06.878913 8640 process.cpp:3509] !!!! Process: __reaper__(1) state: 3 E0127 11:18:06.878921 8640 process.cpp:3509] !!!! Process: crammd5-authenticator(474) state: 3 E0127 11:18:06.878927 8640 process.cpp:3509] !!!! Process: files state: 3 E0127 11:18:06.878933 8640 process.cpp:3509] !!!! Process: help state: 3 E0127 11:18:06.878940 8640 process.cpp:3509] !!!! Process: hierarchical-allocator(474) state: 3 E0127 11:18:06.878947 8640 process.cpp:3509] !!!! Process: in-memory-storage(474) state: 3 E0127 11:18:06.878953 8640 process.cpp:3509] !!!! Process: local-authorizer(947) state: 3 E0127 11:18:06.878960 8640 process.cpp:3509] !!!! Process: logging state: 3 E0127 11:18:06.878967 8640 process.cpp:3509] !!!! Process: master state: 3 E0127 11:18:06.878973 8640 process.cpp:3509] !!!! Process: metrics state: 3 E0127 11:18:06.878979 8640 process.cpp:3509] !!!! Process: profiler state: 3 E0127 11:18:06.878986 8640 process.cpp:3509] !!!! Process: registrar(474) state: 3 E0127 11:18:06.878993 8640 process.cpp:3509] !!!! Process: standalone-master-detector(1420) state: 3 E0127 11:18:06.878999 8640 process.cpp:3509] !!!! Process: system state: 3 E0127 11:18:06.879006 8640 process.cpp:3509] !!!! Process: version state: 3 E0127 11:18:06.879014 8640 process.cpp:3509] !!!! Process: whitelist(474) state: 3 [...] (repeats ad nauseam) ``` > On Jan. 28, 2017, 1:17 a.m., Vinod Kone wrote: > > src/checks/health_checker.cpp, lines 481-485 > > <https://reviews.apache.org/r/55901/diff/1/?file=1613998#file1613998line481> > > > > The identation seems wrong? do we do it like this elsehwhere? > > > > i would've done so > > > > ``` > > .after(checkTimeout, > > defer(self(), > > ...) > > ``` This was what `clang-format` suggested, but I adopted your suggestion. > On Jan. 28, 2017, 1:17 a.m., Vinod Kone wrote: > > src/checks/health_checker.cpp, line 531 > > <https://reviews.apache.org/r/55901/diff/1/?file=1613998#file1613998line531> > > > > s/has not returned/timed out/ I had taken the failure message from the "normal" cmd check code. I'm updating both. > On Jan. 28, 2017, 1:17 a.m., Vinod Kone wrote: > > src/checks/health_checker.cpp, lines 534-536 > > <https://reviews.apache.org/r/55901/diff/1/?file=1613998#file1613998line534> > > > > why repair? To fail the future/check with a nice descriptive message that will later be logged. > On Jan. 28, 2017, 1:17 a.m., Vinod Kone wrote: > > src/checks/health_checker.cpp, line 538 > > <https://reviews.apache.org/r/55901/diff/1/?file=1613998#file1613998line538> > > > > why not just return failure? The compiler complains if I do that, so I have to choose between: ``` .then([failure](const Option<int>& status) -> Future<Response> { return failure; }); ``` And: ``` .then([failure](const Option<int>& status) { return Future<Response>(failure); }); ``` Which one do we prefer? > On Jan. 28, 2017, 1:17 a.m., Vinod Kone wrote: > > src/checks/health_checker.cpp, lines 426-429 > > <https://reviews.apache.org/r/55901/diff/1/?file=1613998#file1613998line426> > > > > why a `.repair()` here? To fail the future/check with a nice descriptive message that will later be logged. > On Jan. 28, 2017, 1:17 a.m., Vinod Kone wrote: > > src/checks/health_checker.cpp, line 435 > > <https://reviews.apache.org/r/55901/diff/1/?file=1613998#file1613998line435> > > > > const & ? This connection is passed to `nestedCommandHealthCheckTimedOut()`, which calls the non-const method `connection.disconnect()`. - Gastón ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/55901/#review163363 ----------------------------------------------------------- On Jan. 28, 2017, 12:39 p.m., Gastón Kleiman wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/55901/ > ----------------------------------------------------------- > > (Updated Jan. 28, 2017, 12:39 p.m.) > > > Review request for mesos, Alexander Rukletsov, Anand Mazumdar, haosdent > huang, and Vinod Kone. > > > Bugs: MESOS-6280 > https://issues.apache.org/jira/browse/MESOS-6280 > > > Repository: mesos > > > Description > ------- > > Added support for command health checks to the default executor. > > > Diffs > ----- > > src/checks/health_checker.hpp 95da1ff7dd6b222a93076633eb3757ec9aa43cf6 > src/checks/health_checker.cpp e70bd7936752613a4f92c70c4c61cd7cdf7c4ee5 > src/launcher/default_executor.cpp 97eee05cac8cb1f62d43e2aecc08a8e54e49eac3 > src/tests/health_check_tests.cpp 710cb66eff6c4447caa22772f0cdc97cfa582c50 > > Diff: https://reviews.apache.org/r/55901/diff/ > > > Testing > ------- > > Introduced a new test: `HealthCheckTest.DefaultExecutorCmdHealthCheck`. It > passes on Linux, but not on macOS. > > > Thanks, > > Gastón Kleiman > >
