> On Jan. 28, 2017, 1:17 a.m., Vinod Kone wrote: > > src/checks/health_checker.cpp, lines 426-429 > > <https://reviews.apache.org/r/55901/diff/1/?file=1613998#file1613998line426> > > > > why a `.repair()` here? > > Gastón Kleiman wrote: > To fail the future/check with a nice descriptive message that will later > be logged.
The message returned by `http::connect` should be good enough? Do we use this pattern elsewhere esp. with connect? Most uses of repair I have seen in the code base, transform the future not really to add extra logging information. > On Jan. 28, 2017, 1:17 a.m., Vinod Kone wrote: > > src/checks/health_checker.hpp, line 137 > > <https://reviews.apache.org/r/55901/diff/1/?file=1613997#file1613997line137> > > > > we don't typically do `const` for POD types. > > > > s/_agentSpawnsCommandContainer/ > > Gastón Kleiman wrote: > Looks lke the regex was truncated ;-). How about "localCommandCheck" (vs "remoteCommandCheck")? > On Jan. 28, 2017, 1:17 a.m., Vinod Kone wrote: > > src/checks/health_checker.cpp, line 480 > > <https://reviews.apache.org/r/55901/diff/1/?file=1613998#file1613998line480> > > > > launch nested container session returns a streaming response, how come > > you are calling `post()` helper here which expects a non-streaming response? > > > > probably one of the reasons why your test is hanging. > > Gastón Kleiman wrote: > The `post()` helper delegates to `process::http::request()`, which takes > boolean flag (`streamedResponse`). If this flag is set to `false`, libprocess > will convert a PIPE (streaming) response into a BODY (non-streaming) > response. This means that the `Future` returned by the helper will not be > completed until the server closes the connection. > > Relevant links: > - > https://github.com/apache/mesos/blob/2d0195eed54feac41485fb1503dc4004e5500c81/3rdparty/libprocess/include/process/http.hpp#L953-L955 > - > https://github.com/apache/mesos/blob/2d0195eed54feac41485fb1503dc4004e5500c81/3rdparty/libprocess/src/http.cpp#L1191-L1197 > - > https://github.com/apache/mesos/blob/2d0195eed54feac41485fb1503dc4004e5500c81/3rdparty/libprocess/src/http.cpp#L1005-L1022 > > In these particular cases, it means that the `Future` will not be > completed until the container exits, which is exactly what we need. > > > Regarding the test hanging in Linux, some further debugging seem to > indicate that test hangs on `process::Clock::settle()`, because there's a > race + deadlock in `RateLimiter` that leaves a process stuck in `RUNNING`. > I'll dig deeper on Monday, but here's some evidence: > > # Snippet from a successful test run that "leaks" a running process > > ``` > [==========] Running 1 test from 1 test case. > [----------] Global test environment set-up. > [----------] 1 test from HealthCheckTest > [ RUN ] HealthCheckTest.DefaultExecutorCmdHealthCheck > > [...] > > E0127 11:18:06.607446 8665 limiter.hpp:123] !!!! LIMITER _acquire > E0127 11:18:06.607471 8665 limiter.hpp:129] !!!! LIMITER There are 1 > promises > E0127 11:18:06.608064 8665 limiter.hpp:186] !!!! LIMITER destroy > > **** DEADLOCK DETECTED! **** > You are waiting on process __limiter__(1419)@192.99.40.208:41947 that it > is currently executing. > > [...] > > [ OK ] HealthCheckTest.DefaultExecutorCmdHealthCheck (592 ms) > [----------] 1 test from HealthCheckTest (593 ms total) > > [----------] Global test environment tear-down > [==========] 1 test from 1 test case ran. (603 ms total) > [ PASSED ] 1 test. > > Repeating all tests (iteration 474) . . . > ``` > > # Snippet from a hung run > > ``` > [...] > E0127 11:18:06.878636 8640 cluster.cpp:358] !!! Settling the clock > E0127 11:18:06.878646 8640 process.cpp:3491] !!! Attempting to acquire > mutex > E0127 11:18:06.878654 8640 process.cpp:3494] !!! !runq.empty()? > E0127 11:18:06.878659 8640 process.cpp:3502] !!! running.load() > 0? > E0127 11:18:06.878667 8640 process.cpp:3504] !!! 1 processes still > running > E0127 11:18:06.878676 8640 process.cpp:3509] !!!! Process: > __authentication_router__(1) state: 3 > E0127 11:18:06.878684 8640 process.cpp:3509] !!!! Process: > __basic_authenticator__(1893) state: 3 > E0127 11:18:06.878691 8640 process.cpp:3509] !!!! Process: > __basic_authenticator__(1894) state: 3 > E0127 11:18:06.878697 8640 process.cpp:3509] !!!! Process: > __basic_authenticator__(1895) state: 3 > E0127 11:18:06.878705 8640 process.cpp:3509] !!!! Process: __gc__ state: > 3 > E0127 11:18:06.878711 8640 process.cpp:3509] !!!! Process: > __limiter__(1419) state: 2 > E0127 11:18:06.878718 8640 process.cpp:3509] !!!! Process: __processes__ > state: 3 > E0127 11:18:06.878725 8640 process.cpp:3509] !!!! Process: __reaper__(1) > state: 3 > E0127 11:18:06.878731 8640 process.cpp:3509] !!!! Process: > crammd5-authenticator(474) state: 3 > E0127 11:18:06.878738 8640 process.cpp:3509] !!!! Process: files state: 3 > E0127 11:18:06.878744 8640 process.cpp:3509] !!!! Process: help state: 3 > E0127 11:18:06.878751 8640 process.cpp:3509] !!!! Process: > hierarchical-allocator(474) state: 3 > E0127 11:18:06.878757 8640 process.cpp:3509] !!!! Process: > in-memory-storage(474) state: 3 > E0127 11:18:06.878764 8640 process.cpp:3509] !!!! Process: > local-authorizer(947) state: 3 > E0127 11:18:06.878772 8640 process.cpp:3509] !!!! Process: logging > state: 3 > E0127 11:18:06.878777 8640 process.cpp:3509] !!!! Process: master state: > 3 > E0127 11:18:06.878792 8640 process.cpp:3509] !!!! Process: metrics > state: 3 > E0127 11:18:06.878800 8640 process.cpp:3509] !!!! Process: profiler > state: 3 > E0127 11:18:06.878806 8640 process.cpp:3509] !!!! Process: > registrar(474) state: 3 > E0127 11:18:06.878813 8640 process.cpp:3509] !!!! Process: > standalone-master-detector(1420) state: 3 > E0127 11:18:06.878820 8640 process.cpp:3509] !!!! Process: system state: > 3 > E0127 11:18:06.878828 8640 process.cpp:3509] !!!! Process: version > state: 3 > E0127 11:18:06.878834 8640 process.cpp:3509] !!!! Process: > whitelist(474) state: 3 > E0127 11:18:06.878840 8640 process.cpp:3491] !!! Attempting to acquire > mutex > E0127 11:18:06.878846 8640 process.cpp:3494] !!! !runq.empty()? > E0127 11:18:06.878852 8640 process.cpp:3502] !!! running.load() > 0? > E0127 11:18:06.878859 8640 process.cpp:3504] !!! 1 processes still > running > E0127 11:18:06.878867 8640 process.cpp:3509] !!!! Process: > __authentication_router__(1) state: 3 > E0127 11:18:06.878873 8640 process.cpp:3509] !!!! Process: > __basic_authenticator__(1893) state: 3 > E0127 11:18:06.878880 8640 process.cpp:3509] !!!! Process: > __basic_authenticator__(1894) state: 3 > E0127 11:18:06.878887 8640 process.cpp:3509] !!!! Process: > __basic_authenticator__(1895) state: 3 > E0127 11:18:06.878895 8640 process.cpp:3509] !!!! Process: __gc__ state: > 3 > E0127 11:18:06.878901 8640 process.cpp:3509] !!!! Process: > __limiter__(1419) state: 2 > E0127 11:18:06.878907 8640 process.cpp:3509] !!!! Process: __processes__ > state: 3 > E0127 11:18:06.878913 8640 process.cpp:3509] !!!! Process: __reaper__(1) > state: 3 > E0127 11:18:06.878921 8640 process.cpp:3509] !!!! Process: > crammd5-authenticator(474) state: 3 > E0127 11:18:06.878927 8640 process.cpp:3509] !!!! Process: files state: 3 > E0127 11:18:06.878933 8640 process.cpp:3509] !!!! Process: help state: 3 > E0127 11:18:06.878940 8640 process.cpp:3509] !!!! Process: > hierarchical-allocator(474) state: 3 > E0127 11:18:06.878947 8640 process.cpp:3509] !!!! Process: > in-memory-storage(474) state: 3 > E0127 11:18:06.878953 8640 process.cpp:3509] !!!! Process: > local-authorizer(947) state: 3 > E0127 11:18:06.878960 8640 process.cpp:3509] !!!! Process: logging > state: 3 > E0127 11:18:06.878967 8640 process.cpp:3509] !!!! Process: master state: > 3 > E0127 11:18:06.878973 8640 process.cpp:3509] !!!! Process: metrics > state: 3 > E0127 11:18:06.878979 8640 process.cpp:3509] !!!! Process: profiler > state: 3 > E0127 11:18:06.878986 8640 process.cpp:3509] !!!! Process: > registrar(474) state: 3 > E0127 11:18:06.878993 8640 process.cpp:3509] !!!! Process: > standalone-master-detector(1420) state: 3 > E0127 11:18:06.878999 8640 process.cpp:3509] !!!! Process: system state: > 3 > E0127 11:18:06.879006 8640 process.cpp:3509] !!!! Process: version > state: 3 > E0127 11:18:06.879014 8640 process.cpp:3509] !!!! Process: > whitelist(474) state: 3 > [...] (repeats ad nauseam) > ``` > > Gastón Kleiman wrote: > See https://issues.apache.org/jira/browse/MESOS-7036 for an in-depth > analysis of the deadlock. This needs a giant comment then to explain the rationale for posterity. > On Jan. 28, 2017, 1:17 a.m., Vinod Kone wrote: > > src/checks/health_checker.cpp, line 538 > > <https://reviews.apache.org/r/55901/diff/1/?file=1613998#file1613998line538> > > > > why not just return failure? > > Gastón Kleiman wrote: > The compiler complains if I do that, so I have to choose between: > > ``` > .then([failure](const Option<int>& status) -> Future<Response> { > return failure; > }); > ``` > > And: > > ``` > .then([failure](const Option<int>& status) { > return Future<Response>(failure); > }); > ``` > > Which one do we prefer? whichever one you pick I would like to see both `.repair` and `.then` be consistent if possible. > On Jan. 28, 2017, 1:17 a.m., Vinod Kone wrote: > > src/checks/health_checker.cpp, line 435 > > <https://reviews.apache.org/r/55901/diff/1/?file=1613998#file1613998line435> > > > > const & ? > > Gastón Kleiman wrote: > This connection is passed to `nestedCommandHealthCheckTimedOut()`, which > calls the non-const method `connection.disconnect()`. I see. > On Jan. 28, 2017, 1:17 a.m., Vinod Kone wrote: > > src/checks/health_checker.cpp, lines 534-536 > > <https://reviews.apache.org/r/55901/diff/1/?file=1613998#file1613998line534> > > > > why repair? > > Gastón Kleiman wrote: > To fail the future/check with a nice descriptive message that will later > be logged. see above. - Vinod ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/55901/#review163363 ----------------------------------------------------------- On Jan. 28, 2017, 12:39 p.m., Gastón Kleiman wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/55901/ > ----------------------------------------------------------- > > (Updated Jan. 28, 2017, 12:39 p.m.) > > > Review request for mesos, Alexander Rukletsov, Anand Mazumdar, haosdent > huang, and Vinod Kone. > > > Bugs: MESOS-6280 > https://issues.apache.org/jira/browse/MESOS-6280 > > > Repository: mesos > > > Description > ------- > > Added support for command health checks to the default executor. > > > Diffs > ----- > > src/checks/health_checker.hpp 95da1ff7dd6b222a93076633eb3757ec9aa43cf6 > src/checks/health_checker.cpp e70bd7936752613a4f92c70c4c61cd7cdf7c4ee5 > src/launcher/default_executor.cpp 97eee05cac8cb1f62d43e2aecc08a8e54e49eac3 > src/tests/health_check_tests.cpp 710cb66eff6c4447caa22772f0cdc97cfa582c50 > > Diff: https://reviews.apache.org/r/55901/diff/ > > > Testing > ------- > > Introduced a new test: `HealthCheckTest.DefaultExecutorCmdHealthCheck`. It > passes on Linux, but not on macOS. > > > Thanks, > > Gastón Kleiman > >
