> On Aug. 17, 2016, 12:44 a.m., Jie Yu wrote: > > High level question: is this useful? I imagine watchdog logic for agent is > > to call sd_notify from `Slave` actor (similar to Master) so that we know > > `Slave` is still functional. I think this is the right way to detect > > hanging. > > > > The systemd library code should provide primitive to the Agent code so that > > it does not call raw `sd_notify` directly. > > Jie Yu wrote: > Scratch that. I misunderstood what Watchdog is for. But in general, i > don't get why we need that. Sound like not a very effective way to detect > hanging.
We (Twitter) currently use monit to monitor mesos, but we'd like to use the systemd watchdog to do the monitoring and detect hangs instead. Could you elaborate on why you think it is not very effective? The way I see it, there is very little extra code and performance overhead (sd_notify is a trivial operation), and we gain reliability on the Mesos side. > On Aug. 17, 2016, 12:44 a.m., Jie Yu wrote: > > src/linux/systemd.cpp, line 207 > > <https://reviews.apache.org/r/50540/diff/3/?file=1472243#file1472243line207> > > > > Who will delete the Watchdog? I think you should use a `managed` spawn. > > See `spawn` definition. Reading the definition for a managed spawn, it looks like we do not want a managed process -- that will get garbage collected. We want the watchdog to be alive for the entire lifetime of the process. I added a comment clarifying this. > On Aug. 17, 2016, 12:44 a.m., Jie Yu wrote: > > src/linux/systemd.cpp, line 208 > > <https://reviews.apache.org/r/50540/diff/3/?file=1472243#file1472243line208> > > > > Why create the watch dog actor if WATCHDOG_USEC is not set? this should be fixed now. > On Aug. 17, 2016, 12:44 a.m., Jie Yu wrote: > > src/linux/systemd.hpp, line 32 > > <https://reviews.apache.org/r/50540/diff/3/?file=1472242#file1472242line32> > > > > We now does not use `using namespace` any more. Please include them one > > by one using `using xxx;` clauses. Done. > On Aug. 17, 2016, 12:44 a.m., Jie Yu wrote: > > src/linux/systemd.cpp, line 203 > > <https://reviews.apache.org/r/50540/diff/3/?file=1472243#file1472243line203> > > > > Hum, this also could be the case where WATCHDOG_USEC is not set, right? > > The error here sounds unintuitive. > > > > I'd suggest we check env exists or not, and then parse it. > > > > ``` > > if (watchdogUsec.isNone()) { > > } else if (watchDog.isError()) { > > } > > ``` Done. > On Aug. 17, 2016, 12:44 a.m., Jie Yu wrote: > > configure.ac, lines 615-616 > > <https://reviews.apache.org/r/50540/diff/3/?file=1472240#file1472240line615> > > > > Looks like you haven't checked headers. You need dev headers, right? I'm not too sure, but I think we just need this lib for sd_notify. I tried building this on a machine without the systemd dev headers and it worked, so I don't think dev headers are necessary. - Lawrence ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/50540/#review145876 ----------------------------------------------------------- On Aug. 18, 2016, 11:19 p.m., Lawrence Wu wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/50540/ > ----------------------------------------------------------- > > (Updated Aug. 18, 2016, 11:19 p.m.) > > > Review request for mesos, David Robinson, Ian Downes, and Jie Yu. > > > Bugs: MESOS-5376 > https://issues.apache.org/jira/browse/MESOS-5376 > > > Repository: mesos > > > Description > ------- > > Add systemd watchdog support. > > > Diffs > ----- > > configure.ac d2136909b7305498ae901a5ea00133142b77f9e6 > src/Makefile.am 599ebbef6d164fb2a530b55427ddabb5cd607634 > src/linux/systemd.hpp 91134f1d4b100759e45931bd09ca4e1e1aeaaf8a > src/linux/systemd.cpp 619aa2778da5f99d3a078a8e1208bdaa9dc77581 > src/slave/main.cpp 4624392d30cf391015dcd63f447fe2414a47a16a > src/tests/linux/systemd_test_helper.hpp PRE-CREATION > src/tests/linux/systemd_test_helper.cpp PRE-CREATION > src/tests/linux/systemd_test_helper_main.cpp PRE-CREATION > src/tests/linux/systemd_tests.cpp PRE-CREATION > > Diff: https://reviews.apache.org/r/50540/diff/ > > > Testing > ------- > > Tested by sending SIGSTOP to running mesos and verifying via journalctl that > it was killed by the watchdog. > > The test I wrote for this does the following: > - build up a unit file as a string and create a unit file in > /etc/systemd/system/systemd-test-helper.service > - reload the systemd daemon and start the newly discovered helper service > - wait a bit (30s) to make sure the watchdog has had a chance to kill the > service > - use systemctl status systemd-test-helper to check that the service is still > running > - clean up the unit file. > > > Thanks, > > Lawrence Wu > >
