> On Aug. 17, 2016, 12:44 a.m., Jie Yu wrote:
> > High level question: is this useful? I imagine watchdog logic for the agent
> > is to call sd_notify from the `Slave` actor (similar to Master) so that we
> > know `Slave` is still functional. I think this is the right way to detect
> > hanging.
> >
> > The systemd library code should provide a primitive to the Agent code so
> > that it does not call raw `sd_notify` directly.
>
> Jie Yu wrote:
>     Scratch that. I misunderstood what Watchdog is for. But in general, I
>     don't get why we need that. Sounds like not a very effective way to
>     detect hanging.
>
> Lawrence Wu wrote:
>     We (Twitter) currently use monit to monitor mesos, but we'd like to use
>     the systemd watchdog to do the monitoring and detect hangs instead. Could
>     you elaborate on why you think it is not very effective? The way I see
>     it, there is very little extra code and performance overhead (sd_notify
>     is a trivial operation), and we gain reliability on the Mesos side.
For instance, the Slave actor is doing a LOG. Say the disk is slow and it takes
a long time. Will this watchdog detect the hang?


- Jie


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/50540/#review145876
-----------------------------------------------------------


On Aug. 19, 2016, 12:09 a.m., Lawrence Wu wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/50540/
> -----------------------------------------------------------
>
> (Updated Aug. 19, 2016, 12:09 a.m.)
>
>
> Review request for mesos, David Robinson, Ian Downes, and Jie Yu.
>
>
> Bugs: MESOS-5376
>     https://issues.apache.org/jira/browse/MESOS-5376
>
>
> Repository: mesos
>
>
> Description
> -------
>
> Add systemd watchdog support.
>
>
> Diffs
> -----
>
>   configure.ac d2136909b7305498ae901a5ea00133142b77f9e6
>   src/Makefile.am 599ebbef6d164fb2a530b55427ddabb5cd607634
>   src/linux/systemd.hpp 91134f1d4b100759e45931bd09ca4e1e1aeaaf8a
>   src/linux/systemd.cpp 619aa2778da5f99d3a078a8e1208bdaa9dc77581
>   src/slave/main.cpp 4624392d30cf391015dcd63f447fe2414a47a16a
>   src/tests/linux/systemd_test_helper.hpp PRE-CREATION
>   src/tests/linux/systemd_test_helper.cpp PRE-CREATION
>   src/tests/linux/systemd_test_helper_main.cpp PRE-CREATION
>   src/tests/linux/systemd_tests.cpp PRE-CREATION
>
> Diff: https://reviews.apache.org/r/50540/diff/
>
>
> Testing
> -------
>
> Tested by sending SIGSTOP to running mesos and verifying via journalctl that
> it was killed by the watchdog.
>
> The test I wrote for this does the following:
> - build up a unit file as a string and create a unit file in
>   /etc/systemd/system/systemd-test-helper.service
> - reload the systemd daemon and start the newly discovered helper service
> - wait a bit (30s) to make sure the watchdog has had a chance to kill the
>   service
> - use systemctl status systemd-test-helper to check that the service is still
>   running
> - clean up the unit file.
>
>
> Thanks,
>
> Lawrence Wu
>
>
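
To make the first test step concrete, building the unit file as a string might look roughly like the Python sketch below. This is an illustration only, not the code in the patch: the function name, the helper path `/path/to/systemd-test-helper`, the description text, and the 10 second timeout are all assumptions. `WatchdogSec=` is the systemd service option that enables the watchdog for a unit.

```python
def build_unit_file(exec_start: str, watchdog_sec: int) -> str:
    """Return the text of a minimal service unit with a watchdog timeout.

    Hypothetical helper for illustration; the real test builds its own string.
    """
    if watchdog_sec <= 0:
        raise ValueError("watchdog_sec must be positive")
    lines = [
        "[Unit]",
        "Description=systemd watchdog test helper",
        "",
        "[Service]",
        # Command systemd runs for this service (assumed path).
        f"ExecStart={exec_start}",
        # systemd kills the service if no keep-alive arrives within this window.
        f"WatchdogSec={watchdog_sec}",
    ]
    return "\n".join(lines) + "\n"


# Assumed values: the helper path and the timeout are placeholders.
unit_text = build_unit_file("/path/to/systemd-test-helper", 10)
```

The remaining steps (writing the file under /etc/systemd/system, `systemctl daemon-reload`, starting the service, checking `systemctl status`) need root and a running systemd, so they are left out of the sketch.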

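
For readers following the `sd_notify` discussion at the top of the thread, here is a rough Python sketch of what a watchdog keep-alive amounts to at the protocol level: systemd passes the notification socket in `NOTIFY_SOCKET` and the timeout in `WATCHDOG_USEC`, and the service sends the datagram `WATCHDOG=1` before the timeout expires. The function names are made up for illustration, and this is not how the patch is implemented (that would go through `sd_notify` in C++).

```python
import os
import socket


def watchdog_interval_seconds(environ=os.environ):
    """Return half of WATCHDOG_USEC in seconds, or None if no watchdog is set.

    Pinging at half the timeout is the interval the systemd docs recommend.
    """
    usec = environ.get("WATCHDOG_USEC")
    if not usec:
        return None
    return int(usec) / 1_000_000 / 2


def notify(message: str, environ=os.environ) -> bool:
    """Send one datagram to the socket named by NOTIFY_SOCKET.

    Returns False when the variable is unset (not running under systemd).
    """
    path = environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    # A leading '@' denotes a Linux abstract socket, addressed with a NUL byte.
    address = b"\0" + path[1:].encode() if path.startswith("@") else path
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode(), address)
    return True
```

A service would call `notify("WATCHDOG=1")` every `watchdog_interval_seconds()`. Where that call is made from matters for the slow-LOG question above: a ping issued from the actor's own event loop stops when that loop stalls, whereas a ping from an independent thread or timer keeps arriving even while the actor is blocked. Which of the two the patch does is not stated in this thread.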