> On Aug. 17, 2016, 12:44 a.m., Jie Yu wrote:
> > High level question: is this useful? I imagine watchdog logic for agent is 
> > to call sd_notify from `Slave` actor (similar to Master) so that we know 
> > `Slave` is still functional. I think this is the right way to detect 
> > hanging.
> > 
> > The systemd library code should provide primitive to the Agent code so that 
> > it does not call raw `sd_notify` directly.
> 
> Jie Yu wrote:
>     Scratch that. I misunderstood what Watchdog is for. But in general, i 
> don't get why we need that. Sound like not a very effective way to detect 
> hanging.
> 
> Lawrence Wu wrote:
>     We (Twitter) currently use monit to monitor mesos, but we'd like to use 
> the systemd watchdog to do the monitoring and detect hangs instead. Could you 
> elaborate on why you think it is not very effective? The way I see it, there 
> is very little extra code and performance overhead (sd_notify is a trivial 
> operation), and we gain reliability on the Mesos side.

For instance, the Slave actor is doing a LOG. Say the disk is slow and it takes 
a long time. Will this watchdog detect the hang?


- Jie


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/50540/#review145876
-----------------------------------------------------------


On Aug. 19, 2016, 12:09 a.m., Lawrence Wu wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/50540/
> -----------------------------------------------------------
> 
> (Updated Aug. 19, 2016, 12:09 a.m.)
> 
> 
> Review request for mesos, David Robinson, Ian Downes, and Jie Yu.
> 
> 
> Bugs: MESOS-5376
>     https://issues.apache.org/jira/browse/MESOS-5376
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Add systemd watchdog support.
> 
> 
> Diffs
> -----
> 
>   configure.ac d2136909b7305498ae901a5ea00133142b77f9e6 
>   src/Makefile.am 599ebbef6d164fb2a530b55427ddabb5cd607634 
>   src/linux/systemd.hpp 91134f1d4b100759e45931bd09ca4e1e1aeaaf8a 
>   src/linux/systemd.cpp 619aa2778da5f99d3a078a8e1208bdaa9dc77581 
>   src/slave/main.cpp 4624392d30cf391015dcd63f447fe2414a47a16a 
>   src/tests/linux/systemd_test_helper.hpp PRE-CREATION 
>   src/tests/linux/systemd_test_helper.cpp PRE-CREATION 
>   src/tests/linux/systemd_test_helper_main.cpp PRE-CREATION 
>   src/tests/linux/systemd_tests.cpp PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/50540/diff/
> 
> 
> Testing
> -------
> 
> Tested by sending SIGSTOP to running mesos and verifying via journalctl that 
> it was killed by the watchdog.
> 
> The test I wrote for this does the following:
> - build up a unit file as a string and create a unit file in 
> /etc/systemd/system/systemd-test-helper.service
> - reload the systemd daemon and start the newly discovered helper service
> - wait a bit (30s) to make sure the watchdog has had a chance to kill the 
> service
> - use systemctl status systemd-test-helper to check that the service is still 
> running
> - clean up the unit file.
> 
> 
> Thanks,
> 
> Lawrence Wu
> 
>

Reply via email to