Re: [Openais] kernel watchdog timer for corosync

Angus Salkeld Mon, 24 May 2010 17:34:52 -0700

On Mon, May 24, 2010 at 05:06:12PM -0700, Alan Jones wrote:
> Is there any interest in a kernel watchdog timer for corosync and, if so,
> where should it be petted?
> I did a simple test of killing the corosync 1.2.1 daemon in a pacemaker
> configuration with shared storage.
> Sure enough, the node is declared offline which presents a potential for
> corruption.
> I know that a stonith device should protect you, but it seems to me that a
> watchdog timer would add another layer of protection.
> A traditional place to pet the watchdog might be in the receive path,
> assuming that there is some loopback message transmitted in regular
> intervals.
> Alan
Hi


Jan Friesse and I are busy implementing watchdog functionality for corosync.

(This is for people that don't want a large cluster stack - just corosync).
There are 2 new services:
mon - monitoring
sf - self fencing (watchdog & reboot, ...)
and some changes to the SAM lib.

It'll go something like this:

Let's say you have some processes to monitor:
apple
bannana

1) You want to restart apple if anything goes wrong.
2) You want the host to watchdog (reset) if bannana fails.
3) If memory usage gets to 95% you want to reboot
   to prevent the OOM killer from kicking in.
4) If corosync dies you want to watchdog.

so in corosync.conf we have:
resources {
        watchdog_timeout: 4
        system {
                memory_used {
                        recovery: reboot
                        max: 80
                        poll_period: <how often it is updated>
                }
        }
}
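
To make the memory_used check concrete, here is a rough sketch of what a
poller could compute from /proc/meminfo. This is not the mon service code.
The function names are made up for illustration, and the definition of
"used" (MemTotal minus MemFree, Buffers and Cached) is my assumption; the
real service may count it differently.

```python
def memory_used_percent(meminfo_text):
    """Return used memory as a percentage of MemTotal.

    Assumption (not from corosync): "used" means
    MemTotal - MemFree - Buffers - Cached, all in kB.
    """
    fields = {}
    for line in meminfo_text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if not parts:
            continue
        fields[name.strip()] = int(parts[0])  # first token is the number
    total = fields["MemTotal"]
    free = fields["MemFree"] + fields.get("Buffers", 0) + fields.get("Cached", 0)
    return 100.0 * (total - free) / total


def memory_recovery_needed(meminfo_text, max_percent):
    """True when usage has reached the configured 'max' threshold."""
    return memory_used_percent(meminfo_text) >= max_percent
```

A poller would call memory_recovery_needed(open("/proc/meminfo").read(), 80)
once per poll_period and flag the resource if it returns True.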

Then apple and bannana start sam and tell sam what behaviour
they want (restart/reboot/watchdog).

sam creates the following entries in the objdb.

resources {
        processes {
                <bannana:pid> {
                        recovery: watchdog
                        state: good/bad
                        last_updated: <timestamp>
                        poll_period: <how often it is updated>
                }
        }
}

the sf service will then run the recovery action if
(state == bad) or ((last_updated + poll_period) < current time)

It is the responsibility of the poller (sam and monitor service) to
update the entry in time.
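
The rule above can be sketched in a few lines. This is only an illustration
of the check and of what a poller has to do, not the sf or sam code. The
dict layout mirrors the objdb entry, the function names are hypothetical,
and I am assuming last_updated and poll_period are both in seconds.

```python
def should_recover(entry, now):
    """Return True if the recovery action should run for this entry.

    Mirrors: (state == bad) or ((last_updated + poll_period) < current time)
    """
    if entry["state"] == "bad":
        return True
    return entry["last_updated"] + entry["poll_period"] < now


def heartbeat(entry, healthy, now):
    """What a poller does each period: refresh state and timestamp."""
    entry["state"] = "good" if healthy else "bad"
    entry["last_updated"] = now
```

A poller that stops calling heartbeat() (because it hung or died) is caught
by the timestamp test even though its last reported state was "good".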

And since corosync is the one tickling the watchdog, when corosync dies
the tickling stops and we get a watchdog.
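
For anyone not familiar with the kernel side: on Linux, opening
/dev/watchdog arms the timer, any write to it counts as a tickle, and
writing 'V' just before close disarms it on drivers that support "magic
close". A minimal sketch of that, with a hypothetical wrapper class (this
is not how corosync does it, just the shape of the interface):

```python
class WatchdogDevice:
    """Thin wrapper around an already opened watchdog device file.

    'dev' is any binary file-like object; in real use it would be
    open("/dev/watchdog", "wb", buffering=0), which needs root and
    arms the timer as soon as it is opened.
    """

    def __init__(self, dev):
        self.dev = dev

    def tickle(self):
        # Any write resets the countdown.
        self.dev.write(b"\0")
        self.dev.flush()

    def disarm_and_close(self):
        # 'V' is the magic-close character; without it (or with
        # nowayout set) the timer keeps running after close.
        self.dev.write(b"V")
        self.dev.flush()
        self.dev.close()
```

If the process holding the device dies without writing 'V', nobody calls
tickle() any more and the kernel resets the host once the timeout expires.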


----
Another avenue I am looking at is chatting to the watchdog guys to get
an IPC interface to watchdog, so that we don't have to re-do what
watchdogd does well (loading kernel mods and tickling the watchdog).

This way multiple daemons could effectively tickle the watchdog
(at different timeouts if needed) and have a simpler API as well.
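
To show what I mean by multiple daemons with different timeouts, here is a
sketch of the bookkeeping such a daemon might do. No such IPC interface
exists yet, so the class and method names are entirely hypothetical. The
idea is that the hardware watchdog only gets tickled while every
registered client has checked in within its own timeout.

```python
class WatchdogMux:
    """Tracks per-client software timeouts in front of one hardware watchdog."""

    def __init__(self):
        self.clients = {}  # name -> (timeout_seconds, last_seen)

    def register(self, name, timeout, now):
        self.clients[name] = (timeout, now)

    def tickle(self, name, now):
        # A client checks in; keep its timeout, refresh its timestamp.
        timeout, _ = self.clients[name]
        self.clients[name] = (timeout, now)

    def expired(self, now):
        """Names of clients that have missed their own deadline."""
        return sorted(
            name
            for name, (timeout, last_seen) in self.clients.items()
            if last_seen + timeout < now
        )

    def may_tickle_hardware(self, now):
        # One late client is enough to stop tickling the real watchdog.
        return not self.expired(now)
```

The daemon's main loop would tickle the real device only while
may_tickle_hardware() is True, so a single stuck client ends in a reset.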


-Angus

> _______________________________________________
> Openais mailing list
> [email protected]
> https://lists.linux-foundation.org/mailman/listinfo/openais
