Re: [SlimDevices: Plugins] Announce: Beta version of SvrPowerControl

epoch1970 Thu, 04 Jun 2009 10:36:05 -0700

gharris999;429267 Wrote: 
> I.e. a home-brewed watchdog, yes?  My current hardware includes a BIOS
> watchdog function and I know linux has watchdog drivers.  I haven't
> really figured out how to make them work, though.  Have you played
> around with any of these system watchdog facilities?  If I knew more
> about them, I might be tempted to add another command to SrvrPowerCtrl
> that could "feed" a watchdog each minute...something like what the
> windows PreventStandby plugin does now.
> 
> If we were to go with just the "is idle" message via the registry, that
> should avoid the sort of system deadlock as you describe.  Presumably,
> LightsOut has the responsibility of clearing the "is idle" flag.
Strange brew, it is. Using watchdog is very simple on an always-on
system. I had it running in a few minutes on the last Alix mini-server
I set up.
On a system that is under power management and does suspend/resume,
using it is a nightmare. But it works, in the end. On a system like
this, I have 3 loops running :
- the slow and bright (?) one: the status assessment loop; it is a
Loop method instance of Net::Daemon. Runs, say, every 5 minutes, looks
at a lot of things (AFP, NFS, SB clients, SC7, ...) and comes back with
a status (keep alive, shutdown, reboot, sleep).
- a faster one: listening on a TCP port for a connection, and replying
with the current status. It is a Run instance of Net::Daemon (in the same
code as the one above). It tries to respond very fast at any time. It
serves the last known status, which may soon be obsolete if the Loop
instance is near finishing a new assessment. Normally it shouldn't do
anything but respond, but there are things better done at the last
minute, so this loop will for example write to the RTC if the status is
suspend. Runs whenever it is being called...
- ... by the linux client-side watchdog program. Configured to run
with RT priority, say, every 30 secs. The watchdog is C code, it does a
lot of smart things and polls a device, probably /dev/watchdog. If
/dev/watchdog is not being written to timely, the OS reboots. If the
client-side can't write to /dev/watchdog, it reboots the machine itself
(not calling shutdown, it's all coded in the client. In fact it can also
sit and wait, depends on your configuration for the client.) The
watchdog also launches a helper "test" program; this program in my case
is a light perl script that connects to the Net::Daemon instance and
executes the required action according to its response. If it times out,
the watchdog client will reboot the machine. This is why the
assessment loop isn't queried directly. The required action is launched
in another process, with a slight delay to allow the test script to
return before the system is suspended.
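
To make the split between the slow and the fast loop concrete, here is a
minimal sketch in Python (my actual code is Perl on top of Net::Daemon,
so this is an illustration, not a port). All names here (StatusHolder,
assessment_loop, make_server, query_status) are made up for the example,
and the assess callable stands in for the real AFP/NFS/SB/SC7 checks.

```python
import socket
import socketserver
import threading

# The four statuses named in the post.
VALID = ("keep alive", "shutdown", "reboot", "sleep")


class StatusHolder:
    """Last known status, shared between the slow and the fast loop."""

    def __init__(self, initial="keep alive"):
        self._lock = threading.Lock()
        self._status = initial

    def get(self):
        with self._lock:
            return self._status

    def set(self, status):
        if status not in VALID:
            raise ValueError("unknown status: %r" % (status,))
        with self._lock:
            self._status = status


def assessment_loop(holder, assess, interval, stop):
    """Slow loop: run the (possibly lengthy) checks, then publish the result."""
    while not stop.is_set():
        holder.set(assess())
        stop.wait(interval)  # sleeps, but wakes up early if stop is set


def make_server(holder, host="127.0.0.1", port=0):
    """Fast loop: answer every connection with the last known status."""

    class Handler(socketserver.BaseRequestHandler):
        def handle(self):
            self.request.sendall(holder.get().encode() + b"\n")

    return socketserver.ThreadingTCPServer((host, port), Handler)


def query_status(host, port, timeout=2.0):
    """What a light test script could do: ask, with a short timeout."""
    with socket.create_connection((host, port), timeout=timeout) as s:
        chunks = []
        while True:
            data = s.recv(64)
            if not data:  # server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode().strip()
```

The responder never waits for an assessment to finish; it only reads
whatever the slow loop published last, which is what keeps the reply
time short enough for the watchdog's test timeout.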


Now enters ACPI wake-up. Who wants to use a watchdog on a non 24/7
system ? Well, me.

Kernel-wise, I had pretty inconsistent results according to my use of
the Intel iTCO hardware watchdog (I think you'll find this hidden gem in
any ICH7 chip and up), or of the softdog module, Linux's software
watchdog: in some cases I had to unload the module before sleep, and
reload it as part of wake-up, in other cases doing so rebooted the
machine (only removing a watchdog module configured with nowayout=1
should cause the kernel to reboot, but that is the theory, not my
practice.) 
Ok, with a bit of testing, modular or inlined in the kernel, soft or
hard (if available, of course), nowayout or yeswayout... you'll get an
OS that will reboot when it's stuck, but won't reboot at wake up.
(Unfortunately the ACPI wake process is not subject to the watchdog, so
you can still freeze at wake-up, and not reboot. That is a severe
limitation, considering this is probably the only moment the watchdog
would be truly useful.)
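
For reference, "feeding" the kernel watchdog from user space boils down
to writing to the device at regular intervals. Below is a bare-bones
sketch in Python, not what the C watchdog client actually does (it does
much more). The function name and its parameters are invented for the
example. Writing the character V just before closing is the kernel's
"magic close": it asks the driver to stop the timer, and it only works
if the driver supports it and nowayout is not set.

```python
import os
import time


def feed_watchdog(path="/dev/watchdog", interval=10.0, rounds=None,
                  sleep=time.sleep):
    """Write to the watchdog device every `interval` seconds.

    rounds=None feeds forever; a number stops after that many writes and
    then attempts a clean ("magic close") shutdown of the timer.
    Returns the number of feeds written.
    """
    fd = os.open(path, os.O_WRONLY)  # opening the device arms the timer
    fed = 0
    try:
        while rounds is None or fed < rounds:
            os.write(fd, b"\0")      # any write counts as a keepalive
            fed += 1
            sleep(interval)
        os.write(fd, b"V")           # magic close: ask the driver to disarm
    finally:
        # If we got here through an exception, no "V" was written and the
        # timer keeps running, so the machine reboots once it expires.
        os.close(fd)
    return fed
```

The interval has to be comfortably shorter than the timeout configured
for the watchdog module, otherwise the feeder itself causes the reboot.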

Now, at wake-up, the watchdog client wakes up too, and it's grumpy.
First of all if the external "test" script which ran the sleep action
failed to return before the system was actually suspended, then watchdog
will believe there was a (long) timeout, and reboot. Hence the
delay+other job used in the external test script.
You can't fool the watchdog client like this. At wake-up, because some
time has elapsed in real life since its last run, it will compute that
the current load is enormous, and want to trigger reboot. It will also
notice that a lot of time has passed without any traffic on the network
interfaces, or that the interfaces are down (i.e. not up again already),
and want to trigger reboot.
Fortunately, before rebooting the watchdog can be configured to launch
a "repair" script. Same as the test script, it has to return fairly
quickly. But it can be used to defuse those false alarms caused by
wake-up. 
My repair script looks at a flag that is set as part of the pm-suspend
script, at ACPI wake-up time. If that flag was recently touched, then
this is a false alarm, and the repair script blocks the reboot. Otherwise
it lets go and the machine reboots.
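
The flag test in the repair script can be sketched as follows, in
Python for illustration only. The flag path and the grace period are
placeholders I picked for the example, not the values from my setup. As
far as I know, the watchdog daemon takes a zero exit code from the
repair script as "repaired, don't reboot" and anything else as "go
ahead"; check the documentation of your watchdog version before relying
on that.

```python
import os
import time

# Hypothetical values, for illustration only.
RESUME_FLAG = "/var/run/resume.flag"  # touched by the resume hook
GRACE_SECONDS = 120                   # how long after wake-up alarms are ignored


def is_false_alarm(flag_path, grace_seconds, now=None):
    """True if the resume flag was touched within the last grace_seconds."""
    if now is None:
        now = time.time()
    try:
        age = now - os.stat(flag_path).st_mtime
    except OSError:
        return False  # no flag at all: not a wake-up, let the reboot happen
    return 0 <= age <= grace_seconds


def repair_exit_code(flag_path=RESUME_FLAG, grace_seconds=GRACE_SECONDS):
    """Exit code for the repair script: 0 blocks the reboot, 1 lets it go."""
    return 0 if is_false_alarm(flag_path, grace_seconds) else 1
```

An actual script would finish with sys.exit(repair_exit_code()) and do
nothing else, since it has to return fairly quickly.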

I might add I have a reboot counter in /etc/init, so that if the
machine reboots too frequently in a row, it restarts with watchdog+power
management scripts deactivated, and keyboard leds flashing. This avoids
an endless reboot loop, which would do more harm than good, in case a
cable is unplugged. The machine is headless so the keyboard leds are a good
visual hint.
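
The reboot counter idea, sketched in Python under assumptions of my own
for the example: boot times are kept in a small JSON file, and "too
frequently" means more than max_boots boots inside a sliding window.
The function name, file format and thresholds are illustrative, not
what my init script really uses.

```python
import json


def record_boot(path, now, window=600, max_boots=3):
    """Record a boot at time `now` (seconds) and report whether to go safe.

    Keeps only the boots from the last `window` seconds. Returns True when
    more than `max_boots` of them remain, i.e. the machine should start
    with watchdog and power management scripts deactivated.
    """
    try:
        with open(path) as f:
            boots = json.load(f)
    except (OSError, ValueError):
        boots = []  # first boot, or unreadable history: start afresh

    boots = [t for t in boots if now - t < window] + [now]

    with open(path, "w") as f:
        json.dump(boots, f)

    return len(boots) > max_boots
```

Called early at boot with the current time, a True result is the cue to
skip starting the watchdog and flash the keyboard leds instead.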

Gordon, if you're still reading this: in case you're interested in
looking at my code, I'll gladly share it with you.


-- 
epoch1970
------------------------------------------------------------------------
epoch1970's Profile: http://forums.slimdevices.com/member.php?userid=16711
View this thread: http://forums.slimdevices.com/showthread.php?t=48521

_______________________________________________
plugins mailing list
[email protected]
http://lists.slimdevices.com/mailman/listinfo/plugins
