gharris999;429267 Wrote:
> I.e. a home-brewed watchdog, yes? My current hardware includes a BIOS
> watchdog function and I know linux has watchdog drivers. I haven't
> really figured out how to make them work, though. Have you played
> around with any of these system watchdog facilities? If I knew more
> about them, I might be tempted to add another command to SrvrPowerCtrl
> that could "feed" a watchdog each minute...something like what the
> windows PreventStandby plugin does now.
>
> If we were to go with just the "is idle" message via the registry, that
> should avoid the sort of system deadlock as you describe. Presumably,
> LightsOut has the responsibility of clearing the "is idle" flag.

Strange brew, it is.

Using watchdog is very simple on an always-on system. I had it running in a few minutes on the last Alix mini-server I set up. On a system that is under power management and does suspend/resume, using it is a nightmare. But it works, in the end.

On a system like this, I have 3 loops running:
- the slow and bright (?) one: the status assessment loop. It is a Loop method instance of Net::Daemon. It runs, say, every 5 minutes, looks at a lot of things (AFP, NFS, SB clients, SC7, ...) and comes back with a status (keep alive, shutdown, reboot, sleep).
- a faster one: listening on a TCP port for a connection, and replying with the current status. It is a Run instance of Net::Daemon (in the same code as the one above). It tries to respond very fast at any time. It serves the last known status, which may soon be obsolete if the Loop instance is about to finish a new assessment. Normally it shouldn't do anything but respond, but some things are better done at the last minute, so this loop will, for example, write to the RTC if the status is suspend. Runs whenever it is being called...
- ... by the Linux client-side watchdog program, configured to run with RT priority, say, every 30 secs.
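For readers who want to see the shape of the first two loops, here is a minimal sketch in Python using only the standard library. It is not my Perl Net::Daemon code: the class and function names, the status strings and the one-line reply protocol are all assumptions made for illustration, and the real assessment (AFP, NFS, SB clients, ...) is reduced to a callable passed in by the caller.

```python
import socketserver
import threading

# Hypothetical status names, mirroring the four outcomes listed above.
STATUSES = ("keepalive", "shutdown", "reboot", "sleep")


class StatusHolder:
    """Holds the last known status, shared between the slow and the fast loop."""

    def __init__(self, initial="keepalive"):
        self._lock = threading.Lock()
        self._status = initial

    def get(self):
        with self._lock:
            return self._status

    def set(self, status):
        if status not in STATUSES:
            raise ValueError(f"unknown status: {status}")
        with self._lock:
            self._status = status


def assessment_loop(holder, assess, interval, stop):
    """Slow loop: run the (possibly lengthy) assessment every `interval` seconds."""
    while not stop.is_set():
        holder.set(assess())
        stop.wait(interval)


def make_server(holder, host="127.0.0.1", port=0):
    """Fast loop: answer each TCP connection with the last known status."""

    class Handler(socketserver.StreamRequestHandler):
        def handle(self):
            # Never wait for a fresh assessment; serve what is already known.
            self.wfile.write((holder.get() + "\n").encode("ascii"))

    return socketserver.ThreadingTCPServer((host, port), Handler)
```

In use, `assessment_loop` would run in a background thread with an interval of about 300 seconds while `make_server(holder, port=...)` is started with `serve_forever()`. The port number is left to whoever deploys it; with the default `port=0` the OS picks a free one.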
The watchdog is C code; it does a lot of smart things and polls a device, probably /dev/watchdog. If /dev/watchdog is not written to in time, the OS reboots. If the client side can't write to /dev/watchdog, it reboots the machine itself (without calling shutdown; it's all coded in the client. In fact it can also sit and wait, depending on how you configure the client.)

The watchdog also launches a helper "test" program; in my case this is a light perl script that connects to the Net::Daemon instance and executes the required action according to its response. If the script times out, the watchdog client will reboot the machine. This is why the assessment loop isn't queried directly. The required action is launched in another process, with a slight delay, to allow the test script to return before the system is suspended.
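The test helper described above could be sketched like this in Python (mine is a perl script; this is not it). The host, port 9999, the status names, the commands in `ACTIONS` and the 10-second delay are all placeholders I made up for the example. The exit-code convention (0 for "fine", non-zero for "problem") is what I would expect from the watchdog client, but check your client's documentation.

```python
import socket
import subprocess

# Hypothetical mapping from status to command; adapt to your own system.
ACTIONS = {
    "sleep": ["pm-suspend"],
    "shutdown": ["shutdown", "-h", "now"],
    "reboot": ["shutdown", "-r", "now"],
}


def query_status(host, port, timeout=5.0):
    """Ask the status daemon for its last known status; raises OSError on failure."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with sock.makefile("r", encoding="ascii") as reader:
            return reader.readline().strip()


def delayed_command(command, delay):
    """Build a shell command line that sleeps `delay` seconds, then runs `command`."""
    return ["sh", "-c", 'sleep "$0" && exec "$@"', str(delay), *command]


def launch_detached(command, delay=10):
    """Start `command` after `delay` seconds in its own session; return at once."""
    return subprocess.Popen(
        delayed_command(command, delay),
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # detach, so the helper can exit before the action runs
    )


def run_test(host="127.0.0.1", port=9999):
    """Return 0 if the daemon answered, 1 if it did not (or answered nothing)."""
    try:
        status = query_status(host, port)
    except OSError:  # refused, unreachable or timed out
        return 1
    if not status:
        return 1
    action = ACTIONS.get(status)
    if action is not None:
        launch_detached(action)
    return 0
```

A script built on this would simply end with `sys.exit(run_test())`. The point of the delay plus the separate session is that the helper has already returned to the watchdog by the time the suspend actually starts.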
Enter ACPI wake-up. Who wants to use a watchdog on a non-24/7 system? Well, me.

Kernel-wise, I had pretty inconsistent results depending on whether I used the Intel iTCO hardware watchdog (I think you'll find this hidden gem in any ICH7 chip and up) or the softdog module, Linux's software watchdog: in some cases I had to unload the module before sleep and reload it as part of wake-up; in other cases doing so rebooted the machine (only removing a watchdog module configured with nowayout=1 should cause the kernel to reboot, but that is the theory, not my practice.)

OK, with a bit of testing, modular or built into the kernel, soft or hard (if available, of course), nowayout or yeswayout... you'll get an OS that will reboot when it's stuck, but won't reboot at wake-up. (Unfortunately the ACPI wake process is not subject to the watchdog, so you can still freeze at wake-up and not reboot. That is a severe limitation, considering this is probably the only moment the watchdog would be truly useful.)

Now, at wake-up, the watchdog client wakes up too, and it's grumpy. First of all, if the external "test" script which ran the sleep action failed to return before the system was actually suspended, then watchdog will believe there was a (long) timeout, and reboot. Hence the delay plus separate job used in the external test script.

You can't fool the watchdog client like this. At wake-up, because real-world time has elapsed since its last run, it will compute that the current load is enormous, and want to trigger a reboot. It will also notice that a lot of time has passed without any traffic on the network interfaces, or that the interfaces are down (i.e. not up again yet), and want to trigger a reboot.

Fortunately, before rebooting, the watchdog can be configured to launch a "repair" script. Like the test script, it has to return fairly quickly. But it can be used to defuse those false alarms caused by wake-up.
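One way such a repair script can tell a wake-up false alarm from a real problem is to check how recently the machine resumed. Below is a minimal Python sketch of that check, assuming a flag file that a resume hook touches at wake-up; the function names, the flag path (left to the caller) and the 120-second window are made up for the example, and the exit codes assume the watchdog client treats 0 from the repair script as "repaired, carry on".

```python
import os
import time


def recently_resumed(flag_path, max_age=120.0, now=None):
    """True if `flag_path` exists and was touched within the last `max_age` seconds."""
    try:
        mtime = os.stat(flag_path).st_mtime
    except OSError:
        return False  # no flag at all: this is not a wake-up false alarm
    if now is None:
        now = time.time()
    return now - mtime <= max_age


def repair_exit_code(flag_path, max_age=120.0):
    """0 defuses the alarm after a recent resume; 1 lets the reboot go ahead."""
    return 0 if recently_resumed(flag_path, max_age) else 1
```

The check is deliberately cheap (one `stat` call) because the repair script, like the test script, has to return quickly.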
My repair script looks at a flag that is set by the pm-suspend script at ACPI wake-up time. If that flag was recently touched, then this is a false alarm, and the repair script blocks the reboot. Otherwise it lets go and the machine reboots.

I might add that I have a reboot counter in /etc/init, so that if the machine reboots too many times in a row, it restarts with the watchdog and power management scripts deactivated, and the keyboard LEDs flashing. This avoids an endless reboot loop, which would do more harm than good, in case a cable is unplugged. The machine is headless, so the keyboard LEDs are a good visual hint.

Gordon, if you're still reading this: in case you're interested in looking at my code, I'll gladly share it with you.

-- 
epoch1970
------------------------------------------------------------------------
epoch1970's Profile: http://forums.slimdevices.com/member.php?userid=16711
View this thread: http://forums.slimdevices.com/showthread.php?t=48521
