>Just wondering if there's any consensus on the use of
>watchdogs. I just locked up my box playing with yet another
>fun freshly-downloaded soundapp from sourceforge...
>(amsynth in this case, but I've seen lockups many many times
>before). Let's face it, it ain't hard to freeze the box and
>kill the keyboard when you're running as root with realtime
>priority. The "Magic SysReq" keys are useless if you get
>no response from the keyboard...
>
>It's not as scary as it used to be now that I use journalling
>filesystems, so I can reboot real quick and not corrupt
>my data. But still, it sucks to have to power down to get
>out of anything.
>
>So, what do we do? Watchdog? In hardware or software?
>Seems intuitive to me that a hardware watchdog would be the
>most reliable, but I haven't looked into it. What's a good
>one, and what do they cost?
the kernel already uses the h/w watchdog timer, but it's not made available to user space - it just catches kernel lockups. SCHED_FIFO lockups are different, and can't be handled by the kernel, since they are not "an error" - it's just an application with "a lot of work to do and permission to take as long as it needs".

jack has just this morning seen the addition of a watchdog thread that runs SCHED_FIFO at a higher priority than the rest of a jack system. as long as the kernel is still running, the watchdog can kill any SCHED_FIFO runaway within jack. it checks every 5 seconds to make sure that progress is being made ... this idea has been discussed here quite a bit.

however, it's much, much harder to do this *between* different processes. again, there is no way to tell a SCHED_FIFO thread that has gone wrong from one that is "just very busy", and you can't even really identify either of these conditions. the only type of thread that could check (a SCHED_FIFO thread with higher priority) will run anyway, regardless of the fact that the rest of the system appears to have locked up.

BTW, my impression is that if magic SysRq doesn't work, you've got more than a SCHED_FIFO hang - you've got a full-scale kernel panic or deadlock.

IMHO, app writers should be using their own watchdogs if they allow SCHED_FIFO. and of course i have to add that jack will take care of all this for you :)

--p
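
P.S. for anyone who wants to roll their own in-app watchdog, here is a minimal sketch of the "check every few seconds that progress is being made" idea. it is written in Python for brevity and is not jack's actual code. the names (ProgressWatchdog, tick, stalled, watch, try_realtime) are made up for illustration, and the 5 second timeout is just the figure mentioned above. the assumption is that the worker calls tick() once per completed cycle and that the watchdog loop runs in a thread with higher SCHED_FIFO priority than the worker, so it still gets scheduled when the worker spins.

```python
import os
import time


class ProgressWatchdog:
    """Detects a stalled worker by watching a counter the worker must keep bumping."""

    def __init__(self, timeout_s=5.0, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock
        self._ticks = 0              # bumped by the worker after each completed cycle
        self._seen_ticks = 0         # counter value observed at the last check
        self._last_progress = clock()

    def tick(self):
        """Called by the worker each time it finishes a unit of work."""
        self._ticks += 1

    def stalled(self):
        """True once the counter has not moved for at least timeout_s seconds."""
        now = self._clock()
        if self._ticks != self._seen_ticks:
            # The worker made progress since the last check: restart the timer.
            self._seen_ticks = self._ticks
            self._last_progress = now
            return False
        return (now - self._last_progress) >= self.timeout_s


def watch(dog, on_stall, poll_s=1.0, sleep=time.sleep):
    """Poll the watchdog until it reports a stall, then call on_stall() once."""
    while not dog.stalled():
        sleep(poll_s)
    on_stall()


def try_realtime(priority):
    """Request SCHED_FIFO at the given priority for the caller (pid 0).

    Returns False if the platform lacks the call or permission is denied.
    """
    try:
        os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(priority))
        return True
    except (AttributeError, OSError):
        return False
```

in a real app, on_stall would demote or kill the runaway thread. note that this only tells you "no cycle completed in N seconds"; it still can't distinguish a broken thread from a very busy one, which is why the timeout has to be generous.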
