Hey Robert,

The following beta release has a bmc-watchdog that has (hopefully) fixed logging.
http://download.gluster.com/pub/freeipmi/qa-release/freeipmi-1.0.2.beta2.tar.gz

If you could check it out, that'd be great.

Al

On Tue, 2011-02-01 at 17:20 -0800, Albert Chu wrote:
> Hi Robert,
>
> On Tue, 2011-02-01 at 11:40 -0800, Robert Hardy wrote:
> > It is possible that there is a BIOS option which starts the watchdog
> > which is enabled.
> > Once I get a chance, I will dig around in the BIOS and see.
>
> I think a more likely scenario would be the IPMI kernel driver is
> starting up the watchdog and racing w/ the FreeIPMI one. Are you
> loading the IPMI kernel driver?
>
> > I would think it would be much better behaviour on startup to do an
> > equivalent to bmc-watchdog -y then start the watchdog.
>
> I had to look this up (b/c I couldn't remember, but was fairly certain)
> the IPMI spec indicates that the watchdog timer is required to be turned
> off when a node is rebooted (27.1).
>
> > Failing to start simply because the BIOS started the countdown seems
> > very very bad to me especially without logging anything.
>
> The logging portion of this issue should be fixed w/ the next release.
>
> > You're left in
> > a state where the watchdog dies quietly and the server hard reboots
> > every couple of minutes.
>
> If the BIOS happens to be starting the countdown, that's *REALLY* bad on
> the part of the BIOS programmers. Whoever starts the countdown needs to
> manage it. It can't be trusted for some other random piece of software
> to handle.
>
> So just so I understand the situation correctly, when you disable the
> bmc-watchdog daemon, does the problem go away? The FreeIPMI
> bmc-watchdog does not start any timer until it determines the timer is
> stopped. Since the timer is already running, it never starts it.
>
> Al
>
> >
> > I'm willing to test anything you send my way. The server isn't really in
> > production yet but will be soon.
> >
> > Ultimately I'm trying to package some better .debs for use on Ubuntu.
> > The current ones are badly packaged, to the point of being unusable.
> > I've re-written the init script for Ubuntu but I'd really like to see an
> > upstart based one....
> >
> > Rob
> >
> > On 2011-02-01 12:54 PM, Albert Chu wrote:
> > > Hey Robert,
> > >
> > > I think I see the problem(s). I call _err_exit(), which writes to
> > > stderr, instead of _daemon_error_exit() which writes to the log. That's
> > > the error logging issue, which is secondary to the real one.
> > >
> > > As for the real issue, I think this is being hit:
> > >
> > >   if (timer_state == IPMI_BMC_WATCHDOG_TIMER_TIMER_STATE_RUNNING)
> > >     _err_exit ("watchdog timer must be stopped before running daemon");
> > >
> > > For some reason, your BMC thinks the watchdog is running from the
> > > start. You could verify w/ bmc-watchdog --get if you don't start the
> > > timer. Perhaps it's a hardware bug?
> > >
> > > As an experiment, would you be willing to try a beta that removed this
> > > check? The issue is, I have no idea what the consequences of removing
> > > this check will be on your motherboard if there's a bug in it.
> > >
> > > Al
> > >
> > > On Mon, 2011-01-31 at 15:11 -0800, Robert Hardy wrote:
> > >> That would be /var/log/freeipmi/bmc-watchdog.log here and nothing is
> > >> logged at startup (or after the unexpected exit) during bootup.
> > >>
> > >> I've put all sorts of debugging lines in my init script for bmc-watchdog.
> > >>
> > >> I finally ended up doing this as root:
> > >>   mv /usr/sbin/bmc-watchdog /usr/sbin/bmc-watchdog.real
> > >>
> > >> and then putting this in /usr/sbin/bmc-watchdog:
> > >>   #!/bin/bash
> > >>   strace -fFv -o /tmp/bmcstrace.log -- /usr/sbin/bmc-watchdog.real "$@"
> > >>
> > >> At bootup the bmc-watchdog initscript does launch a process with a new
> > >> PID but it does NOT log the regular "starting bmc-watchdog daemon". It
> > >> in fact logs nothing at all to /var/log/freeipmi/bmc-watchdog.log DURING
> > >> BOOT UP.
> > >>
> > >> The strace above captured bmc-watchdog running at bootup and the same
> > >> process exiting here at the last few lines:
> > >>
> > >> 1584  semop(229383, {{0, 1, SEM_UNDO}}, 1) = 0
> > >> 1584  nanosleep({0, 1000}, NULL) = 0
> > >> 1584  write(2, "bmc-watchdog.real: watchdog time"..., 72) = -1 EBADF (Bad file descriptor)
> > >> 1584  exit_group(1) = ?
> > >>
> > >> I've posted the entire strace here:
> > >> http://webcon.ca/~rhardy/bmcdrop/
> > >>
> > >> Can you parse that and make any suggestions as to why it would exit
> > >> uncleanly and only on boot up?
> > >>
> > >> I'm not quite sure what is going on, but it seems to be trying to write
> > >> on a bad file descriptor, getting an error and then exiting.
> > >> From the strace, file descriptor 2 is in fact closed so that error
> > >> makes sense to me. The real question is: why is it trying to write to
> > >> FD 2?
> > >>
> > >> When I restart bmc-watchdog, at the same point it properly writes the
> > >> startup message on file descriptor 0, which is the log file that was
> > >> opened earlier...
> > >>
> > >> 2466  write(0, "[Jan 31 18:03:23]: starting bmc-"..., 48) = 48
> > >>
> > >> I'm open to debugging suggestions too... Ideas?
> > >>
> > >> Thanks for your help,
> > >> Rob
> > >>
> > >> On 2011-01-28 5:37 PM, Albert Chu wrote:
> > >>> Hey Robert,
> > >>>
> > >>> That is indeed strange. Does the bmc-watchdog log say anything? (I
> > >>> can't remember the exact location, but I think it's /var/log/freeipmi/
> > >>> something).
> > >>>
> > >>> Al
> > >>>
> > >>> On Thu, 2011-01-27 at 13:14 -0800, Robert Hardy wrote:
> > >>>> I'm running bmc-watchdog 0.7.15-2 under a current Ubuntu 10.04 64 bit
> > >>>> on several fairly new unloaded Supermicro servers.
> > >>>>
> > >>>> On only one (always the same server) of four servers the bmc-watchdog
> > >>>> process quietly exits shortly after start up, leaving the system set
> > >>>> up for a hard reset shortly after bootup.
> > >>>>
> > >>>> The options and builds are identical on all of the servers. These are
> > >>>> my options: OPTIONS="-d -u 2 -p 0 -a 1 -F -P -L -S -O -i 300 -e 60"
> > >>>>
> > >>>> Through debugging I've confirmed on boot up:
> > >>>>
> > >>>> - The init script gets run
> > >>>>
> > >>>> - It launches bmc-watchdog and saves a new PID correctly in
> > >>>>   /var/run/bmc-watchdog.pid.
> > >>>>
> > >>>> - Checking for a bmc-watchdog process in rc.local shows it isn't
> > >>>>   running and the timer is counting down.
> > >>>>
> > >>>> - There is no shutdown message logged when the process disappears
> > >>>>   during bootup.
> > >>>>
> > >>>> - There are no messages suggesting the process was killed
> > >>>>
> > >>>> On shutdown the init script gets as far as removing
> > >>>> /var/run/bmc-watchdog.pid and seems to work fine.
> > >>>>
> > >>>> If I stuff this in rc.local the bmc-watchdog starts up properly and
> > >>>> never seems to die again until the next reboot:
> > >>>>   /usr/sbin/service bmc-watchdog stop
> > >>>>   /usr/sbin/service bmc-watchdog start
> > >>>>
> > >>>> All in all this is very weird behaviour. Is it possible a newer
> > >>>> version of bmc-watchdog would address this? i.e. is this a known bug?
> > >>>>
> > >>>> Any other ideas why this is happening (or how I can debug further)?
> > >>>>
> > >>>> Regards,
> > >>>> Rob
> > >>>>
> > >>>> _______________________________________________
> > >>>> Freeipmi-users mailing list
> > >>>> [email protected]
> > >>>> http://lists.gnu.org/mailman/listinfo/freeipmi-users
> >
--
Albert Chu
[email protected]
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory

_______________________________________________
Freeipmi-users mailing list
[email protected]
http://lists.gnu.org/mailman/listinfo/freeipmi-users
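
For reference, the start-up order Robert suggests in the thread (do the equivalent of bmc-watchdog -y, then start the daemon) could be sketched in an init script roughly as follows. This is only an illustration, not the init script shipped with FreeIPMI: the function name start_watchdog and the BMC_WATCHDOG override variable are hypothetical, while the -y flag and the OPTIONS string are the ones quoted above. Al's caution still applies: if the BIOS or the IPMI kernel driver started the countdown, stopping it here hides that rather than fixing it.

```shell
#!/bin/sh
# Sketch only: stop any watchdog timer that is already counting down,
# then start the daemon. BMC_WATCHDOG and start_watchdog are hypothetical
# names chosen for this example.
BMC_WATCHDOG="${BMC_WATCHDOG:-/usr/sbin/bmc-watchdog}"

# Daemon options as quoted in the thread.
OPTIONS="-d -u 2 -p 0 -a 1 -F -P -L -S -O -i 300 -e 60"

start_watchdog() {
    # -y asks bmc-watchdog to stop the current timer, so the daemon's
    # "timer must be stopped" check does not trip on a running timer.
    if ! "$BMC_WATCHDOG" -y; then
        echo "could not stop the running watchdog timer" >&2
        return 1
    fi
    # OPTIONS is left unquoted on purpose so it splits into separate arguments.
    "$BMC_WATCHDOG" $OPTIONS
}
```

An init script's start action would then call start_watchdog instead of launching the daemon directly. If the stop step fails, the function returns non-zero without starting the daemon, so the failure is visible to the caller instead of the process exiting silently.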
