Re: [Freeipmi-users] bmc-watchdog 0.7.15-2 exiting under Ubuntu 10.04

2011-02-08 Thread Robert Hardy

On Mon, 7 Feb 2011, Albert Chu wrote:

That message points me to the real problem. Let's hope a BIOS update
fixes the watchdog weirdness.


Glad things are being reported correctly now.  Yeah, not sure what could
be going wrong underneath the covers in the firmware.  Hopefully a
firmware update can fix things.


Hi Al,

Just so there is a record for the mailing list archives: in my case I saw a
bug in my Supermicro 5016T-MTFB SuperServer which caused the BMC watchdog
to count down on reboot. I just upgraded the BIOS from 1.0 to 2.0 and the
system no longer does this. Thanks for your help in debugging.

Regards,
Rob

___
Freeipmi-users mailing list
Freeipmi-users@gnu.org
http://lists.gnu.org/mailman/listinfo/freeipmi-users


Re: [Freeipmi-users] bmc-watchdog 0.7.15-2 exiting under Ubuntu 10.04

2011-02-07 Thread Robert Hardy

On 2011-02-01 8:48 PM, Albert Chu wrote:

Hey Robert,

The following beta release has a bmc-watchdog that has (hopefully) fixed
logging.

http://download.gluster.com/pub/freeipmi/qa-release/freeipmi-1.0.2.beta2.tar.gz

If you could check it out, that'd be great.

Al


Sorry for the delay, things got busy. I just tested a local build of 
1.0.2-beta2.


I now get this error message on the machine in question:
[Feb 07 16:44:12]: Error: watchdog timer must be stopped before running daemon


I haven't done anything different on this box than the other three so I 
doubt I'm running the IPMI kernel driver.


That message points me to the real problem. Let's hope a BIOS update
fixes the watchdog weirdness.


Regards,
Rob





Re: [Freeipmi-users] bmc-watchdog 0.7.15-2 exiting under Ubuntu 10.04

2011-02-01 Thread Robert Hardy

It is possible that a BIOS option which starts the watchdog is enabled.

Once I get a chance, I will dig around in the BIOS and see.

I would think it would be much better behaviour on startup to do the
equivalent of bmc-watchdog -y and then start the watchdog.


Failing to start simply because the BIOS started the countdown seems
very bad to me, especially without logging anything. You're left in
a state where the watchdog daemon dies quietly and the server hard
reboots every couple of minutes.
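
A minimal sketch of the stop-then-start sequence suggested above, written as an init-script fragment. It is an illustration under assumptions, not what the packaged init script does: the function name start_watchdog and the BMC_WATCHDOG variable are hypothetical, -y is the stop option mentioned in this thread, and the OPTIONS default is the option string quoted elsewhere in this thread. It also assumes that bmc-watchdog -y exits non-zero when it cannot stop the timer.

```shell
#!/bin/sh
# Hypothetical init-script fragment: clear a timer that the BIOS/BMC left
# running, then launch the daemon.
BMC_WATCHDOG="${BMC_WATCHDOG:-/usr/sbin/bmc-watchdog}"
OPTIONS="${OPTIONS:--d -u 2 -p 0 -a 1 -F -P -L -S -O -i 300 -e 60}"

start_watchdog() {
    # Stop any countdown that is already in progress (assumed to return
    # non-zero on failure).
    if ! "$BMC_WATCHDOG" -y; then
        echo "bmc-watchdog: could not stop the running timer" >&2
        return 1
    fi
    # OPTIONS is deliberately unquoted so it splits into separate arguments.
    "$BMC_WATCHDOG" $OPTIONS
}
```

An init script would call start_watchdog from its start action. Whether unconditionally stopping the timer is safe on a board with a firmware bug is a separate question that this sketch does not answer.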


I'm willing to test anything you send my way. The server isn't really in 
production yet but will be soon.


Ultimately I'm trying to package some better .debs for use on Ubuntu.
The current ones are badly packaged, to the point of being unusable.
I've re-written the init script for Ubuntu but I'd really like to see an
Upstart-based one.


Rob

On 2011-02-01 12:54 PM, Albert Chu wrote:

Hey Robert,

I think I see the problem(s).  I call _err_exit(), which writes to
stderr, instead of _daemon_error_exit() which writes to the log.  That's
the error logging issue, which is secondary to the real one.

As for the real issue, I think this is being hit:

   if (timer_state == IPMI_BMC_WATCHDOG_TIMER_TIMER_STATE_RUNNING)
     _err_exit ("watchdog timer must be stopped before running daemon");

For some reason, your BMC thinks the watchdog is running from the
start.  You could verify with bmc-watchdog --get if you don't start the
timer.  Perhaps it's a hardware bug?
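
The check suggested above could be scripted roughly as follows. This is a sketch only: the helper name timer_is_running is hypothetical, and it assumes that bmc-watchdog --get prints a line of the form "Timer: Running" or "Timer: Stopped", which should be confirmed against the output of the installed version.

```shell
#!/bin/sh
# Reads `bmc-watchdog --get` output on stdin and succeeds if the timer is
# reported as running.
# ASSUMPTION: the output contains a line starting with "Timer:" followed by
# "Running" or "Stopped"; this is not taken from the thread.
timer_is_running() {
    grep -Eq '^Timer:[[:space:]]+Running'
}
```

Usage would be along the lines of `bmc-watchdog --get | timer_is_running && echo "timer already running"`, run before the daemon is started.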

As an experiment, would you be willing to try a beta that removed this
check?  The issue is, I have no idea what the consequences of removing
this check will be on your motherboard if there's a bug in it.

Al

On Mon, 2011-01-31 at 15:11 -0800, Robert Hardy wrote:

That would be /var/log/freeipmi/bmc-watchdog.log here and nothing is
logged at startup (or after the unexpected exit) during bootup.

I've put all sorts of debugging lines in my init script for bmc-watchdog.

I finally ended up doing this as root:
mv /usr/sbin/bmc-watchdog /usr/sbin/bmc-watchdog.real

and then putting this in /usr/sbin/bmc-watchdog:
#!/bin/bash
# Run the real binary under strace; -f follows forks, -o writes the trace to a file.
strace -fFv -o /tmp/bmcstrace.log -- /usr/sbin/bmc-watchdog.real "$@"

At bootup the bmc-watchdog initscript does launch a process with a new
PID but it does NOT log the regular "starting bmc-watchdog daemon"
message. It in fact logs nothing at all to
/var/log/freeipmi/bmc-watchdog.log DURING BOOT UP.

The strace above captured bmc-watchdog running at bootup and the same
process exiting here at the last few lines:

1584  semop(229383, {{0, 1, SEM_UNDO}}, 1) = 0
1584  nanosleep({0, 1000}, NULL) = 0
1584  write(2, "bmc-watchdog.real: watchdog time"..., 72) = -1 EBADF (Bad file descriptor)
1584  exit_group(1) = ?

I've posted the entire strace here:
http://webcon.ca/~rhardy/bmcdrop/

Can you parse that and make any suggestions as to why it would exit
uncleanly and only on boot up?

I'm not quite sure what is going on, but it seems to be trying to write
to a bad file descriptor, getting an error and then exiting.
From the strace, file descriptor 2 is in fact closed, so that error
makes sense to me. The real question is: why is it trying to write to FD 2?

When I restart bmc-watchdog and it gets to the same place, it properly
writes the startup message to file descriptor 0, which is the log file
that was opened earlier...

2466  write(0, "[Jan 31 18:03:23]: starting bmc-"..., 48) = 48
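
One way to confirm which file descriptors a running daemon actually has open, without a full strace, is to look at /proc. A minimal sketch, assuming a Linux /proc filesystem and a readlink binary; fd_target is a hypothetical helper name.

```shell
#!/bin/sh
# Print what file descriptor $2 of process $1 points at (Linux /proc only).
# Returns non-zero if that descriptor is closed or the process is gone.
# Inspecting another user's process normally requires root.
fd_target() {
    readlink "/proc/$1/fd/$2"
}
```

For example, `fd_target "$(cat /var/run/bmc-watchdog.pid)" 2` would be expected to fail if stderr is closed, while the same call with 0 would be expected to print the log file path in the situation described above.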

I'm open to debugging suggestions too... Ideas?

Thanks for your help,
Rob

On 2011-01-28 5:37 PM, Albert Chu wrote:

Hey Robert,

That is indeed strange.  Does the bmc-watchdog log say anything? (I
can't remember the exact location, but I think it's /var/log/freeipmi/
something).

Al

On Thu, 2011-01-27 at 13:14 -0800, Robert Hardy wrote:

I'm running bmc-watchdog 0.7.15-2 under a current 64-bit Ubuntu 10.04 on
several fairly new unloaded Supermicro servers.

On only one of the four servers (always the same one), the bmc-watchdog
process quietly exits shortly after startup, leaving the system set up for a
hard reset shortly after bootup.

The options and builds are identical on all of the servers. These are my
options: OPTIONS="-d -u 2 -p 0 -a 1 -F -P -L -S -O -i 300 -e 60"

Through debugging I've confirmed on boot up:

- The init script gets run

- It launches bmc-watchdog and saves a new PID correctly in
/var/run/bmc-watchdog.pid.

- Checking for a bmc-watchdog process in rc.local shows it isn't running and
 the timer is counting down.

- There is no shutdown message logged when the process disappears during bootup.

- There are no messages suggesting the process was killed

On shutdown the init script gets as far as removing
/var/run/bmc-watchdog.pid and seems to work fine.

If I stuff this in rc.local the bmc-watchdog starts up properly and never
seems to die again until the next reboot:
/usr/sbin/service bmc-watchdog stop
/usr/sbin/service bmc-watchdog start

All in all this is very weird behaviour. Is it possible a newer version

[Freeipmi-users] bmc-watchdog 0.7.15-2 exiting under Ubuntu 10.04

2011-01-27 Thread Robert Hardy

I'm running bmc-watchdog 0.7.15-2 under a current 64-bit Ubuntu 10.04 on
several fairly new unloaded Supermicro servers.

On only one of the four servers (always the same one), the bmc-watchdog
process quietly exits shortly after startup, leaving the system set up for a
hard reset shortly after bootup.

The options and builds are identical on all of the servers. These are my
options: OPTIONS="-d -u 2 -p 0 -a 1 -F -P -L -S -O -i 300 -e 60"

Through debugging I've confirmed on boot up:

- The init script gets run

- It launches bmc-watchdog and saves a new PID correctly in
/var/run/bmc-watchdog.pid.

- Checking for a bmc-watchdog process in rc.local shows it isn't running and
  the timer is counting down.

- There is no shutdown message logged when the process disappears during bootup.

- There are no messages suggesting the process was killed
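
The rc.local check in the list above could be written along these lines. A minimal sketch: daemon_alive is a hypothetical helper name, the default pid file path is the one quoted above, and kill -0 only reports processes the caller is allowed to signal, so this assumes it runs as root (as rc.local does).

```shell
#!/bin/sh
# Succeeds if the PID recorded in the given pid file belongs to a live process.
daemon_alive() {
    pidfile="${1:-/var/run/bmc-watchdog.pid}"
    [ -r "$pidfile" ] || return 1          # no pid file: treat as not running
    pid=$(cat "$pidfile") || return 1
    case "$pid" in
        ''|*[!0-9]*) return 1 ;;           # empty or non-numeric content
    esac
    # Signal 0 delivers nothing; it only tests whether the process exists.
    kill -0 "$pid" 2>/dev/null
}
```

In rc.local this could be used as `daemon_alive || echo "bmc-watchdog is not running"`, which distinguishes a stale pid file from a daemon that is still alive.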

On shutdown the init script gets as far as removing
/var/run/bmc-watchdog.pid and seems to work fine.

If I stuff this in rc.local the bmc-watchdog starts up properly and never
seems to die again until the next reboot:
/usr/sbin/service bmc-watchdog stop
/usr/sbin/service bmc-watchdog start

All in all this is very weird behaviour. Is it possible a newer version of
bmc-watchdog would address this? i.e. is this a known bug?

Any other ideas why this is happening (or how I can debug further)?

Regards,
Rob
