Bug#554793: linux-image-2.6.26-2-amd64: ipmi_watchdog module, frequently returns errno=16 - Device or resource busy on Dell Poweredge 860s

2010-01-08 Thread Tim Small
Sorry - just to further clarify, this is what would happen without the 
patch...


1. wd_keepalive daemon is started early in the boot process, loads 
ipmi_watchdog and opens + starts to write to /dev/watchdog

2. watchdog init script sends TERM to wd_keepalive daemon
3. watchdog init script starts watchdog daemon without waiting for 
wd_keepalive to exit
4. watchdog daemon attempts to open /dev/watchdog.  It fails to do so 
because the device is already open.

5. wd_keepalive daemon closes /dev/watchdog
6. wd_keepalive daemon exits
7. 60 seconds after step 5, the machine gets hard-reset 
(ipmi_watchdog defaults to nowayout=1 and timeout=60)
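
The ordering problem in steps 1-6 can be replayed with a toy model. This is only a sketch: SingleOpenDevice and handover are hypothetical names, and the class merely imitates a device node that refuses a second opener with EBUSY; it does not touch /dev/watchdog or the real daemons.

```python
import errno


class SingleOpenDevice:
    """Toy stand-in for a device node that allows only one opener at a time."""

    def __init__(self):
        self.holder = None  # name of the process that currently has it open

    def open(self, who):
        if self.holder is not None:
            raise OSError(errno.EBUSY, "Device or resource busy")
        self.holder = who

    def close(self, who):
        if self.holder == who:
            self.holder = None


def handover(wait_for_keepalive_exit):
    """Replay steps 1-6; return (final holder of the device, open errno or None)."""
    dev = SingleOpenDevice()
    dev.open("wd_keepalive")            # step 1
    # step 2: TERM is sent to wd_keepalive at this point
    if wait_for_keepalive_exit:         # hypothetical ordering: wait for the exit first
        dev.close("wd_keepalive")
    error = None
    try:
        dev.open("watchdog")            # steps 3-4
    except OSError as exc:
        error = exc.errno
    dev.close("wd_keepalive")           # steps 5-6 (no-op if it already closed)
    return dev.holder, error
```

With handover(False) nobody holds the device at the end and the open has failed with EBUSY, which is the state that leads to step 7; with handover(True) the watchdog daemon ends up holding it.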


A similar bug could occur during reboot/shutdown if the machine took 
more than 60 seconds to reboot (i.e. between /etc/init.d/watchdog stop 
and the actual machine reboot), except that the roles of the watchdog 
daemon and the wd_keepalive daemon would be swapped in the above description.


Thanks,

Tim.

--
South East Open Source Solutions Limited
Registered in England and Wales with company number 06134732.  
Registered Office: 2 Powell Gardens, Redhill, Surrey, RH1 1TQ

VAT number: 900 6633 53  http://seoss.co.uk/ +44-(0)1273-808309




--
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org



Bug#554793: linux-image-2.6.26-2-amd64: ipmi_watchdog module frequently returns errno=16 - Device or resource busy on Dell Poweredge 860s

2009-12-05 Thread Michael Meskes
> Ok, some more details. It reboots shortly after starting wd_keepalive,
> early in the boot sequence. Maybe it is failing to keep up because of
> all the disk activity during boot?

Maybe. By default the device has to be triggered once a minute, which
seems like quite a lot of time for a program not to be scheduled at all.

> After booting single-user, I was able to both start and stop watchdog
> without any issues.
>
> > Could you please try with verbose on and send me the log entries?
>
> The -v option in /etc/default/watchdog did not produce any output.
> Maybe it was too early in the boot sequence?

Sorry, my bad, I should have explained better. wd_keepalive does not react to
most options. To get more information you have to start watchdog itself. The
only reason for it to start late is that it can be used to monitor server
processes. Unless you use this, you can start it very early, at about the time
wd_keepalive is started. Once the real daemon (and syslog) is running, we get
significantly more information in the logs.

Michael
-- 
Michael Meskes
Michael at Fam-Meskes dot De, Michael at Meskes dot (De|Com|Net|Org)
Michael at BorussiaFan dot De, Meskes at (Debian|Postgresql) dot Org
ICQ: 179140304, AIM/Yahoo/Skype: michaelmeskes, Jabber: mes...@jabber.org
VfL Borussia! Forca Barca! Go SF 49ers! Use: Debian GNU/Linux, PostgreSQL






Bug#554793: linux-image-2.6.26-2-amd64: ipmi_watchdog module frequently returns errno=16 - Device or resource busy on Dell Poweredge 860s

2009-12-02 Thread Marcus Better
Michael Meskes wrote:
> You mean that it reboots although watchdog is up and running?

Ok, some more details. It reboots shortly after starting wd_keepalive,
early in the boot sequence. Maybe it is failing to keep up because of
all the disk activity during boot?

After booting single-user, I was able to both start and stop watchdog
without any issues.

> Could you please try with verbose on and send me the log entries?

The -v option in /etc/default/watchdog did not produce any output.
Maybe it was too early in the boot sequence?

Cheers,

Marcus



Bug#554793: linux-image-2.6.26-2-amd64: ipmi_watchdog module frequently returns errno=16 - Device or resource busy on Dell Poweredge 860s

2009-11-24 Thread Michael Meskes
On Wed, Nov 18, 2009 at 02:08:00PM +0100, Marcus Better wrote:
> My HP Proliant server rebooted tonight for no apparent reason, after
> weeks or months of uptime. After that it started rebooting during the
> boot sequence at around the same point (shortly after going multiuser, I
> think). I tracked it to the watchdog daemon. Disabling it fixed the
> issue. Restarting the watchdog causes a reboot within seconds. Also,
> stopping the watchdog service without having disabled it (thus starting
> wd_keepalive) also reboots the system.

You mean that it reboots although watchdog is up and running?

> I didn't see any watchdog-related error messages in the syslog, but then
> I didn't have verbose mode enabled.

Could you please try with verbose on and send me the log entries? This sounds
rather strange, as if the kernel weren't accepting the writes watchdog makes,
or as if watchdog weren't writing at all.
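
For readers unfamiliar with the writes in question, a keepalive loop boils down to something like the following. This is a minimal sketch and not the daemon's actual code; keepalive, beats and interval are hypothetical names, and the path is a parameter so the loop can be tried against a harmless file such as /dev/null rather than the real device.

```python
import os
import time


def keepalive(path, beats, interval):
    """Open `path` once, write one byte per interval, return bytes written."""
    fd = os.open(path, os.O_WRONLY)   # the device is opened a single time
    written = 0
    try:
        for _ in range(beats):
            # On a real watchdog device a write restarts the countdown.
            written += os.write(fd, b"\0")
            time.sleep(interval)
    finally:
        os.close(fd)
    return written
```

Calling keepalive(os.devnull, 3, 0.0) exercises the loop without arming anything. Pointing it at /dev/watchdog would arm the real timer, so that is best left to a test machine.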

Michael



Bug#554793: linux-image-2.6.26-2-amd64: ipmi_watchdog module frequently returns errno=16 - Device or resource busy on Dell Poweredge 860s

2009-11-24 Thread Marcus Better
Michael Meskes wrote:
> You mean that it reboots although watchdog is up and running?

It would seem so.

> > I didn't see any watchdog-related error messages in the syslog, but then
> > I didn't have verbose mode enabled.
>
> Could you please try with verbose on and send me the log entries?

The machine is in production, so I will have to wait until there is a
scheduled reboot for some other reason.

Marcus



Bug#554793: linux-image-2.6.26-2-amd64: ipmi_watchdog module frequently returns errno=16 - Device or resource busy on Dell Poweredge 860s

2009-11-18 Thread Marcus Better
My HP Proliant server rebooted tonight for no apparent reason, after
weeks or months of uptime. After that it started rebooting during the
boot sequence at around the same point (shortly after going multiuser, I
think). I tracked it to the watchdog daemon. Disabling it fixed the
issue. Restarting the watchdog causes a reboot within seconds. Also,
stopping the watchdog service without having disabled it (thus starting
wd_keepalive) also reboots the system.

I didn't see any watchdog-related error messages in the syslog, but then
I didn't have verbose mode enabled.

I'm also using the IPMI watchdog module, with kernel
2.6.26-1-openvz-amd64, watchdog 5.6-8.

Cheers,

Marcus



Bug#554793: linux-image-2.6.26-2-amd64: ipmi_watchdog module frequently returns errno=16 - Device or resource busy on Dell Poweredge 860s

2009-11-08 Thread Michael Meskes
On Sat, Nov 07, 2009 at 10:26:24PM +0000, Ben Hutchings wrote:
> 1. EBUSY indicates that the watchdog is opening it more than once, which
> is obviously incorrect behaviour.

To the best of my knowledge, watchdog only opens the device once, which
obviously makes your conclusion wrong as well.

> 2. Failure to open the device will not result in the device being
> closed, except in the case of (1).

Well I can think of different reasons ...

> 3. I had a look at the watchdog daemon's source and repeatedly went
> WTF?.  I am now inclined to assume it is doing the wrong thing unless
> proved otherwise.

Now this is a strong accusation that you hopefully have some proof for.

Michael




Bug#554793: linux-image-2.6.26-2-amd64: ipmi_watchdog module frequently returns errno=16 - Device or resource busy on Dell Poweredge 860s

2009-11-08 Thread Michael Meskes
On Fri, Nov 06, 2009 at 04:00:08PM +0000, Tim Small wrote:
> Package: linux-image-2.6.26-2-amd64
> Version: 2.6.26-17lenny1
> Severity: normal
>
> Opening /dev/watchdog as provided by ipmi_watchdog on a Dell PowerEdge
> 860 running Lenny 5.0 (64 bit), frequently fails with EBUSY.

Could you please try the watchdog daemon package from backports.org? There was
a race between stopping wd_keepalive and starting watchdog that was fixed
after Lenny was released.
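
One generic way to tolerate such a race, sketched here purely as an illustration and not as a description of what the fixed package does, is to retry the open for a while when it fails with EBUSY. open_with_retry, opener, attempts and delay are hypothetical names; the opener is passed in so the logic can be tried without the real device.

```python
import errno
import time


def open_with_retry(opener, attempts=10, delay=0.5):
    """Call opener() until it succeeds, retrying only on EBUSY."""
    for attempt in range(attempts):
        try:
            return opener()
        except OSError as exc:
            # Give up on any other error, or once the retries are used up.
            if exc.errno != errno.EBUSY or attempt == attempts - 1:
                raise
            time.sleep(delay)
```

On a real system the opener might be lambda: os.open("/dev/watchdog", os.O_WRONLY); the retry budget would have to stay well below the watchdog timeout for this to help.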

Michael



Bug#554793: linux-image-2.6.26-2-amd64: ipmi_watchdog module frequently returns errno=16 - Device or resource busy on Dell Poweredge 860s

2009-11-07 Thread Ben Hutchings
On Fri, 2009-11-06 at 16:00 +0000, Tim Small wrote:
> Package: linux-image-2.6.26-2-amd64
> Version: 2.6.26-17lenny1
> Severity: normal
>
> Opening /dev/watchdog as provided by ipmi_watchdog on a Dell PowerEdge
> 860 running Lenny 5.0 (64 bit), frequently fails with EBUSY.
>
> Nov  5 11:50:09 kernel: [   29.583805] IPMI Watchdog: driver initialized
> Nov  5 11:50:12 watchdog[3239]: starting daemon (5.4):
> Nov  5 11:50:12 watchdog[3239]: int=10s realtime=yes sync=no
> soft=no mla=0 mem=140733193388032
>
> [...]
>
> Nov  5 11:50:12 watchdog[3239]: test=none(0) repair=none
> alive=/dev/watchdog heartbeat=none temp=none to=root no_act=no
> Nov  5 11:50:12 watchdog[3239]: cannot open /dev/watchdog (errno
> = 16 = 'Device or resource busy')
> Nov  5 11:50:12 kernel: [   57.375695] IPMI Watchdog: Unexpected
> close, not stopping watchdog
>
>
> worse, if the module is loaded with the nowayout=1 - the machine then
> gets hard-reset timeout seconds later!

The watchdog device cannot be closed if it was not successfully opened.
This is a problem with the watchdog daemon.

Ben.

-- 
Ben Hutchings
The generation of random numbers is too important to be left to chance.
- Robert Coveyou




Bug#554793: linux-image-2.6.26-2-amd64: ipmi_watchdog module frequently returns errno=16 - Device or resource busy on Dell Poweredge 860s

2009-11-07 Thread Michael Meskes
On Sat, Nov 07, 2009 at 03:40:22PM +0000, Ben Hutchings wrote:
> > worse, if the module is loaded with the nowayout=1 - the machine then
> > gets hard-reset timeout seconds later!
>
> The watchdog device cannot be closed if it was not successfully opened.
> This is a problem with the watchdog daemon.

Ben, would you mind giving us a little more information as to why you think
this is a bug in the daemon? I agree that the device cannot be closed if it
wasn't successfully opened. But this does not explain why the device cannot be
opened. Also, if the device was somehow opened, the system would be hard reset
no matter whether nowayout was set or not. If nowayout has an effect on the
reset, the device is indeed being closed.

Tim, is there any other software running that accesses /dev/watchdog? What
happens if you do not start watchdog in the boot process but instead start it
manually once the system is completely booted?

Michael




Bug#554793: linux-image-2.6.26-2-amd64: ipmi_watchdog module frequently returns errno=16 - Device or resource busy on Dell Poweredge 860s

2009-11-07 Thread Ben Hutchings
On Sat, 2009-11-07 at 22:45 +0100, Michael Meskes wrote:
> On Sat, Nov 07, 2009 at 03:40:22PM +0000, Ben Hutchings wrote:
> > > worse, if the module is loaded with the nowayout=1 - the machine then
> > > gets hard-reset timeout seconds later!
> >
> > The watchdog device cannot be closed if it was not successfully opened.
> > This is a problem with the watchdog daemon.
>
> Ben, would you mind giving us a little bit more of information as to why you
> think this is a bug in the daemon?
[...]

1. EBUSY indicates that the watchdog is opening it more than once, which
is obviously incorrect behaviour.
2. Failure to open the device will not result in the device being
closed, except in the case of (1).
3. I had a look at the watchdog daemon's source and repeatedly went
WTF?.  I am now inclined to assume it is doing the wrong thing unless
proved otherwise.

Ben.
