Bug#554793: linux-image-2.6.26-2-amd64: ipmi_watchdog module, frequently returns errno=16 - Device or resource busy on Dell Poweredge 860s
Sorry - just to further clarify, this is what would happen without the patch:

1. The wd_keepalive daemon is started early in the boot process, loads ipmi_watchdog, then opens /dev/watchdog and starts writing to it.
2. The watchdog init script sends TERM to the wd_keepalive daemon.
3. The watchdog init script starts the watchdog daemon without waiting for wd_keepalive to exit.
4. The watchdog daemon attempts to open /dev/watchdog and fails, because the device is already open.
5. The wd_keepalive daemon closes /dev/watchdog.
6. The wd_keepalive daemon exits.
7. 60 seconds after step 5, the machine gets hard-reset (ipmi_watchdog defaults to nowayout=1 and timeout=60).

A similar bug could occur during reboot/shutdown if the machine took more than 60 seconds to reboot (i.e. between /etc/init.d/watchdog stop and the actual machine reboot), except that the roles of the watchdog daemon and the wd_keepalive daemon would be reversed in the above description.

Thanks,

Tim.

-- 
South East Open Source Solutions Limited
Registered in England and Wales with company number 06134732.
Registered Office: 2 Powell Gardens, Redhill, Surrey, RH1 1TQ
VAT number: 900 6633 53
http://seoss.co.uk/
+44-(0)1273-808309

-- 
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org
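The seven steps above can be replayed against a toy model of the device. This is only a sketch under stated assumptions: `WatchdogDevice` and `replay_boot_race` are hypothetical names, the model is not the real ipmi_watchdog driver, the timestamps are made up, and wd_keepalive is assumed to write its last keepalive just before it closes the device.

```python
import errno


class WatchdogDevice:
    """Toy model of a single-open watchdog device (not the real driver)."""

    def __init__(self, timeout=60, nowayout=True):
        self.timeout = timeout      # seconds without a keepalive before reset
        self.nowayout = nowayout    # if True, closing does not stop the timer
        self.holder = None          # name of the process holding the device
        self.deadline = None        # time of the pending reset, None if stopped

    def open(self, who, now):
        if self.holder is not None:
            # Only one opener at a time: a second open fails with EBUSY.
            raise OSError(errno.EBUSY, "Device or resource busy")
        self.holder = who
        self.deadline = now + self.timeout

    def keepalive(self, who, now):
        if self.holder != who:
            raise OSError(errno.EBADF, "Bad file descriptor")
        self.deadline = now + self.timeout

    def close(self, who, now):
        if self.holder != who:
            raise OSError(errno.EBADF, "Bad file descriptor")
        self.holder = None
        if not self.nowayout:
            self.deadline = None    # timer stopped, no reset pending


def replay_boot_race():
    """Replay steps 1-7 and return (errno of the failed open, reset time)."""
    dev = WatchdogDevice(timeout=60, nowayout=True)
    dev.open("wd_keepalive", now=0)            # step 1
    # Steps 2-3: TERM is sent, watchdog is started without waiting.
    open_error = None
    try:
        dev.open("watchdog", now=11)           # step 4: device still held
    except OSError as exc:
        open_error = exc.errno
    dev.keepalive("wd_keepalive", now=12)      # assumed last write
    dev.close("wd_keepalive", now=12)          # steps 5-6
    return open_error, dev.deadline            # step 7: reset at the deadline
```

In this model `replay_boot_race()` returns `errno.EBUSY` for the failed open and a reset time 60 seconds after the close, with nothing left running that could write another keepalive.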
Bug#554793: linux-image-2.6.26-2-amd64: ipmi_watchdog module frequently returns errno=16 - Device or resource busy on Dell Poweredge 860s
> Ok, some more details. It reboots shortly after starting wd_keepalive, early in the boot sequence. Maybe it is failing to keep up because of all the disk activity during boot?

Maybe. The default is that the device has to be triggered once a minute, which seems like quite a lot of time for a program not to be scheduled at all.

> After booting single-user, I was able to both start and stop watchdog without any issues.
>
> > Could you please try with verbose on and send me the log entries?
>
> The -v option in /etc/default/watchdog did not produce any output. Maybe it was too early in the boot sequence?

Sorry, my bad, I should have explained better. wd_keepalive does not react to most options. To get more information you have to start watchdog itself. The only reason for watchdog to start late is that it can be used to monitor server processes. Unless you use this, you can start it very early, at about the time wd_keepalive is started. Once the real daemon (and syslog) is running, we get significantly more information in the logs.

Michael

-- 
Michael Meskes
Michael at Fam-Meskes dot De, Michael at Meskes dot (De|Com|Net|Org)
Michael at BorussiaFan dot De, Meskes at (Debian|Postgresql) dot Org
ICQ: 179140304, AIM/Yahoo/Skype: michaelmeskes, Jabber: mes...@jabber.org
VfL Borussia! Forca Barca! Go SF 49ers! Use: Debian GNU/Linux, PostgreSQL
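For reference, the job of a keepalive daemon boils down to a loop like the one below. It is a minimal sketch, not the code of wd_keepalive or watchdog: `keepalive_loop` is a hypothetical helper, the 10 second interval is an assumption chosen to sit well below the one-minute limit mentioned above, and `dev` stands for /dev/watchdog opened for unbuffered binary writing.

```python
import time


def keepalive_loop(dev, interval=10, rounds=None, sleep=time.sleep):
    """Write to an already-open watchdog device every `interval` seconds.

    `dev` is any binary file-like object. `rounds=None` loops forever;
    a number limits the loop, which is handy for testing.
    Returns the number of keepalives written.
    """
    done = 0
    while rounds is None or done < rounds:
        dev.write(b"\0")    # a write to the device resets the timer
        dev.flush()
        done += 1
        if rounds is None or done < rounds:
            sleep(interval)
    return done
```

If the process is not scheduled for longer than the device timeout, or the write never reaches the driver, the timer expires and the machine resets, which is the failure mode being discussed here.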
Bug#554793: linux-image-2.6.26-2-amd64: ipmi_watchdog module frequently returns errno=16 - Device or resource busy on Dell Poweredge 860s
Michael Meskes wrote:
> You mean that it reboots although watchdog is up and running?

Ok, some more details. It reboots shortly after starting wd_keepalive, early in the boot sequence. Maybe it is failing to keep up because of all the disk activity during boot?

After booting single-user, I was able to both start and stop watchdog without any issues.

> Could you please try with verbose on and send me the log entries?

The -v option in /etc/default/watchdog did not produce any output. Maybe it was too early in the boot sequence?

Cheers,

Marcus
Bug#554793: linux-image-2.6.26-2-amd64: ipmi_watchdog module frequently returns errno=16 - Device or resource busy on Dell Poweredge 860s
On Wed, Nov 18, 2009 at 02:08:00PM +0100, Marcus Better wrote:
> My HP Proliant server rebooted tonight for no apparent reason, after weeks or months of uptime. After that it started rebooting during the boot sequence at around the same point (shortly after going multiuser, I think).
>
> I tracked it to the watchdog daemon. Disabling it fixed the issue. Restarting the watchdog causes a reboot within seconds. Also, stopping the watchdog service without having disabled it (thus starting wd_keepalive) also reboots the system.

You mean that it reboots although watchdog is up and running?

> I didn't see any watchdog-related error messages in the syslog, but then I didn't have verbose mode enabled.

Could you please try with verbose on and send me the log entries? This sounds kind of strange, as if the kernel wasn't accepting the writes watchdog does or as if watchdog wasn't writing at all.

Michael
Bug#554793: linux-image-2.6.26-2-amd64: ipmi_watchdog module frequently returns errno=16 - Device or resource busy on Dell Poweredge 860s
Michael Meskes wrote:
> You mean that it reboots although watchdog is up and running?

It would seem so.

> > I didn't see any watchdog-related error messages in the syslog, but then I didn't have verbose mode enabled.
>
> Could you please try with verbose on and send me the log entries?

The machine is in production, so I will have to wait until there is a scheduled reboot for some other reason.

Marcus
Bug#554793: linux-image-2.6.26-2-amd64: ipmi_watchdog module frequently returns errno=16 - Device or resource busy on Dell Poweredge 860s
My HP Proliant server rebooted tonight for no apparent reason, after weeks or months of uptime. After that it started rebooting during the boot sequence at around the same point (shortly after going multiuser, I think).

I tracked it to the watchdog daemon. Disabling it fixed the issue. Restarting the watchdog causes a reboot within seconds. Also, stopping the watchdog service without having disabled it (thus starting wd_keepalive) also reboots the system.

I didn't see any watchdog-related error messages in the syslog, but then I didn't have verbose mode enabled.

I'm also using the IPMI watchdog module, with kernel 2.6.26-1-openvz-amd64, watchdog 5.6-8.

Cheers,

Marcus
Bug#554793: linux-image-2.6.26-2-amd64: ipmi_watchdog module frequently returns errno=16 - Device or resource busy on Dell Poweredge 860s
On Sat, Nov 07, 2009 at 10:26:24PM +0000, Ben Hutchings wrote:
> 1. EBUSY indicates that the watchdog is opening it more than once, which is obviously incorrect behaviour.

To the best of my knowledge watchdog only opens the device once, which obviously makes your conclusion wrong as well.

> 2. Failure to open the device will not result in the device being closed, except in the case of (1).

Well, I can think of different reasons ...

> 3. I had a look at the watchdog daemon's source and repeatedly went "WTF?". I am now inclined to assume it is doing the wrong thing unless proved otherwise.

Now this is a strong accusation that you hopefully have some proof for.

Michael
Bug#554793: linux-image-2.6.26-2-amd64: ipmi_watchdog module frequently returns errno=16 - Device or resource busy on Dell Poweredge 860s
On Fri, Nov 06, 2009 at 04:00:08PM +0000, Tim Small wrote:
> Package: linux-image-2.6.26-2-amd64
> Version: 2.6.26-17lenny1
> Severity: normal
>
> Opening /dev/watchdog as provided by ipmi_watchdog on a Dell PowerEdge 860 running Lenny 5.0 (64 bit), frequently fails with EBUSY.

Could you please try the watchdog daemon package from backports.org? There was a race between stopping wd_keepalive and starting watchdog that was fixed after Lenny was released.

Michael
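One way a race of this kind can be closed is to wait until the old daemon has really gone before starting the new one. The sketch below shows that idea only; it is not the fix that went into the Debian package, and `pid_is_alive` and `wait_for_exit` are hypothetical helper names.

```python
import os
import time


def pid_is_alive(pid):
    """Return True if a process with this PID exists (signal 0 probe)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True     # exists, but is owned by another user
    return True


def wait_for_exit(pid, timeout=5.0, poll=0.1, is_alive=pid_is_alive,
                  clock=time.monotonic, sleep=time.sleep):
    """Block until `pid` is gone; return False if `timeout` expires first."""
    deadline = clock() + timeout
    while is_alive(pid):
        if clock() >= deadline:
            return False
        sleep(poll)
    return True
```

A caller would send TERM to wd_keepalive, call `wait_for_exit()` with its PID, and start watchdog only once it returns True. Note that a child which has exited but has not yet been reaped by its parent still answers the signal 0 probe.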
Bug#554793: linux-image-2.6.26-2-amd64: ipmi_watchdog module frequently returns errno=16 - Device or resource busy on Dell Poweredge 860s
On Fri, 2009-11-06 at 16:00 +0000, Tim Small wrote:
> Package: linux-image-2.6.26-2-amd64
> Version: 2.6.26-17lenny1
> Severity: normal
>
> Opening /dev/watchdog as provided by ipmi_watchdog on a Dell PowerEdge 860 running Lenny 5.0 (64 bit), frequently fails with EBUSY.
>
> Nov 5 11:50:09 kernel: [ 29.583805] IPMI Watchdog: driver initialized
> Nov 5 11:50:12 watchdog[3239]: starting daemon (5.4):
> Nov 5 11:50:12 watchdog[3239]: int=10s realtime=yes sync=no soft=no mla=0 mem=140733193388032
[...]
> Nov 5 11:50:12 watchdog[3239]: test=none(0) repair=none alive=/dev/watchdog heartbeat=none temp=none to=root no_act=no
> Nov 5 11:50:12 watchdog[3239]: cannot open /dev/watchdog (errno = 16 = 'Device or resource busy')
> Nov 5 11:50:12 kernel: [ 57.375695] IPMI Watchdog: Unexpected close, not stopping watchdog
>
> worse, if the module is loaded with the nowayout=1 - the machine then gets hard-reset timeout seconds later!

The watchdog device cannot be closed if it was not successfully opened. This is a problem with the watchdog daemon.

Ben.

-- 
Ben Hutchings
The generation of random numbers is too important to be left to chance.
- Robert Coveyou
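Since the outcome depends on the nowayout and timeout settings the module was loaded with, it can help to check them on the affected machine. The following is a sketch that assumes the driver exposes its parameters under the usual sysfs location /sys/module/<module>/parameters/, which depends on how the driver declares them; `read_module_params` is a hypothetical helper.

```python
import os


def read_module_params(module, names, sys_root="/sys/module"):
    """Read module parameters from sysfs; unreadable ones map to None."""
    values = {}
    for name in names:
        path = os.path.join(sys_root, module, "parameters", name)
        try:
            with open(path) as fh:
                values[name] = fh.read().strip()
        except OSError:
            values[name] = None     # not exposed, or module not loaded
    return values
```

For example, `read_module_params("ipmi_watchdog", ["nowayout", "timeout"])` returns a dictionary with one entry per requested name.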
Bug#554793: linux-image-2.6.26-2-amd64: ipmi_watchdog module frequently returns errno=16 - Device or resource busy on Dell Poweredge 860s
On Sat, Nov 07, 2009 at 03:40:22PM +0000, Ben Hutchings wrote:
> > worse, if the module is loaded with the nowayout=1 - the machine then gets hard-reset timeout seconds later!
>
> The watchdog device cannot be closed if it was not successfully opened. This is a problem with the watchdog daemon.

Ben, would you mind giving us a little more information as to why you think this is a bug in the daemon? I agree that the device cannot be closed if it wasn't successfully opened. But this does not explain why the device cannot be opened. Also, if the device was somehow opened, the system would be hard-reset no matter whether nowayout was set or not. If nowayout has an effect on the reset, the device is indeed closed.

Tim, is there any other software running that accesses /dev/watchdog? What happens if you do not start watchdog in the boot process but instead start it manually once the system is completely booted?

Michael
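The question of which other process has /dev/watchdog open can be answered from user space by walking /proc. The sketch below is one way to do that; `find_openers` is a hypothetical helper, it only sees processes whose fd directory the caller may read (so it should be run as root), and it compares the symlink target against the literal path.

```python
import os


def find_openers(target="/dev/watchdog", proc_root="/proc"):
    """Return sorted PIDs that have `target` open, by scanning /proc/<pid>/fd."""
    holders = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue                # not a process directory
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue                # process vanished or permission denied
        for fd in fds:
            try:
                if os.readlink(os.path.join(fd_dir, fd)) == target:
                    holders.append(int(entry))
                    break
            except OSError:
                continue            # fd was closed while we were looking
    return sorted(holders)
```

Running `find_openers()` right after a failed start of watchdog would show whether wd_keepalive, or anything else, still holds the device at that moment.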
Bug#554793: linux-image-2.6.26-2-amd64: ipmi_watchdog module frequently returns errno=16 - Device or resource busy on Dell Poweredge 860s
On Sat, 2009-11-07 at 22:45 +0100, Michael Meskes wrote:
> On Sat, Nov 07, 2009 at 03:40:22PM +0000, Ben Hutchings wrote:
> > > worse, if the module is loaded with the nowayout=1 - the machine then gets hard-reset timeout seconds later!
> >
> > The watchdog device cannot be closed if it was not successfully opened. This is a problem with the watchdog daemon.
>
> Ben, would you mind giving us a little bit more of information as to why you think this is a bug in the daemon?
[...]

1. EBUSY indicates that the watchdog is opening it more than once, which is obviously incorrect behaviour.

2. Failure to open the device will not result in the device being closed, except in the case of (1).

3. I had a look at the watchdog daemon's source and repeatedly went "WTF?". I am now inclined to assume it is doing the wrong thing unless proved otherwise.

Ben.