Hey Frank,

That is really random stuff. I am usually willing to point the finger at myself and say "bug in FreeIPMI", but this is way too random. The "data not available" errors in the log indicate that the packets returned from the BMC are actually malformed.
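To get a feel for how the failures are distributed, the log could be tallied by message type. Below is a minimal Python sketch, assuming one entry per line in the "[Mon DD HH:MM:SS]: message" form seen in the excerpt quoted further down. The function name tally_errors and the SAMPLE list are made up for illustration and are not part of FreeIPMI.

```python
import re
from collections import Counter

# Matches entries such as "[Jul 05 08:38:18]: Get Cmd: ipmi_kcs_cmd: driver timeout".
# Assumption: each log entry sits on a single line in this form.
ENTRY = re.compile(
    r"^\[(?P<stamp>[A-Z][a-z]{2} \d{2} \d{2}:\d{2}:\d{2})\]: (?P<msg>.*)$"
)


def tally_errors(lines):
    """Count log entries by message text, ignoring the timestamp."""
    counts = Counter()
    for line in lines:
        match = ENTRY.match(line.strip())
        if match:  # lines that do not look like log entries are skipped
            counts[match.group("msg")] += 1
    return counts


# A few entries copied from the quoted excerpt, with the mail line wraps removed.
SAMPLE = [
    "[Jul 05 08:38:18]: Get Cmd: ipmi_kcs_cmd: driver timeout",
    "[Jul 05 08:38:22]: Get Cmd: cmd error: 2h",
    "[Jul 05 08:38:38]: _get_watchdog_timer_cmd: fiid_obj_get: 'timeout_action': data not available",
    "[Jul 05 08:38:50]: Set Cmd: ipmi_kcs_cmd: internal IPMI error",
    "[Jul 05 08:39:01]: Set Cmd: ipmi_kcs_cmd: internal IPMI error",
    "[Jul 05 08:39:23]: _get_watchdog_timer_cmd: fiid_obj_get: 'timeout_action': data not available",
]

if __name__ == "__main__":
    for msg, n in tally_errors(SAMPLE).most_common():
        print(f"{n:3d}  {msg}")
```

If the counts are spread across unrelated message types rather than concentrated in one, that would fit the "too random to be a single bug" reading, though it would not prove it.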
On the flip side, I've personally never tested on Solaris. So who knows if the "/dev/bmc" was really programmed correctly. I may have screwed up.

Is this only happening on one motherboard? I have this feeling that maybe the board is just busted.

Al

On Tue, 2010-07-06 at 23:52 -0700, Frank Steiner wrote:
> Albert Chu wrote
>
> > Hey Frank,
> >
> > This is indeed very strange. I assume the reboots are because the timer
> > eventually times out, perhaps because the resets are no longer working
> > (lets say the BMC goes out to lunch).
>
> I don't think so because in the tests I repeat the resets every second
> and I always see if they succeed or not. Many of them are rejected with
> some kind of error messages, but it never happens that all fail for more
> than one minute.
>
> However, when I loop "bmc-watchdog -g" I get the strangest results with
> all fields showing complete nonsense, like
> Initial Countdown: 6553 sec
> Present Countdown: 0 sec
>
> and a second later
>
> Initial Countdown: 900 sec
> Present Countdown: 24513 sec
>
> and so on. Also the action field etc. change their values. If the timer
> would just run down, the host would reset and not power-off. So I guess
> that the ILOM is just that buggy that it can get confused by polling
> or resetting it :-(
>
> > Does the bmc-watchdog log say anything interesting? Normally
> > it's /var/log/freeipmi/bmc-watchdog.log.
>
> It says a lot, but nothing different just before shutting down that it
> hadn't showed before. E.g.:
>
> [Jul 05 08:38:08]: _set_watchdog_timer_cmd: fill_cmd_set_watchdog_timer:
> Invalid argument
> [Jul 05 08:38:18]: Get Cmd: ipmi_kcs_cmd: driver timeout
> [Jul 05 08:38:22]: Get Cmd: cmd error: 2h
> [Jul 05 08:38:38]: _get_watchdog_timer_cmd: fiid_obj_get: 'timeout_action':
> data not available
> [Jul 05 08:38:38]: _set_watchdog_timer_cmd: fill_cmd_set_watchdog_timer:
> Invalid argument
> [Jul 05 08:38:44]: Set Cmd: ipmi_kcs_cmd: driver timeout
> [Jul 05 08:38:50]: Set Cmd: ipmi_kcs_cmd: internal IPMI error
> [Jul 05 08:39:01]: Set Cmd: ipmi_kcs_cmd: internal IPMI error
> [Jul 05 08:39:23]: _get_watchdog_timer_cmd: fiid_obj_get: 'timeout_action':
> data not available
> [Jul 05 08:39:27]: _get_watchdog_timer_cmd: fiid_obj_get:
> 'initial_countdown_value': data not available
> [Jul 05 08:39:35]: _get_watchdog_timer_cmd: fiid_obj_get:
> 'initial_countdown_value': data not available
> [Jul 05 08:39:51]: Get Cmd: cmd error: 80h
> [Jul 05 08:39:52]: Set Cmd: ipmi_kcs_cmd: internal IPMI error
> [Jul 05 08:40:21]: Get Cmd: ipmi_kcs_cmd: driver timeout
>
>
> Strange enough, the watchdog reacts a lot quicker and more stable when
> I poll it through the network interface by "ipmitool ... bmc watchdog reset"
> or "get".
> It immediately responds, always with correct values, and never shuts down.
>
> Maybe that's because I don't have any special driver loaded on Linux?
> The sun driver is not available for Linux as far as I understood, so
> I'm just using "bmc-watchdog -g" without any drivers.
>
> cu,
> Frank
>
--
Albert Chu
[email protected]
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory

_______________________________________________
Freeipmi-devel mailing list
[email protected]
http://lists.gnu.org/mailman/listinfo/freeipmi-devel
