Hi list.

I've been using Heartbeat "heartbeat-2.1.3-22.1.i386" with Pacemaker
"pacemaker-heartbeat-0.6.5-8.1.i386" on HP ProLiant DL380 G4 servers
with iLO firmware "version 1.88 09/19/2006" with the external/riloe
plugin for STONITH.  

I noticed a STONITH issue when I originally set them up: if the server
you're trying to reset is OFF and you run a stonith command like this
one, it returns exit status 0 but the server does not attempt to
start:
stonith -t external/riloe hostlist=192.168.33.21
ilo_hostname=192.168.33.20  ilo_user=Administrator
ilo_password=xxxxxxxx  ilo_can_reset=1  ilo_protocol=2.0
ilo_powerdown_method=button -T reset 192.168.33.20

This breaks rule #4 here: http://www.linux-ha.org/STONITH

I tried the same command with "-T off" and that returned status 0,
turned the server on for about 1 second then powered it off.  I tried
the same command with "-T on" and it returned status 0 and turned the
server on.  Note that when the server was on, everything worked fine
(they all returned exit status 0, reset reset it, off turned it off and
on kept it on). 

I thought - okay, reset doesn't work when it's OFF, what if I set the
"ilo_can_reset" variable to "0"?  I tried it with all three of the above
commands and got the exact same results.  I looked at the riloe script
for a while (I'm not a python guy) and it seemed that maybe line 174 was
incorrectly checking the value of the "reset_ok" variable such that the
second half of the if statement would always fail.  The relevant section
in the external/riloe script I'm referring to is:

line 174: if cmd == 'reset' and not reset_ok:
                acmds.append(login + todo['off'] + logout)
                acmds.append(login + todo['on'] + logout)
          else:   
                acmds.append(login + todo[cmd] + logout)

So I created an "external/my-riloe" script where line 174 looks like
this instead:
line 174: if cmd == 'reset' and reset_ok == '0':
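
To illustrate why I think the original check never takes the off/on
branch, here is a minimal sketch.  It assumes reset_ok reaches that line
as the string given for ilo_can_reset (I have not confirmed how the
script parses it), and build_commands is a hypothetical stand-in for the
branch above, not a function from the riloe script.  In Python any
non-empty string, including '0', is true, so "not reset_ok" is false:

```python
def build_commands(cmd, reset_ok, use_string_compare):
    """Hypothetical stand-in for the branch around line 174 of riloe."""
    if use_string_compare:
        # The check used in my-riloe: compare against the string '0'.
        split_reset = cmd == 'reset' and reset_ok == '0'
    else:
        # The original check: relies on the truthiness of reset_ok.
        split_reset = cmd == 'reset' and not reset_ok
    if split_reset:
        return ['off', 'on']
    return [cmd]

# Assumption: ilo_can_reset arrives as a string such as '0' or '1'.
print(not '0')
print(build_commands('reset', '0', use_string_compare=False))
print(build_commands('reset', '0', use_string_compare=True))
```

If reset_ok were an integer 0 instead, the original check would already
behave as intended, so this only explains the symptom under the string
assumption.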

When I used this new "external/my-riloe" script with "ilo_can_reset=0"
and "-T reset" when the server was OFF, the server turned on for a
second then shut off like the "off" command does, then turned on and
stayed on.  I used the "external/my-riloe" script with Heartbeat and it
worked for my purposes.  It was able to bring up the node when it was
off.  For example if I hard powered down either node the other node
powered it back on.  So, all was well in the server room.  :)

Recently we got some HP ProLiant DL380 G5 servers with iLO 2 firmware
"version 1.50 03/21/2008" and I installed Heartbeat
"heartbeat-2.99.0-3.1.i386" and Pacemaker
"pacemaker-heartbeat-0.6.6-17.2.i386" on them.  What I found when doing
the STONITH command line tests with the "external/riloe" plugin is that
when the node is ON everything works as expected (reset resets, off
turns it off and on turns it on).  However when the server is OFF, on
turns it on, off fails loudly, and reset silently returns exit code 0
but the server stays off.  The fact that OFF returns an error when it's
OFF means that the workaround plugin I was using won't work anymore.
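
A toy model of why the off-then-on workaround stops working, based only
on the behaviour described above.  It is not real iLO or riloe code:
simulate and its parameters are made up for illustration, and it assumes
the sequence is abandoned as soon as one command fails.

```python
def simulate(commands, server_on, off_errors_when_off):
    """Toy model of the observed behaviour; returns (server_on, status)."""
    for cmd in commands:
        if cmd == 'off':
            if not server_on and off_errors_when_off:
                # iLO 2 case: "off" against a powered-off server fails,
                # so the following "on" is never reached (assumption).
                return server_on, 'failed at off'
            server_on = False
        elif cmd == 'on':
            server_on = True
    return server_on, 'ok'

# Older iLO: "off" on a powered-off server still returns status 0.
print(simulate(['off', 'on'], server_on=False, off_errors_when_off=False))
# iLO 2: "off" on a powered-off server returns an error.
print(simulate(['off', 'on'], server_on=False, off_errors_when_off=True))
```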

Here are the command outputs, starting with both nodes ON.  You can see
that the first "OFF" works fine, then the second one fails loudly.
After that the "reset" command returns exit status 0 but doesn't bring
up the node:

[EMAIL PROTECTED] stonith -t external/riloe hostlist=192.168.33.21
ilo_hostname=192.168.33.20  ilo_user=Administrator
ilo_password=xxxxxxxx  ilo_can_reset=1  ilo_protocol=2.0
ilo_powerdown_method=button -T off 192.168.33.20
[EMAIL PROTECTED] stonith -t external/riloe hostlist=192.168.33.21
ilo_hostname=192.168.33.20  ilo_user=Administrator
ilo_password=xxxxxxxx  ilo_can_reset=1  ilo_protocol=2.0
ilo_powerdown_method=button -T off 192.168.33.20
** INFO: external_run_cmd: Calling
'/usr/lib/stonith/plugins/external/riloe off 192.168.33.20' returned 256

** (process:14417): CRITICAL **: external_reset_req: 'riloe off' for
host 192.168.33.20 failed with rc 256
[EMAIL PROTECTED] echo $?
5
[EMAIL PROTECTED] stonith -t external/riloe hostlist=192.168.33.21
ilo_hostname=192.168.33.20  ilo_user=Administrator
ilo_password=xxxxxxxx  ilo_can_reset=1  ilo_protocol=2.0
ilo_powerdown_method=button -T reset 192.168.33.20
[EMAIL PROTECTED] echo $?
0

Has anyone else encountered this issue?  

Will upgrading the iLO 2 firmware version to 1.60 fix it?  I didn't see
anything in HP's list of fixes that resembles the issue I'm having.  

Any ideas how I can get "reset" to turn the server on when it is OFF,
either directly or by using "off" then "on"?

Thanks,

-- 
Tyler Sutherland


_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
