Re: [Linux-HA] riloe: both nodes are STONITHed

Dejan Muhamedagic Wed, 04 Jun 2008 13:06:07 -0700

Hi,

On Wed, Jun 04, 2008 at 01:02:33PM +0200, Zoltan Boszormenyi wrote:
> Dejan Muhamedagic wrote:
>> Hi,
>>
>> On Wed, Jun 04, 2008 at 10:38:16AM +0200, Zoltan Boszormenyi wrote:
>>   
>>> We had a similar problem at a customer. The solution was to set these:
>>>
>>> ilo_can_reset = 1
>>> ilo_powerdown_method = reset
>>>
>>> Despite the reset method, the machine was powered down.
>>>     
>>
>> Instead of being rebooted? Did you try a riloe management client
>> (I suppose that there is one)?
>>   
>
> Yes, instead of being rebooted. The web iLO interface worked just fine,
> i.e. a reset made it reboot immediately, a "4 second button press" made it
> power down immediately, etc. But the external/riloe stonith module triggered
> a shutdown in every case. We were negatively impressed by this behaviour.


I can only imagine :) Well, the only explanation I can think of
is that the device you have doesn't support (or implements
incorrectly) the protocol used by the external/riloe plugin.
There's also this comment in the metadata for the ilo_can_reset
parameter:

   Does the ILO device support RESET commands (hint: older ones
   cannot)

If you leave ilo_can_reset at its default, then the "reset"
command is going to be translated to "off" followed by "on"
(power-wise). Does that help?
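
To make the translation concrete, here is a rough sketch in Python.
It only models the behaviour described above and is not the plugin's
actual code; the function name is made up for illustration, and
ilo_can_reset=False stands for leaving the parameter at its default.

```python
def expand_power_request(request, ilo_can_reset=False):
    """Return the list of power commands sent for a stonith request.

    Illustrative model only; this is not the external/riloe source.
    ilo_can_reset=False stands for leaving the parameter at its default.
    """
    if request == "reset" and not ilo_can_reset:
        # The device is assumed unable to reset, so power-cycle instead.
        return ["off", "on"]
    return [request]


print(expand_power_request("reset"))                      # ['off', 'on']
print(expand_power_request("reset", ilo_can_reset=True))  # ['reset']
```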

>>> We also tried
>>>
>>> ilo_powerdown_method = button
>>>
>>> but this made the stonith resource fail to start when one node was
>>> powering up while the other stayed powered down.
>>>     
>>
>> Sorry, don't get this one. A stonith resource would fail to start
>> in case the monitor operation fails, e.g. when the stonith device
>> is not reachable. I don't see how that should be the case, unless
>> the whole box loses power. Disclaimer: No experience with riloe.
>>   
>
> Yes, it was strange for us, too. The management interface was working,
> the Windows browser client was working fine. But nevertheless, booting up
> one node with heartbeat gave us a failed stonith resource.
> One small detail is that we started heartbeat manually, and the node that
> became alive shot the other down when both were booted up but only one was
> running heartbeat.
> Actually, it was just starting it and initializing stonith.
> Our theory for the one powered up/one powered down case was that
> external/riloe returned an error if it issued a "button" command to the
> management interface and the node behind it was already off.

The plugin won't issue the "button" command to check status.
Actually, it uses the "status" command in that case. Perhaps this
particular command doesn't work in this case or works in an
unexpected way. You can also try to test the device with the
stonith program. Use '-d' to get debugging output as well.
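
If it helps, such a test run could be wrapped roughly like this. It
is only a sketch: the '-t' and '-S' options and the name=value
parameter syntax are as I remember them from the stonith usage text,
so please check them against 'stonith -h' first. The parameter names
hostlist, ilo_hostname, ilo_user and ilo_password, and all the values,
are placeholders I'm assuming here, not something from your setup.

```python
import shutil
import subprocess


def stonith_argv(params, action="-S", debug=True):
    """Build the argument list for one external/riloe test run."""
    argv = ["stonith"]
    if debug:
        argv.append("-d")  # debugging output
    argv += ["-t", "external/riloe"]
    argv += ["%s=%s" % (name, value) for name, value in params.items()]
    argv.append(action)  # -S: status check (assumed, verify with stonith -h)
    return argv


def run_status_check(params):
    """Run the check if stonith is installed; return its exit code or None."""
    if shutil.which("stonith") is None:
        return None
    return subprocess.run(stonith_argv(params)).returncode


# Placeholder values, not taken from this thread.
example = {
    "hostlist": "node2",
    "ilo_hostname": "ilo-node2.example.com",
    "ilo_user": "Administrator",
    "ilo_password": "secret",
    "ilo_powerdown_method": "button",
}
print(" ".join(stonith_argv(example)))
```

Running the status check once with the other node powered on and
once with it powered off should show whether the "status" command
is what fails.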

Since you said that this happened at the customer's site, I guess
that it won't be easy to do the testing. Perhaps you can post
the device model they have.

Thanks,

Dejan

> The web interface may ignore it (because it's a NOP) but riloe was
> returning it as a failure.
>
> Best regards,
> Zoltán Böszörményi
>
>> Thanks,
>>
>> Dejan
>> _______________________________________________
>> Linux-HA mailing list
>> [email protected]
>> http://lists.linux-ha.org/mailman/listinfo/linux-ha
>> See also: http://linux-ha.org/ReportingProblems
>>
>>   
>
>
> -- 
> ----------------------------------
> Zoltán Böszörményi
> Cybertec Schönig & Schönig GmbH
> http://www.postgresql.at/
