Re: [Pacemaker] Shooting and diagnosis of stonith plugins

Takenaka Kazuhiro Fri, 17 Oct 2008 01:43:38 -0700

Hi Dejan

>> B is stated in http://www.linux-ha.org/STONITH,
>>   which says:
>>
>>     3. When given a RESET or OFF command it must not return
>>        control to its caller until the node is no longer running.
>
> stonithd retries forever.


What I cited is an article about stonith plugins, not stonithd.
Here is a longer excerpt.

=========================
There are a few properties a STONITH plugin must have for it
to be usable in Heartbeat:
  1. ...
  2. ...
  3. When given a RESET or OFF command it must not return
     control to its caller until the node is no longer running.
  4. ...
=========================
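
One hedged reading of property 3, sketched in Python: issue the command, then poll the device's power state and return only once it reports the node as off. `send_off` and `power_is_on` are hypothetical callables standing in for whatever the management device offers; they are not part of any real plugin API.

```python
import time
from typing import Callable


def off_and_wait(send_off: Callable[[], None],
                 power_is_on: Callable[[], bool],
                 poll_interval: float = 1.0) -> None:
    """Issue OFF, then block until the device reports the node is off.

    Both callables are hypothetical stand-ins for device-specific code.
    """
    send_off()
    # Property 3: do not hand control back while the node may still run.
    while power_is_on():
        time.sleep(poll_interval)
```

Note that this loop never returns if the device keeps reporting the node as powered on; whether and when to give up is left to the caller in this sketch.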

Or are there other documents for stonith plugins of Pacemaker?

> Sorry, but I really can't see where's the issue here.

I am looking for criteria other than those 'ssh' uses.
I want to know how to write correct stonith plugins for Pacemaker.
(This no longer matches the subject line, sorry.
 Should I start another thread for this issue?)

Are the articles about stonith plugins on http://www.linux-ha.org/
also valid for Pacemaker?

Dejan Muhamedagic wrote:

Hi,

On Thu, Oct 16, 2008 at 06:00:16PM +0900, Takenaka Kazuhiro wrote:

Hi Dejan.

If 'ibmrsa-telnet' does the right thing, it means that any stonith
plugin that can't shoot a host machine after a power fault must not
be used alone. It must be used with some other plugin that checks
whether its target machine is running or not.

This is an inherent problem of the lights-out devices such as IBM
RSA or HP iLO, i.e. that they share power source with the node
they manage. Power failure renders this kind of stonith device
useless. Unfortunately, there's nothing one can do about it.

But something must be done.


If only one had the means :) The only way to deal with this would be
to fall back to meatware.

In this case, what a plugin can do is one of the following:

  A) Check the target by another way.
  B) Retry forever.
  C) Return failure to caller.

A is what 'ssh' does.
  And you said 'ssh' isn't meant for production.
  Does that mean that no other, real stonith plugin may do A?

B is stated in http://www.linux-ha.org/STONITH,
  which says:

    3. When given a RESET or OFF command it must not return
       control to its caller until the node is no longer running.


stonithd retries forever.

  Any plugin that follows B keeps running until stonithd kills it
  on an error.

C is what 'ibmrsa-telnet' does.
  Any plugin that follows C returns failure immediately on an error.
  But I don't know of any document that encourages C.


That's all fine, but remember that the said plugin can't reach
the stonith device. Hence, all it can do is report an error.

Sorry, but I really can't see where's the issue here.

Thanks,

Dejan

Which is the right choice for real stonith plugins?

Dejan Muhamedagic wrote:

Hi Takenaka-san,

On Wed, Oct 15, 2008 at 02:09:17PM +0900, Takenaka Kazuhiro wrote:

Hi Dejan.

Hi Takenaka-san,

On Fri, Oct 10, 2008 at 03:30:27PM +0900, Takenaka Kazuhiro wrote:

Hi all.

As far as I know, every stonith plugin is expected to verify that
its target is fenced off from the other nodes before it returns
a successful status on 'reset' or 'off'.

It depends on the stonith device. Sometimes it is enough just to
send the reset command and let the device deal with it. Sometimes
it is necessary to check the current power state. However, it
looks like this is not what you want to talk about.

You said "The point of a stonith operation is to ensure that a host
is down or rebooted." in the following thread.

http://lists.community.tummy.com/pipermail/linux-ha/2008-August/034323.html

So I thought any stonith plugin should make sure its target
is down or rebooted before it returns. However, as you understood,
this isn't the main issue.

However, I think this diagnosis is a somewhat excessive burden for
an individual plugin.

Actually, the stonith plugins are not required to know the state

... snip ...

  <primitive type="external/ssh" class="stonith" task="shoot" ...>

I hope some kind of agreement will be made about this problem.

Please let me put aside your comments above for now.
I have a question about your comments below, and I'd like
you to answer it first.

This new concept does make sense with the ssh plugin. However,
all other plugins function in a significantly different way and I
don't see how this can apply to them.

Thanks,

Dejan

Yes. 'ssh' is very different from 'ibmrsa-telnet'.

'ssh' shoots a target via a NIC.
'ibmrsa-telnet' shoots a target via an RSA.

Actually, I'd rather leave ssh out of this discussion. It was
never meant for production, just for testing.

So these devices necessarily lose power when a power fault
occurs on their host machine.

In this case, neither 'ssh' nor 'ibmrsa-telnet' can reach
its target device. Each gets an explicit connection failure
in this situation.

But what actually follows is very different.

In the case where 'ssh' is used as a stonith plugin, it returns a
successful status and the suspended resources are resumed on the
other nodes.

On the other hand, in the case where 'ibmrsa-telnet' is used,
it returns an error status and the suspended resources are not
resumed anywhere. (I think 'ibmrsa-telnet' isn't the only plugin
that works in this way. 'ibmrsa' and 'ipmi' should also work in
the same way.)

'ssh' and 'ibmrsa-telnet' judge the success or failure of
shooting a target in different ways, and that is why the
results differ.

'ssh' never checks whether it could reach its target device.
Even if the attempt fails explicitly, 'ssh' ignores it.
Instead, 'ssh' always returns its status according to a subsequent
ping check.

On the other hand, 'ibmrsa-telnet' returns its status according
to whether it could reach the device. Whenever 'ibmrsa-telnet'
gets an explicit failure while doing so, it returns an error
status. 'ibmrsa-telnet' never checks the target's status in any way.
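
The difference between the two ways of measuring success can be sketched roughly as follows. This is an illustration of the behaviour described above, not the plugins' actual code; `send_reset` and `host_answers_ping` are hypothetical callables, and a failed connection is modelled as a `ConnectionError`.

```python
from typing import Callable


def ssh_style(send_reset: Callable[[], None],
              host_answers_ping: Callable[[], bool]) -> bool:
    """Ignore the outcome of the reset attempt and report success
    only if the target no longer answers ping afterwards."""
    try:
        send_reset()
    except ConnectionError:
        pass  # an explicit failure is ignored
    return not host_answers_ping()


def rsa_style(send_reset: Callable[[], None]) -> bool:
    """Report whether the command reached the device; the target
    itself is never probed."""
    try:
        send_reset()
    except ConnectionError:
        return False
    return True
```

With a power fault, `send_reset` fails in both cases and the host does not answer ping, so `ssh_style` reports success while `rsa_style` reports an error, which matches the two outcomes described above.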

Which is the correct implementation of a stonith plugin?

Both. Note that ssh relies on the network, hence using ping to
verify the host status is fine. However, for a "real" stonith
device such as RSA doing that would be wrong.

In other words, when an explicit connection error occurs
during a stonith action, what should a stonith plugin do?

I had believed that 'ssh' does the right thing, because I thought
a stonith plugin that makes failover fail on a power fault
is out of the question.

If the stonith device cannot be reached then we don't know if the
host is running or not. Hence we have to assume the worst case.

If 'ibmrsa-telnet' does the right thing, it means that any stonith
plugin that can't shoot a host machine after a power fault must not
be used alone. It must be used with some other plugin that checks
whether its target machine is running or not.

This is an inherent problem of the lights-out devices such as IBM
RSA or HP iLO, i.e. that they share power source with the node
they manage. Power failure renders this kind of stonith device
useless. Unfortunately, there's nothing one can do about it.

Thanks,

Dejan

Dejan Muhamedagic wrote:

Hi Takenaka-san,

On Fri, Oct 10, 2008 at 03:30:27PM +0900, Takenaka Kazuhiro wrote:

Hi all.

As far as I know, every stonith plugin is expected to verify that
its target is fenced off from the other nodes before it returns
a successful status on 'reset' or 'off'.

It depends on the stonith device. Sometimes it is enough just to
send the reset command and let the device deal with it. Sometimes
it is necessary to check the current power state. However, it
looks like this is not what you want to talk about.

However, I think this diagnosis is a somewhat excessive burden for
an individual plugin.

Actually, the stonith plugins are not required to know the state
of the host. They just make sure that the host is in a certain
state or that it is reset. This normally doesn't involve the host
itself, just the device which can manage it. Put in other words:
If you pull the power plug or press the reset button there's no
need to try ping or ssh or whatever else to verify that the host
really went down.

That is because authors of plugins know how to deal with the stonith
devices for which they write plugins, but they can't always anticipate
the structure of the clusters on which their plugins will run.

When a cluster administrator tries to use some plugin but the
diagnosis in the plugin doesn't suit the cluster, the administrator
has no choice but to alter the plugin directly.

This reduces the plugins' adaptability, which is undesirable.
One idea to avoid this problem is to establish schemes or conventions
that enable plugins to delegate the diagnosis to other plugins.

The two attached plugins are a sample of this idea. They work
cooperatively through the attached cib.xml.

It is an interesting idea. It seems like it would require that
all existing stonith plugins return false so that the next, the
"test status" plugin can report the state of the host.

'sshAltered' only shoots its targets and 'pingAllAddr' only diagnoses
activity of its targets.

Here is a slightly more detailed explanation:

  When some accident makes it necessary for one node to shoot
  a corrupted node, the shooter node first uses 'sshAltered' to
  shoot the target node.

  'sshAltered' shoots its targets but never exits with a successful
  status if the value of the attribute 'shoot_only' is "yes", as in
  the attached cib.xml. So the next plugin, if one is defined, will
  always be used.

  'pingAllAddr' checks the activity of the IP addresses of its targets
  specified in cib.xml. If any of the IP addresses don't respond,
  'pingAllAddr' exits with a successful status; otherwise it
  exits with an error status.
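
The convention described above might be modelled like this. It is only a sketch under stated assumptions: the function names mirror the two sample plugins but the bodies are invented for illustration, `shoot` and `responds` are hypothetical callables, and `run_chain` assumes the caller moves on to the next plugin whenever one reports failure. The `strict` flag is also an assumption: the default follows the literal wording (success if at least one address is silent), while `strict=True` requires every address to be silent.

```python
from typing import Callable, Iterable, Sequence


def ssh_altered(shoot: Callable[[], None]) -> bool:
    """Models shoot_only="yes": shoot, then always report failure
    so that the next plugin in the chain is consulted."""
    try:
        shoot()
    except ConnectionError:
        pass  # the diagnosis is delegated, so the error is not decisive
    return False


def ping_all_addr(addresses: Sequence[str],
                  responds: Callable[[str], bool],
                  strict: bool = False) -> bool:
    """Diagnosis only: report success when the target looks dead."""
    silent = [not responds(addr) for addr in addresses]
    # strict=True is the conservative reading: every address is silent.
    return all(silent) if strict else any(silent)


def run_chain(plugins: Iterable[Callable[[], bool]]) -> bool:
    """Try the plugins in order and stop at the first success."""
    return any(plugin() for plugin in plugins)
```

Because `ssh_altered` never succeeds, the result of the chain is decided entirely by the diagnosis step, which is the point of the proposal.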

Once 'external/ssh' has been rewritten into 'sshAltered', there
is no need to rewrite it again in order to use other conditions to
confirm the targets' death.

For example, if a cluster uses iSCSI shared storage and
a failover action on this cluster must wait for the iSCSI target
devices to clean up their connections to the corrupted node, this can
be done by another type of plugin in place of 'pingAllAddr'. Its task
is to ask the iSCSI target devices whether the connection cleanup has
completed.

The converse is also true. Any plugin that follows the convention
explained above can work together with 'pingAllAddr'.

It could also be made available through another tag attribute, like this:

  <primitive type="external/ssh" class="stonith" task="shoot" ...>

I hope some kind of agreement will be made about this problem.

This new concept does make sense with the ssh plugin. However,
all other plugins function in a significantly different way and I
don't see how this can apply to them.

Thanks,

Dejan

Best regards.
--
Takenaka Kazuhiro <[EMAIL PROTECTED]>



_______________________________________________
Pacemaker mailing list
[email protected]
http://list.clusterlabs.org/mailman/listinfo/pacemaker


--
Takenaka Kazuhiro <[EMAIL PROTECTED]>
NTT OSS
TEL 03-5860-5135




