Re: [Pacemaker] Shooting and diagnosis of stonith plugins

Takenaka Kazuhiro Tue, 14 Oct 2008 22:16:11 -0700

Hi Dejan.

Hi Takenaka-san,


On Fri, Oct 10, 2008 at 03:30:27PM +0900, Takenaka Kazuhiro wrote:

Hi all.

As far as I know, every stonith plugin is expected to verify that
its target is fenced off from the other nodes before it returns a
successful status on 'reset' or 'off'.


It depends on the stonith device. Sometimes it is enough just to
send the reset command and let the device deal with it. Sometimes
it is necessary to check the current power state. However, it
looks like this is not what you want to talk about.


You said "The point of a stonith operation is to ensure that a host
is down or rebooted." in the following thread.

http://lists.community.tummy.com/pipermail/linux-ha/2008-August/034323.html

So I thought every stonith plugin should make sure that its target
is down or rebooted before it returns. However, as you understood,
this isn't the main issue.

However, I think this diagnosis is an excessive burden for an
individual plugin.


Actually, the stonith plugins are not required to know the state

... snip ...


  <primitive type="external/ssh" class="stonith" task="shoot" ...>

I hope some kind of agreement will be made about this problem.


Please let me put aside your comments above for now.
I have a question about your comments below, and I'd like
you to answer it first.

This new concept does make sense with the ssh plugin. However,
all other plugins function in a significantly different way and I
don't see how this can apply to them.

Thanks,

Dejan


Yes. 'ssh' is very different from 'ibmrsa-telnet'.

'ssh' shoots a target via a NIC.
'ibmrsa-telnet' shoots a target via an RSA.

Both devices necessarily lose power when a power fault
occurs on their host machine.

In this case, neither 'ssh' nor 'ibmrsa-telnet' can deal with
its target device. Both get an explicit connection failure
in this situation.

But what actually follows is very different.

When 'ssh' is used as the stonith plugin, it returns a
successful status and the suspended resources are resumed on the
other nodes.

On the other hand, when 'ibmrsa-telnet' is used, it returns
an error status and the suspended resources are not resumed
anywhere. (I think 'ibmrsa-telnet' isn't the only plugin
that works this way. 'ibmrsa' and 'ipmi' should also work
the same way.)

'ssh' and 'ibmrsa-telnet' judge the success or failure of
shooting a target in different ways, and that is what makes
their results differ.

'ssh' never checks whether it could deal with its target device.
Even if the attempt fails explicitly, 'ssh' ignores it.
Instead, 'ssh' always returns its status according to a subsequent
ping check.

On the other hand, 'ibmrsa-telnet' returns its status according
to whether it could deal with the device. Whenever 'ibmrsa-telnet'
gets an explicit failure while doing so, it returns an error
status. 'ibmrsa-telnet' never checks the target's status in any way.
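
As a rough sketch of the difference described above (this is not the
actual plugin code; the names Attempt, ssh_style_status, and
rsa_style_status are hypothetical and only model the two policies),
the decision each plugin makes could be written in Python like this:

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    """Outcome of one shooting attempt (hypothetical model, not plugin code)."""
    command_ok: bool    # did the plugin reach the device and issue the command?
    target_pings: bool  # does the target still answer ping afterwards?


def ssh_style_status(attempt: Attempt) -> bool:
    """'ssh' policy: ignore the command result, trust the later ping check."""
    return not attempt.target_pings


def rsa_style_status(attempt: Attempt) -> bool:
    """'ibmrsa-telnet' policy: trust the command result, never probe the target."""
    return attempt.command_ok


# Power fault on the host: the device is unreachable and the host is silent.
power_fault = Attempt(command_ok=False, target_pings=False)
print(ssh_style_status(power_fault))  # True  -> resources fail over
print(rsa_style_status(power_fault))  # False -> resources stay suspended
```

With the same power-fault input, the two policies give opposite
answers, which is the difference in question.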

Which is the correct implementation for a stonith plugin?
In other words, when an explicit connection error occurs
during a stonith action, what should a stonith plugin do?

I believed 'ssh' does it the right way, because I thought
a stonith plugin that makes failover fail on a power fault
is out of the question.

If 'ibmrsa-telnet' does it the right way, it means that any stonith
plugin that can't shoot a host machine with a power fault must not
be used alone. It must be used together with some other plugin that
checks whether its target machine is running.

Dejan Muhamedagic wrote:

Hi Takenaka-san,

On Fri, Oct 10, 2008 at 03:30:27PM +0900, Takenaka Kazuhiro wrote:

Hi all.

As far as I know, every stonith plugin is expected to verify that
its target is fenced off from the other nodes before it returns a
successful status on 'reset' or 'off'.


It depends on the stonith device. Sometimes it is enough just to
send the reset command and let the device deal with it. Sometimes
it is necessary to check the current power state. However, it
looks like this is not what you want to talk about.

However, I think this diagnosis is an excessive burden for an
individual plugin.


Actually, the stonith plugins are not required to know the state
of the host. They just make sure that the host is in a certain
state or that it is reset. This normally doesn't involve the host
itself, just the device which can manage it. Put in other words:
If you pull the power plug or press the reset button there's no
need to try ping or ssh or whatever else to verify that the host
really went down.

That is because plugin authors know how to deal with the stonith
devices they write plugins for, but they can't always anticipate the
structure of the clusters on which their plugins will run.

When a cluster administrator tries to use a plugin whose diagnosis
doesn't match the cluster, the administrator has no choice but to
alter the plugin directly.

This reduces the plugins' adaptability, which is undesirable.
One idea to avoid this problem is to establish schemes or conventions
that enable plugins to delegate the diagnosis to other plugins.

The two attached plugins are a sample of this idea. They work
cooperatively under the attached cib.xml.


It is an interesting idea. It seems like it would require that
all existing stonith plugins return false so that the next, the
"test status" plugin can report the state of the host.

'sshAltered' only shoots its targets and 'pingAllAddr' only diagnoses
whether its targets are still active.

The following is a slightly more detailed explanation:

  When some failure makes it necessary for one node to shoot a
  corrupted node, the shooter node first uses 'sshAltered' to
  shoot the target node.

  'sshAltered' shoots its targets but never exits with a successful
  status if the value of the attribute 'shoot_only' is "yes", as in
  the attached cib.xml. So the next plugin, if one is defined, is
  always used.

  'pingAllAddr' checks whether the IP addresses of its targets
  specified in cib.xml are active. If any of the IP addresses don't
  respond, 'pingAllAddr' exits with a successful status; otherwise it
  exits with an error status.
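
The convention above can be sketched in Python as follows. This is
only an illustration under stated assumptions, not the attached
plugins: make_shoot_only, make_ping_all_addr, fence, and the is_alive
callback are hypothetical names, the addresses are documentation
examples, and the sketch assumes that "any of the IP addresses don't
respond" means that none of them responds. It also assumes that the
shoot-only plugin ignores connection failures, as 'ssh' does.

```python
from typing import Callable, Iterable, Sequence

# A plugin takes a target name and returns True for a successful status.
Plugin = Callable[[str], bool]


def make_shoot_only(shoot: Callable[[str], object]) -> Plugin:
    """Model of 'sshAltered' with shoot_only="yes": shoot, never succeed."""
    def plugin(target: str) -> bool:
        try:
            shoot(target)
        except OSError:
            pass  # assumption: connection failures are ignored
        return False  # always hand the decision to the next plugin
    return plugin


def make_ping_all_addr(addrs: Sequence[str],
                       is_alive: Callable[[str], bool]) -> Plugin:
    """Model of 'pingAllAddr': succeed only if no address answers (assumption)."""
    def plugin(target: str) -> bool:
        return not any(is_alive(addr) for addr in addrs)
    return plugin


def fence(target: str, plugins: Iterable[Plugin]) -> bool:
    """Try the plugins in order; stop at the first successful status."""
    for plugin in plugins:
        if plugin(target):
            return True
    return False


shots = []
chain = [
    make_shoot_only(shots.append),
    make_ping_all_addr(["192.0.2.10", "192.0.2.11"], lambda addr: False),
]
print(fence("node2", chain), shots)  # True ['node2']
```

Because the shooting plugin never reports success, the diagnosing
plugin at the end of the chain can be swapped without touching it.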

Once 'external/ssh' has been rewritten into 'sshAltered', there
is no need to rewrite it again to use other conditions to
confirm the targets' death.

For example, if a cluster uses iSCSI shared storage and
a failover action on this cluster must wait for the iSCSI target
devices to sweep their connections to the corrupted node, this can
be done by another type of plugin in place of 'pingAllAddr'. Its
task is to ask the iSCSI target devices whether the connection
sweep has completed.

The reverse is also true. Any plugin that follows the convention
explained above can work together with 'pingAllAddr'.

It could also be made available through another tag attribute, like this:

  <primitive type="external/ssh" class="stonith" task="shoot" ...>

I hope some kind of agreement will be made about this problem.


This new concept does make sense with the ssh plugin. However,
all other plugins function in a significantly different way and I
don't see how this can apply to them.

Thanks,

Dejan

Best regards.
--
Takenaka Kazuhiro <[EMAIL PROTECTED]>





_______________________________________________
Pacemaker mailing list
[email protected]
http://list.clusterlabs.org/mailman/listinfo/pacemaker
