[Pacemaker] Shooting and diagnosis of stonith plugins

Takenaka Kazuhiro Thu, 09 Oct 2008 23:37:13 -0700

Hi all.

As far as I know, every stonith plugin is expected to verify that
its target has been fenced off from the other nodes before it returns
a successful status for 'reset' or 'off'.


However, I think this verification is an excessive burden for an
individual plugin.

Plugin authors know how to deal with the stonith devices
they write plugins for, but they cannot always anticipate the structure
of the clusters on which their plugins will run.

When a cluster administrator tries to use a plugin whose diagnosis
does not match the cluster, the administrator has no choice
but to alter the plugin directly.

This reduces the adaptability of plugins, which is not desirable.
One idea to avoid this problem is to establish schemes or conventions
that enable plugins to delegate the diagnosis to other plugins.

The two attached plugins are a sample of this idea. They work cooperatively
when configured as in the attached cib.xml.

'sshAltered' only shoots its targets and 'pingAllAddr' only diagnoses
activity of its targets.

The following is a slightly more detailed explanation:

  When an accident makes it necessary for one node to shoot a
  corrupted node, the shooter node first uses 'sshAltered' to
  shoot the target node.

  'sshAltered' shoots its targets but never exits with a successful
  status if the value of the attribute 'shoot_only' is "yes", as in
  the attached cib.xml. So the next plugin, if one is defined, will
  always be used.

  'pingAllAddr' checks whether the IP addresses of its targets,
  as specified in cib.xml, are still active. If none of the IP addresses
  responds, 'pingAllAddr' exits with a successful status; otherwise it
  exits with an error status.
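
The hand-off described above can be sketched as a small shell loop.
This is only an illustration of the convention, not a description of
how stonithd itself works. The names 'fence_chain', 'fake_shooter',
'fake_checker' and DEAD_HOSTS are hypothetical; the two stand-in
functions replace real plugins so that the sketch runs without
touching any host.

```shell
#!/bin/bash
# Sketch of the "shoot, then diagnose" hand-off (illustrative only).

# Stand-in for a shoot-only plugin such as 'sshAltered' with shoot_only=yes:
# it acts on the target but always reports failure so the next plugin runs.
fake_shooter() {
    echo "shoot $2"
    return 1
}

# Stand-in for a diagnose-only plugin such as 'pingAllAddr':
# it succeeds only when the target is confirmed dead.
# DEAD_HOSTS is a hypothetical space-separated list used for the demo.
fake_checker() {
    local host=$2
    case " $DEAD_HOSTS " in
        *" $host "*) echo "confirmed $host"; return 0 ;;
        *)           echo "alive $host";     return 1 ;;
    esac
}

# Try each plugin's 'reset' in order; the first success ends the chain.
# Usage: fence_chain TARGET PLUGIN...
fence_chain() {
    local target=$1 plugin
    shift
    for plugin in "$@"; do
        if "$plugin" reset "$target"; then
            return 0
        fi
    done
    return 1
}
```

With DEAD_HOSTS="node02", 'fence_chain node02 fake_shooter fake_checker'
returns 0 because the checker confirms the target, while the same call
for a host that is not in the list returns 1.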

Once 'external/ssh' has been rewritten into 'sshAltered', there
is no need to rewrite it again in order to use other conditions to
confirm the targets' death.

For example, if a cluster uses iSCSI shared storage and
a failover action on this cluster must wait for the iSCSI target
devices to sweep their connections to the corrupted node, this can be
done by another type of plugin in place of 'pingAllAddr'. Its task is to
ask the iSCSI target devices whether the connection sweep has completed.
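
A diagnose-only plugin for such a case could keep the same shape as
'pingAllAddr' and swap only the confirmation step. The sketch below
assumes a hypothetical helper command, named by a 'confirm_cmd'
variable, that exits 0 once the storage side reports that the node's
connections are gone. The parameter names 'confirm_cmd', 'retries' and
'interval' are assumptions made for illustration, and no real iSCSI
tool or option is implied.

```shell
#!/bin/bash
# Sketch of a diagnose-only 'reset' handler with a pluggable check.
# confirm_cmd, retries and interval are hypothetical parameter names.

# Poll the confirmation command until it succeeds or retries run out.
# Returns 0 when the target is confirmed released, 1 otherwise.
wait_until_confirmed() {
    local target=$1
    local tries=${retries:-5}
    local pause=${interval:-1}
    local i=0
    while [ "$i" -lt "$tries" ]; do
        if "$confirm_cmd" "$target"; then
            return 0
        fi
        sleep "$pause"
        i=$((i + 1))
    done
    return 1
}

# Handle 'reset' only for hosts this plugin is configured for.
# Usage: diagnose_reset TARGET   (hostlist is space- or comma-separated)
diagnose_reset() {
    local target=$1 h
    for h in $(echo "$hostlist" | tr ',' ' '); do
        if [ "$h" = "$target" ]; then
            wait_until_confirmed "$target"
            return
        fi
    done
    return 1    # unknown target: never report success
}
```

Only 'wait_until_confirmed' knows how death is confirmed, so replacing
the command behind 'confirm_cmd' changes the diagnosis without
touching the shooting plugin.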

The reverse is also true. Any plugin that follows the convention
explained above can work together with 'pingAllAddr'.

It could also be made available through another tag attribute, like this:

  <primitive type="external/ssh" class="stonith" task="shoot" ...>

I hope some kind of agreement will be made about this problem.

Best regards.
-- 
Takenaka Kazuhiro <[EMAIL PROTECTED]>

#!/bin/bash

# 'sshAltered' is almost the same as 'external/ssh' except for 2 points.
# 1) This plugin logs some debug messages to /var/log/stonith.log.
# 2) This plugin doesn't ping to confirm the death of the target after
#    shooting it if the value of ${shoot_only} is "yes".

#
# External STONITH module for ssh.
#
# Copyright (c) 2004 SUSE LINUX AG - Lars Marowsky-Bree <[EMAIL PROTECTED]>
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of version 2 of the GNU General Public License as
# published by the Free Software Foundation.
#
# This program is distributed in the hope that it would be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#
# Further, this software is distributed without any warranty that it is
# free of the rightful claim of any third person regarding infringement
# or the like.  Any license provided herein, whether implied or
# otherwise, applies only to this software file.  Patent licenses, if
# any, provided herein do not apply to combinations of this program with
# other software, or any other product whatsoever.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write the Free Software Foundation,
# Inc., 59 Temple Place - Suite 330, Boston MA 02111-1307, USA.
#

SSH_COMMAND="/usr/bin/ssh -q -x -o PasswordAuthentication=no -o StrictHostKeyChecking=no -n -l root"
#SSH_COMMAND="/usr/bin/ssh -q -x -n -l root"

REBOOT_COMMAND="echo 'sleep 2; /sbin/reboot -nf' | SHELL=/bin/sh at now >/dev/null 2>&1"

# Warning: If you select this poweroff command, it'll physically
# power-off the machine, and quite a number of systems won't be remotely
# revivable.
# TODO: Probably should touch a file on the server instead to just
# prevent heartbeat et al from being started after the reboot.
# POWEROFF_COMMAND="echo 'sleep 2; /sbin/poweroff -nf' | SHELL=/bin/sh at now >/dev/null 2>&1"
POWEROFF_COMMAND="echo 'sleep 2; /sbin/reboot -nf' | SHELL=/bin/sh at now >/dev/null 2>&1"

# Rewrite the hostlist to accept "," as a delimiter for hostnames too.
hostlist=`echo $hostlist | tr ',' ' '`

is_host_up() {
  for j in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
  do
    if
      ping -w1 -c1 "$1" >/dev/null 2>&1
    then
      sleep 1
    else
      return 1
    fi
  done
  return 0
}


savelog() { 
    echo $(date '+%Y%m%d-%H%M%S') ${0##*/} "$@" >> /var/log/stonith.log; }
EXIT() { savelog EXIT $subcmd "$@"; exit "$@";}

savelog "ARGS" "$@" = $hostlist

subcmd=$1

case $1 in
gethosts)
        for h in $hostlist ; do
            echo $h
        done
        EXIT 0
        ;;
on)
        # Can't really be implemented because ssh cannot power on a system
        # when it is powered off.
        EXIT 1
        ;;
off)
        # Shouldn't really be implemented because if ssh cannot power on a 
        # system, it shouldn't be allowed to power it off.
        EXIT 1
        ;;
reset)
        for h in $hostlist
        do
          if
            [ "$h" != "$2" ]
          then
            continue
          fi
          if
            case ${livedangerously} in
              [Yy]*)    is_host_up $h;;
              *)        true;;
             esac
          then
            $SSH_COMMAND "$2" "$REBOOT_COMMAND"
            # Good thing this is only for testing...

            # Shooting only, 
            # in other words, skip status verification of the shot node
            if [[ "$shoot_only" = yes ]]; then
              EXIT 1
            fi

            if
              is_host_up $h
            then
              EXIT 1
            else
              EXIT 0
            fi
          else
            # well... Let's call it successful, after all this is only for testing...
            EXIT 0
          fi
        done
        EXIT 1
        ;;
status)
        if
          [ -z "$hostlist" ]
        then
          EXIT 1
        fi
        for h in $hostlist
        do
          if
            ping -w1 -c1 "$h" 2>&1 | grep "unknown host"
          then
            EXIT 1
          fi
        done
        EXIT 0
        ;;
getconfignames)
        echo "hostlist"
        EXIT 0
        ;;
getinfo-devid)
        echo "ssh STONITH device"
        EXIT 0
        ;;
getinfo-devname)
        echo "ssh STONITH external device"
        EXIT 0
        ;;
getinfo-devdescr)
        echo "ssh-based Linux host reset"
        echo "Fine for testing, but not suitable for production!"
        EXIT 0
        ;;
getinfo-devurl)
        echo "http://openssh.org";
        EXIT 0
        ;;
getinfo-xml)
        cat << SSHXML
<parameters>
<parameter name="hostlist" unique="1" required="1">
<content type="string" />
<shortdesc lang="en">
Hostlist
</shortdesc>
<longdesc lang="en">
The list of hosts that the STONITH device controls
</longdesc>
</parameter>

<parameter name="livedangerously" unique="0" required="0">
<content type="enum" />
<shortdesc lang="en">
Live Dangerously!!
</shortdesc>
<longdesc lang="en">
Set to "yes" if you want to risk your system's integrity.
Of course, since this plugin isn't for production, using it
in production at all is a bad idea.  On the other hand,
setting this parameter to yes makes it an even worse idea.
Viva la Vida Loca!
</longdesc>
</parameter>

</parameters>
SSHXML
        EXIT 0
        ;;
*)
        EXIT 1
        ;;
esac

#!/bin/bash

# 'pingAllAddr' doesn't shoot its targets; this plugin only confirms the death
# of the targets. 'pingAllAddr' pings the IP addresses of the targets specified
# in cib.xml. If none of the IP addresses responds, 'pingAllAddr'
# exits with a successful status; otherwise it exits with an error status.

savelog() { 
    echo $(date '+%Y%m%d-%H%M%S') ${0##*/} "$@" >> /var/log/stonith.log; }
EXIT() { savelog EXIT $subcmd "$@"; exit "$@";}

are_all_addrs_dead()
{
    savelog ENTER are_all_addrs_dead
    local host=$1
    for name in ${!addrlist*}; do
        savelog ADDR $name
        eval set -- \$$name
        if [[ "$1" = "$host" ]]; then
            shift
            for addr in "$@"; do
                savelog ping $addr
                if ping -w1 -c1 "$addr" >/dev/null 2>&1; then
                    savelog PING OK $addr
                    return 1
                fi
                savelog PING NG $addr
            done
            return 0
        fi
    done
    return 1
}


hostlist=`echo $hostlist | tr ',' ' '`

savelog "ARGS" "$@" = "$hostlist"

subcmd=$1

case $1 in
gethosts)
    for h in $hostlist ; do
        echo $h
    done
    EXIT 0
    ;;
on)
    EXIT 1
    ;;
off)
    EXIT 1
    ;;
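reset)
    # Wait before probing; default to 0 seconds when initial_wait is unset.
    sleep "${initial_wait:-0}"
    for h in $hostlist; do
        if [ "$h" != "$2" ]; then
            continue
        fi
        savelog CALL are_all_addrs_dead
        if are_all_addrs_dead $h; then
            EXIT 0
        fi
        EXIT 1
    done
    # The target is not in hostlist, so its death cannot be confirmed.
    EXIT 1
    ;;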
status)
    if [ -z "$hostlist" ]; then
        EXIT 1
    fi
    EXIT 0
    ;;
getconfignames)
    echo "hostlist"
    EXIT 0
    ;;
getinfo-devid)
    echo "isNodeAlive"
    EXIT 0
    ;;
getinfo-devname)
    echo "isNodeAlive device"
    EXIT 0
    ;;
getinfo-devdescr)
    echo "isNodeAlive"
    EXIT 0
    ;;
getinfo-devurl)
    echo "http://127.0.0.1";
    EXIT 0
    ;;
getinfo-xml)
        cat << EOX
<parameters>

<parameter name="hostlist" unique="1" required="1">
<content type="string" />
<shortdesc lang="en">
Hostlist
</shortdesc>
<longdesc lang="en">
The list of hosts that the STONITH device controls
</longdesc>
</parameter>

</parameters>
EOX
    EXIT 0
    ;;
*)
    EXIT 1
    ;;
esac

<!-- vim:set sw=2 ts=8: -->
<cib epoch="1" num_updates="1" admin_epoch="0">
  <configuration>
    <crm_config>
      <cluster_property_set id="cib-bootstrap-options">
	<attributes>
	  <nvpair id="cib-bootstrap-options-no-quorum-policy" name="no-quorum-policy" value="ignore"/>
	  <nvpair id="cib-bootstrap-options-stonith-enabled" name="stonith-enabled" value="true"/>
	  <nvpair id="cib-bootstrap-options-default-resource-stickiness" name="default-resource-stickiness" value="INFINITY"/>
	  <nvpair id="cib-bootstrap-options-default-resource-failure-stickiness" name="default-resource-failure-stickiness" value="-INFINITY"/>
	  <nvpair id="cib-bootstrap-options-default-action-timeout" name="default-action-timeout" value="120s"/>
	</attributes>
      </cluster_property_set>
    </crm_config>
    <nodes/>

    <resources>

      <primitive id="dummy" class="ocf" type="Dummy" provider="heartbeat">
	<operations>
	  <op id="dummy:start"   name="start" timeout="30" on_fail="restart"/>
	  <op id="dummy:monitor" name="monitor" timeout="30" on_fail="fence" interval="10"/>
	  <op id="dummy:stop"    name="stop" timeout="30" on_fail="fence"/>
	</operations>
      </primitive>

      <clone id="clnFencing" globally_unique="false">
        <instance_attributes id="clnFencing:attr">
          <attributes>
            <nvpair id="clnFencing:attr:clone_max" name="clone_max" value="2"/>
            <nvpair id="clnFencing:attr:clone_node_max" name="clone_node_max" value="1"/>
          </attributes>
        </instance_attributes>
	<group id="grpFencing">

	  <primitive id="prmSshAltered" class="stonith" type="external/sshAltered">
	    <operations>
	      <op id="prmSshAltered:op:monitor" name="monitor" interval="5s" timeout="20s" prereq="nothing"/>
	      <op id="prmSshAltered:op:start"   name="start" timeout="20s" prereq="nothing"/>
	    </operations>
	    <instance_attributes id="prmSshAltered:attr">
	      <attributes>
		 <nvpair id="prmSshAltered:attr:hostlist"   name="hostlist"   value="node01,node02"/>
		 <nvpair id="prmSshAltered:attr:shoot_only" name="shoot_only" value="yes"/>
	      </attributes>
	    </instance_attributes>
	  </primitive>

	  <primitive id="prmPingAllAddr" class="stonith" type="external/pingAllAddr">
	    <operations>
	      <op id="prmPingAllAddr:op:monitor" name="monitor" interval="5s" timeout="20s" prereq="nothing"/>
	      <op id="prmPingAllAddr:op:start"   name="start" timeout="20s" prereq="nothing"/>
	    </operations>
	    <instance_attributes id="prmPingAllAddr:attr">
	      <attributes>
		 <nvpair id="prmPingAllAddr:attr:hostlist" name="hostlist" value="node01,node02"/>
		 <nvpair id="prmPingAllAddr:attr:initial_wait" name="initial_wait" value="5"/>
		 <nvpair id="prmPingAllAddr:attr:addrlist01" name="addrlist01" value="node01 172.20.24.111 192.168.101.1 192.168.102.1 192.168.110.1"/>
		 <nvpair id="prmPingAllAddr:attr:addrlist02" name="addrlist02" value="node02 172.20.24.112 192.168.101.2 192.168.102.2 192.168.110.2"/>
	      </attributes>
	    </instance_attributes>
	  </primitive>

	</group>
      </clone>

    </resources>

    <constraints>
      <rsc_location rsc="dummy" id="dummy:location1" >
	<rule id="dummy:rule1" score="200">
	  <expression id="dummy:exp1" attribute="#uname" operation="eq" value="node01"/>
	</rule>
	<rule id="dummy:rule2" score="100">
	  <expression id="dummy:exp2" attribute="#uname" operation="eq" value="node02"/>
	</rule>
      </rsc_location>
    </constraints>

  </configuration>
</cib>

_______________________________________________
Pacemaker mailing list
[email protected]
http://list.clusterlabs.org/mailman/listinfo/pacemaker
