Hello,
I just hacked up a CRM Nagios plugin which works for me. It does not
run "crm_verify -LV" yet, but I am going to add that. I don't like it
very much, but it does a good job for me. Is there a way to get the
information I currently check out of "cibadmin -o status -Q" or
something like that, so that I don't have to guess wildly at the
output of "crm_mon -1 -r"? I currently check the following:
- Are all nodes running?
- Is there a DC?
- Are all resources started?
- Are there any orphaned resources?
- Are there any failed actions?
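As a rough illustration of the first two checks, here is a minimal shell
sketch. It assumes "crm_mon -1 -r" output in the format of the samples
after __DATA__ in the plugin (a "Current DC:" line and "Node: ...: online"
lines). The function names count_nodes and has_dc are made up for this
example and are not part of the attached plugin.

```shell
#!/bin/sh
# Sketch only: the function names are hypothetical, and the parsing assumes
# the "crm_mon -1 -r" line formats seen in the sample output.

# Print "<online>/<total>" for the "Node:" lines read from stdin.
count_nodes() {
    awk '/^Node:/ { total++; if ($0 ~ /online$/) online++ }
         END { printf "%d/%d\n", online, total }'
}

# Succeed if stdin contains a "Current DC:" line.
has_dc() {
    grep -q 'Current DC:'
}

# Usage (needs a running cluster):
#   crm_mon -1 -r | count_nodes
#   crm_mon -1 -r | has_dc || echo "No DC chosen"
```

This still guesses at the text output, just like the Perl plugin does, so
it does not answer the cibadmin question.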
I attached my "check_crm" Nagios plugin. I also attached an OCF Resource
Agent that works perfectly on top of the tomcat init script which comes
with Debian Etch. So they're at least documented.
Thomas
#!/usr/bin/perl -w
use strict;
use warnings FATAL => 'all';
my $exit_value = 0;
my @output = ();
my $NODES = 0;
my $NODES_ONLINE = 0;
my $RESOURCES = 0;
my $RESOURCES_ONLINE = 0;
my $RESOURCES_ORPHANED = 0;
my $ACTIONS_FAILED = 0;
my @input = `crm_mon -1 -r`;
chomp(@input);
sub
set_exit_value
{
    my $request = shift;
    if ($exit_value < $request) {
        $exit_value = $request;
    }
}
sub
check_dc
{
    if (grep(/Current DC:/, @input)) {
        push(@output, "DC chosen;");
    } else {
        push(@output, "No DC chosen;");
        set_exit_value(1);
    }
}
sub
check_nodes
{
    for my $node (grep(/^Node:/, @input)) {
        if ($node =~ /online$/) {
            $NODES_ONLINE++;
        }
        $NODES++;
    }
    if ($NODES_ONLINE == 0) {
        set_exit_value(2);
    } elsif ($NODES != $NODES_ONLINE) {
        set_exit_value(1);
    }
    push(@output, "${NODES_ONLINE}/${NODES} nodes online;");
}
sub
check_ressources
{
    my $section = "";
    {
        my $text = join("\n", @input);
        # Everything after the "Full list of resources:" header. $section
        # stays empty if crm_mon printed no such section, which avoids a
        # fatal "uninitialized value" warning in split() below.
        if ($text =~ /Full list of resources:\n\n((\n|.)+)\n?\n?/) {
            $section = $1;
        }
    }
    for my $line (split("\n", $section)) {
        if ($line =~ /Started /) {
            $RESOURCES_ONLINE++;
            $RESOURCES++;
        } elsif ($line =~ /Master /) {
            $RESOURCES_ONLINE++;
            $RESOURCES++;
        } elsif ($line =~ /Stopped/) {
            $RESOURCES++;
        }
        if ($line =~ /ORPHANED/) {
            $RESOURCES_ORPHANED++;
        }
    }
    if ($RESOURCES) {
        if ($RESOURCES_ONLINE == 0) {
            set_exit_value(2);
        } elsif ($RESOURCES != $RESOURCES_ONLINE) {
            set_exit_value(1);
        }
        push(@output, "${RESOURCES_ONLINE}/${RESOURCES} resources online;");
    }
    if ($RESOURCES_ORPHANED) {
        set_exit_value(1);
        push(@output, "${RESOURCES_ORPHANED} orphaned resources;");
    }
}
sub
check_for_failed_actions
{
    my @section = @input;
    # Drop everything before the "Failed actions:" header.
    shift(@section) while (defined($section[0]) && $section[0] !~ /^Failed actions:$/);
    for my $line (@section) {
        if ($line =~ /Error/) {
            $ACTIONS_FAILED++;
        }
    }
    if (@section) {
        push(@output, "$ACTIONS_FAILED failed actions;");
        set_exit_value(1);
    }
}
check_dc();
check_nodes();
check_ressources();
check_for_failed_actions();
if ($exit_value == 0) {
    print "OK - ";
} elsif ($exit_value == 1) {
    print "WARNING - ";
} elsif ($exit_value == 2) {
    print "CRITICAL - ";
} else {
    print "UNKNOWN - ";
}
print join (" ", @output);
print "\n";
exit($exit_value);
__DATA__
============
Last updated: Wed Jan 2 09:11:40 2008
Current DC: postgres-01 (24a3fa1b-6b62-470c-a6e1-4c1598875018)
2 Nodes configured.
2 Resources configured.
============
Node: postgres-02 (211523e0-a549-49b7-bf29-f646915698ef): online
Node: postgres-01 (24a3fa1b-6b62-470c-a6e1-4c1598875018): online
Full list of resources:
Master/Slave Set: ms-drbd0
drbd0:0 (heartbeat::ocf:drbd): Stopped
drbd0:1 (heartbeat::ocf:drbd): Stopped
Resource Group: postgres-cluster
fs0 (heartbeat::ocf:Filesystem): Stopped
ip0 (heartbeat::ocf:IPaddr2): Stopped
pgsql0 (heartbeat::ocf:pgsql): Stopped
Failed actions:
drbd0:0_start_0 (node=postgres-01, call=6, rc=1): Error
drbd0:1_start_0 (node=postgres-01, call=9, rc=1): Error
drbd0:0_start_0 (node=postgres-02, call=6, rc=1): Error
drbd0:1_start_0 (node=postgres-02, call=9, rc=1): Error
============
Last updated: Mon Dec 31 13:52:22 2007
Current DC: tomcat-02 (e2607dae-3635-495e-b14f-f90f5dbb4a0e)
1 Nodes configured.
1 Resources configured.
============
Node: tomcat-02 (e2607dae-3635-495e-b14f-f90f5dbb4a0e): online
Full list of resources:
tomcat (heartbeat::ocf:tomcattg): Stopped
tomcat-02 (heartbeat::ocf:tomcattg ORPHANED): Started tomcat-02
============
Last updated: Wed Jan 2 19:51:53 2008
Current DC: postgres-01 (24a3fa1b-6b62-470c-a6e1-4c1598875018)
2 Nodes configured.
2 Resources configured.
============
Node: postgres-02 (211523e0-a549-49b7-bf29-f646915698ef): online
Node: postgres-01 (24a3fa1b-6b62-470c-a6e1-4c1598875018): online
Full list of resources:
Master/Slave Set: ms-drbd0
drbd0:0 (heartbeat::ocf:drbd): Master postgres-02
drbd0:1 (heartbeat::ocf:drbd): Slave postgres-01
Resource Group: postgres-cluster
fs0 (heartbeat::ocf:Filesystem): Started postgres-02
ip0 (heartbeat::ocf:IPaddr2): Started postgres-02
pgsql0 (heartbeat::ocf:pgsql): Started postgres-02
#!/bin/sh
# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
# OCF Resource Agent on top of tomcat init script shipped with Debian.  #
# Thomas Glanzmann --tg 21:22 07-12-30 #
# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
# This script manages a Heartbeat Tomcat instance
# usage: $0 {start|stop|status|monitor|meta-data}
# OCF exit codes are defined via ocf-shellfuncs
. ${OCF_ROOT}/resource.d/heartbeat/.ocf-shellfuncs
case "$1" in
    start)
        /etc/init.d/tomcat5.5 start > /dev/null 2>&1 && exit || exit 1
        ;;
    stop)
        /etc/init.d/tomcat5.5 stop > /dev/null 2>&1 && exit || exit 1
        ;;
    status)
        /etc/init.d/tomcat5.5 status > /dev/null 2>&1 && exit || exit 1
        ;;
    monitor)
        # Check if the resource is stopped
        /etc/init.d/tomcat5.5 status > /dev/null 2>&1 || exit 7
        # Otherwise check the service (XXX: Maybe loosen retry / timeout)
        wget -o /dev/null -O /dev/null -T 1 -t 1 \
            http://localhost:8180/eccar/ && exit || exit 1
        ;;
    meta-data)
        cat <<END
<?xml version="1.0"?>
<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
<resource-agent name="tomcattg">
<version>1.0</version>
<longdesc lang="en">
OCF Resource Agent on top of tomcat init script shipped with Debian.
</longdesc>
<shortdesc lang="en">OCF Resource Agent on top of tomcat init script shipped with Debian.</shortdesc>
<actions>
<action name="start" timeout="90" />
<action name="stop" timeout="100" />
<action name="status" timeout="60" />
<action name="monitor" depth="0" timeout="30s" interval="10s" start-delay="10s" />
<action name="meta-data" timeout="5s" />
<action name="validate-all" timeout="20s" />
</actions>
</resource-agent>
END
        ;;
    *)
        # Unknown or unimplemented action (this includes validate-all,
        # which has no handler here): 3 is OCF_ERR_UNIMPLEMENTED.
        exit 3
        ;;
esac