Package: redhat-cluster
Severity: important
*** Please type your report below this line ***
An unresolved upstream bug (for which the upstream report includes a
patch) affects the usability of the cluster suite on Debian Lenny:
https://bugzilla.redhat.com/show_bug.cgi?id=512512
This bug prevents the cluster from correctly handling the relocation
of services that were running on a failed node.
The problem is that fenced correctly handles the fencing of the failed
node, but is unable to communicate this correctly to the cman module,
which in turn fails to notify rgmanager that the node has been fenced
and that any services it was running are safe to relocate to other
nodes.
This bug makes the cluster suite on Debian Lenny unusable for
providing HA services.
A symptom of the notification failure can be seen in syslog when a node fails:
Oct 28 16:15:51 clusternode27 clurgmgrd[8602]: <debug> Membership Change Event
Oct 28 16:15:51 clusternode27 fenced[2760]: clusternode30 not a cluster member after 0 sec post_fail_delay
Oct 28 16:15:51 clusternode27 fenced[2760]: fencing node "clusternode30"
Oct 28 16:15:51 clusternode27 fenced[2760]: can't get node number for node pC9@#001
Oct 28 16:15:51 clusternode27 fenced[2760]: fence "clusternode30" success
Oct 28 16:15:56 clusternode27 clurgmgrd[8602]: <info> Waiting for node #30 to be fenced
In the fourth log line, "can't get node number for node ..." should
report the name clusternode30, but the bug causes the memory holding
this name to be freed before it is used in the message, so garbage is
printed instead.
While fenced is satisfied that the node has been successfully fenced,
the final line shows that the resource manager clurgmgrd is still
awaiting notification that the fencing has taken place.
This bug does not affect the RHEL5 version of the package, which is
based on a different source tree.
However, all distributions based on the STABLE2 branch are likely to
be affected.
The cluster used to expose the bug consists of 3 VM guests with the
following cluster configuration:
<?xml version="1.0"?>
<cluster name="testcluster" config_version="28">
  <cman port="6809">
    <multicast addr="224.0.0.1"/>
  </cman>
  <fencedevices>
    <fencedevice agent="fence_ack_null" name="fan01"/>
  </fencedevices>
  <clusternodes>
    <clusternode name="clusternode27" nodeid="27">
      <multicast addr="224.0.0.1" interface="eth0:1"/>
      <fence>
        <method name="1">
          <device name="fan01"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="clusternode28" nodeid="28">
      <multicast addr="224.0.0.1" interface="eth0:1"/>
      <fence>
        <method name="1">
          <device name="fan01"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="clusternode30" nodeid="30">
      <multicast addr="224.0.0.1" interface="eth0:1"/>
      <fence>
        <method name="1">
          <device name="fan01"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <rm log_level="7">
    <failoverdomains>
      <failoverdomain name="new_cluster_failover" nofailback="1"
          ordered="0" restricted="1">
        <failoverdomainnode name="clusternode27" priority="1"/>
        <failoverdomainnode name="clusternode28" priority="1"/>
        <failoverdomainnode name="clusternode30" priority="1"/>
      </failoverdomain>
    </failoverdomains>
    <resources>
      <script name="sentinel" file="/bin/true"/>
    </resources>
    <service autostart="0" exclusive="0" name="SENTINEL" recovery="disable">
      <script ref="sentinel"/>
    </service>
  </rm>
</cluster>
Where clusternode27, clusternode28, and clusternode30 are the 3 node names.
The only non-standard component is a dummy fencing agent installed in
/usr/sbin/fence_ack_null:
#!/bin/bash
#
# Fencing agent that always succeeds
#
echo Done
# eof
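The stub can be exercised outside the cluster to confirm it behaves
like a successful fence agent. A minimal sketch, assuming fence agents
receive key=value options on stdin and signal success via exit status
0 (the temporary path and the option line below are illustrative, not
what fenced sends verbatim):

```shell
# Recreate the dummy agent in a temporary file and drive it roughly
# the way fenced would; nothing here touches the real cluster.
agent=$(mktemp)
cat > "$agent" <<'EOF'
#!/bin/bash
# Fencing agent that always succeeds
echo Done
EOF
chmod +x "$agent"
out=$(printf 'nodename=clusternode30\n' | "$agent")
status=$?
echo "$out"            # -> Done
echo "status=$status"  # -> status=0
rm -f "$agent"
```

The agent never reads its options, which is fine for this purpose: all
fenced cares about is the exit status.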
The cluster can now be started. On each node:
sudo /etc/init.d/cman start
sudo /etc/init.d/rgmanager start
Once all nodes are running, start the SENTINEL service on clusternode30:
sudo /usr/sbin/clusvcadm -e SENTINEL -m clusternode30
On clusternode27, view the cluster status. Service SENTINEL should be
in state started on clusternode30.
sudo /usr/sbin/clustat
At this point, tail syslog on clusternode27 and clusternode28.
Power-off clusternode30 (as rudely as possible).
On clusternode27, view the status of the cluster again:
sudo /usr/sbin/clustat
This will show that clusternode30 is Offline, but that
service:SENTINEL is still in state "started" on clusternode30.
Again on clusternode27, view the node status:
sudo /usr/sbin/cman_tool -f nodes
Clusternode30 will show as status "X", with a note saying "Node has
not been fenced since it went down".
-- System Information:
Debian Release: 5.0.3
APT prefers stable
APT policy: (500, 'stable')
Architecture: amd64 (x86_64)
Kernel: Linux 2.6.26-2-amd64 (SMP w/1 CPU core)
Locale: LANG=en_GB.UTF-8, LC_CTYPE=en_GB.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/bash