Package: redhat-cluster
Severity: important

An unresolved upstream bug (which includes a patch) affects the usability of the cluster suite on Debian Lenny:

    https://bugzilla.redhat.com/show_bug.cgi?id=512512

This bug prevents the cluster from correctly handling the relocation of services that were running on a failed node. The problem is that fenced correctly handles the fencing of the failed node, but is unable to communicate this to the cman module, which in turn fails to notify rgmanager that the node has been fenced and that any services it was running are safe to be relocated to other nodes. This bug makes the cluster suite on Debian Lenny unusable for providing HA services.

A symptom of the notification failure can be seen in syslog when a node fails:

    Oct 28 16:15:51 clusternode27 clurgmgrd[8602]: <debug> Membership Change Event
    Oct 28 16:15:51 clusternode27 fenced[2760]: clusternode30 not a cluster member after 0 sec post_fail_delay
    Oct 28 16:15:51 clusternode27 fenced[2760]: fencing node "clusternode30"
    Oct 28 16:15:51 clusternode27 fenced[2760]: can't get node number for node p<C9>@#001
    Oct 28 16:15:51 clusternode27 fenced[2760]: fence "clusternode30" success
    Oct 28 16:15:56 clusternode27 clurgmgrd[8602]: <info> Waiting for node #30 to be fenced

On line 4, the "can't get node number..." message should report the name "clusternode30", but the bug involves incorrectly freeing the memory holding this name before it is used in this message. While fenced is satisfied that the node has been successfully fenced, the final line shows that the resource manager clurgmgrd is still awaiting notification that the fencing has taken place.

This bug does not affect the RHEL5 version of the package, which is based on a different source tree. However, all distributions based on the STABLE2 branch are likely to be affected.

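The syslog symptom above can also be checked for mechanically. What follows is only a sketch and is not shipped with redhat-cluster: the function name scan_fence_symptoms and the LOOKUP_FAILED / STILL_WAITING tags are made up for illustration, and the patterns are taken solely from the log excerpt above, so other versions may word these messages differently. It reads syslog text on standard input and flags the failed node-number lookup, as well as any clurgmgrd "Waiting for node ... to be fenced" message that appears after fenced has already reported success.

```shell
#!/bin/sh
# Hypothetical helper (not part of redhat-cluster). Reads syslog text on
# stdin and prints the lines suggesting that fenced completed a fence
# but rgmanager was never told about it.
scan_fence_symptoms() {
    awk '
        # fenced could not resolve the node name to a node number
        # ("can.t" avoids a literal apostrophe inside the quoted program)
        /fenced\[[0-9]+\]: can.t get node number for node/ {
            print "LOOKUP_FAILED: " $0
        }
        # remember that fenced has reported a successful fence
        /fenced\[[0-9]+\]: fence ".*" success/ {
            fenced_ok = 1
        }
        # clurgmgrd is still waiting although fenced reported success
        /clurgmgrd\[[0-9]+\]: <info> Waiting for node #[0-9]+ to be fenced/ {
            if (fenced_ok) print "STILL_WAITING: " $0
        }
    '
}
```

On a surviving node this could be run as "scan_fence_symptoms < /var/log/syslog" (the log path is an assumption); no output means neither pattern was seen. The check does not match node names to node numbers, so in a log covering several fence events a STILL_WAITING line may refer to a different node than the one fenced.
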
The cluster used to expose the bug consists of 3 VM guests with the following cluster configuration:

    <?xml version="1.0"?>
    <cluster name="testcluster" config_version="28">
      <cman port="6809">
        <multicast addr="224.0.0.1"/>
      </cman>
      <fencedevices>
        <fencedevice agent="fence_ack_null" name="fan01"/>
      </fencedevices>
      <clusternodes>
        <clusternode name="clusternode27" nodeid="27">
          <multicast addr="224.0.0.1" interface="eth0:1"/>
          <fence>
            <method name="1">
              <device name="fan01"/>
            </method>
          </fence>
        </clusternode>
        <clusternode name="clusternode28" nodeid="28">
          <multicast addr="224.0.0.1" interface="eth0:1"/>
          <fence>
            <method name="1">
              <device name="fan01"/>
            </method>
          </fence>
        </clusternode>
        <clusternode name="clusternode30" nodeid="30">
          <multicast addr="224.0.0.1" interface="eth0:1"/>
          <fence>
            <method name="1">
              <device name="fan01"/>
            </method>
          </fence>
        </clusternode>
      </clusternodes>
      <rm log_level="7">
        <failoverdomains>
          <failoverdomain name="new_cluster_failover" nofailback="1" ordered="0" restricted="1">
            <failoverdomainnode name="clusternode27" priority="1"/>
            <failoverdomainnode name="clusternode28" priority="1"/>
            <failoverdomainnode name="clusternode30" priority="1"/>
          </failoverdomain>
        </failoverdomains>
        <resources>
          <script name="sentinel" file="/bin/true"/>
        </resources>
        <service autostart="0" exclusive="0" name="SENTINEL" recovery="disable">
          <script ref="sentinel"/>
        </service>
      </rm>
    </cluster>

Where clusternode27, clusternode28, and clusternode30 are the 3 node names.

The only non-standard component is a dummy fencing agent installed in /usr/sbin/fence_ack_null:

    #!/bin/bash
    #
    # Fencing agent that always succeeds
    #
    echo Done
    # eof

The cluster can now be started. On each node:

    sudo /etc/init.d/cman start
    sudo /etc/init.d/rgmanager start

Once all nodes are running, start the SENTINEL service on clusternode30:

    sudo /usr/sbin/clusvcadm -e SENTINEL -m clusternode30

On clusternode27, view the cluster status. Service SENTINEL should be in state "started" on clusternode30.

    sudo /usr/sbin/clustat

At this point, tail syslog on clusternode27 and clusternode28. Power off clusternode30 (as rudely as possible).

On clusternode27, view the status of the cluster again:

    sudo /usr/sbin/clustat

This will show that clusternode30 is "Offline", but that service:SENTINEL is still in state "started" on clusternode30.

Again on clusternode27, view the node status:

    sudo /usr/sbin/cman_tool -f nodes

clusternode30 will show as status "X", with a note saying "Node has not been fenced since it went down".

-- System Information:
Debian Release: 5.0.3
  APT prefers stable
  APT policy: (500, 'stable')
Architecture: amd64 (x86_64)

Kernel: Linux 2.6.26-2-amd64 (SMP w/1 CPU core)
Locale: LANG=en_GB.UTF-8, LC_CTYPE=en_GB.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/bash

