Hello,

I run a two-node RHEL 5.1 cluster with openais 0.80.3-13 and cman 2.0.80-1; both nodes are hosted on VMware ESX 3.0.2 servers. Fencing works fine, but here is my issue: whenever I simulate the failure of a node (shutting down eth0, or a hard reboot), the node is fenced but can never rejoin the cluster. The log shows:

Mar 17 14:24:32 VMClutest01 openais[1941]: [TOTEM] entering COMMIT state.
Mar 17 14:24:32 VMClutest01 openais[1941]: [TOTEM] entering RECOVERY state.
Mar 17 14:24:32 VMClutest01 openais[1941]: [TOTEM] position [0] member 10.148.46.50:
Mar 17 14:24:32 VMClutest01 openais[1941]: [TOTEM] previous ring seq 7692 rep 10.148.46.50
Mar 17 14:24:32 VMClutest01 openais[1941]: [TOTEM] aru c high delivered c received flag 1
Mar 17 14:24:32 VMClutest01 openais[1941]: [TOTEM] position [1] member 10.148.46.51:
Mar 17 14:24:32 VMClutest01 openais[1941]: [TOTEM] previous ring seq 7688 rep 10.148.46.51
Mar 17 14:24:32 VMClutest01 openais[1941]: [TOTEM] aru b high delivered b received flag 1
Mar 17 14:24:32 VMClutest01 openais[1941]: [TOTEM] Did not need to originate any messages in recovery.
Mar 17 14:24:32 VMClutest01 openais[1941]: [TOTEM] Sending initial ORF token
Mar 17 14:24:32 VMClutest01 openais[1941]: [CLM ] CLM CONFIGURATION CHANGE
Mar 17 14:24:32 VMClutest01 openais[1941]: [CLM ] New Configuration:
Mar 17 14:24:32 VMClutest01 openais[1941]: [CLM ] r(0) ip(10.148.46.50)
Mar 17 14:24:32 VMClutest01 openais[1941]: [CLM ] Members Left:
Mar 17 14:24:32 VMClutest01 openais[1941]: [CLM ] Members Joined:
Mar 17 14:24:32 VMClutest01 openais[1941]: [CLM ] CLM CONFIGURATION CHANGE
Mar 17 14:24:32 VMClutest01 openais[1941]: [CLM ] New Configuration:
Mar 17 14:24:32 VMClutest01 openais[1941]: [CLM ] r(0) ip(10.148.46.50)
Mar 17 14:24:32 VMClutest01 openais[1941]: [CLM ] r(0) ip(10.148.46.51)
Mar 17 14:24:32 VMClutest01 openais[1941]: [CLM ] Members Left:
Mar 17 14:24:32 VMClutest01 openais[1941]: [CLM ] Members Joined:
Mar 17 14:24:32 VMClutest01 openais[1941]: [CLM ] r(0) ip(10.148.46.51)
Mar 17 14:24:32 VMClutest01 openais[1941]: [SYNC ] This node is within the primary component and will provide service.
Mar 17 14:24:32 VMClutest01 openais[1941]: [TOTEM] entering OPERATIONAL state.
Mar 17 14:24:32 VMClutest01 openais[1941]: [MAIN ] Killing node VMClutest02 because it has rejoined the cluster with existing state

Is there anything to do after a failure on one node to make it rejoin the cluster in a "clean" state? If I try to cleanly restart node 2 with "shutdown -r now", it hangs on stopping cluster services; if I hard reboot node 2, it can never rejoin the cluster, and the log is the same as above.
My cluster.conf:

<?xml version="1.0"?>
<cluster alias="TestClu01" config_version="9" name="TestClu01">
  <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="60"/>
  <clusternodes>
    <clusternode name="VMClutest01" nodeid="1" votes="1">
      <fence>
        <method name="FENCESX">
          <device name="ESX01"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="VMClutest02" nodeid="2" votes="1">
      <fence>
        <method name="FENCESX">
          <device name="ESX02"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <cman expected_votes="1" two_node="1"/>
  <fencedevices>
    <fencedevice name="ESX01" agent="fence_vi3" ipaddr="10.148.45.206" port="VMClutest01" login="" passwd=" "/>
    <fencedevice name="ESX02" agent="fence_vi3" ipaddr="10.148.45.206" port="VMClutest02" login="" passwd=" "/>
  </fencedevices>
  <rm>
    <failoverdomains>
      <failoverdomain name="AppCluster" ordered="0" restricted="0">
        <failoverdomainnode name="VMClutest01" priority="1"/>
        <failoverdomainnode name="VMClutest02" priority="1"/>
      </failoverdomain>
    </failoverdomains>
    <resources>
      <ip address="10.148.46.55" monitor_link="1"/>
    </resources>
    <service autostart="1" domain="AppCluster" exclusive="0" name="AppServer" recovery="restart">
      <ip ref="10.148.46.55"/>
    </service>
  </rm>
  <totem consensus="4800" join="1000" token="5000" token_retransmits_before_loss_const="20"/>
</cluster>

Any idea?

Mathieu
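P.S. For reference, the fence_daemon line above has clean_start="0". A sketch of the variant with startup fencing relaxed, in case it is relevant to the rejoin behaviour (this is only an illustration of the attribute, not a recommendation, since clean_start="1" tells fenced to assume all nodes are clean at startup and skips a safety check):

<!-- Sketch only: clean_start="1" makes fenced skip startup fencing of nodes
     it has not yet seen; this trades safety for easier rejoining. -->
<fence_daemon clean_start="1" post_fail_delay="0" post_join_delay="60"/>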
-- Linux-cluster mailing list [email protected] https://www.redhat.com/mailman/listinfo/linux-cluster
