Greg Woods <wo...@ucar.edu> writes:

> For the second time in a few weeks, we have had one node of a particular
> cluster getting fenced. It isn't totally clear why this is happening. On
> the surviving node I see:
>
> Feb 2 16:48:52 vmc1 stonith-ng[4331]: notice: stonith-vm2 can fence (reboot) vmc2.ucar.edu: static-list
> Feb 2 16:48:52 vmc1 stonith-ng[4331]: notice: stonith-vm2 can fence (reboot) vmc2.ucar.edu: static-list
> Feb 2 16:49:00 vmc1 kernel: igb 0000:03:00.1 eth3: igb: eth3 NIC Link is Down
> Feb 2 16:49:00 vmc1 kernel: xenbr0: port 1(eth3) entered disabled state
> Feb 2 16:49:01 vmc1 corosync[2846]: [TOTEM ] A processor failed, forming new configuration.
>
> OK, so from this point of view, it looks like the link was lost
> between the two hosts, resulting in fencing.
I'd say the other way around: the fencing resulted in the link loss.

> Feb 2 16:46:46 vmc2 kernel: xenbr1: port 16(vif17.0) entered disabled state
> Feb 2 16:46:46 vmc2 kernel: xen:balloon: Cannot add additional memory (-17)
> Feb 2 16:46:47 vmc2 kernel: xen:balloon: Cannot add additional memory (-17)
> Feb 2 16:46:48 vmc2 kernel: xenbr1: port 16(vif17.0) entered disabled state
> Feb 2 16:46:48 vmc2 kernel: device vif17.0 left promiscuous mode
> Feb 2 16:46:48 vmc2 kernel: xenbr1: port 16(vif17.0) entered disabled state
> Feb 2 16:46:48 vmc2 kernel: xen:balloon: Cannot add additional memory (-17)
> Feb 2 16:46:49 vmc2 crmd[4191]: notice: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
> Feb 2 16:46:49 vmc2 attrd[4189]: notice: Sending flush op to all hosts for: fail-count-VM-radnets (1)
> Feb 2 16:46:49 vmc2 attrd[4189]: notice: Sent update 37: fail-count-VM-radnets=1
> Feb 2 16:46:49 vmc2 attrd[4189]: notice: Sending flush op to all hosts for: last-failure-VM-radnets (1486079209)
> Feb 2 16:46:49 vmc2 attrd[4189]: notice: Sent update 39: last-failure-VM-radnets=1486079209
> Feb 2 16:46:50 vmc2 pengine[4190]: notice: On loss of CCM Quorum: Ignore
> Feb 2 16:46:50 vmc2 pengine[4190]: warning: Processing failed op monitor for VM-radnets on vmc2.ucar.edu: not running (7)

Looks like your VM resource was destroyed (maybe due to the xen balloon errors above), and the monitor operation noticed this.

> Feb 2 16:46:50 vmc2 pengine[4190]: notice: Recover VM-radnets#011(Started vmc2.ucar.edu)
> Feb 2 16:46:50 vmc2 pengine[4190]: notice: Calculated Transition 2914: /var/lib/pacemaker/pengine/pe-input-317.bz2
> Feb 2 16:46:50 vmc2 crmd[4191]: notice: Initiating action 15: stop VM-radnets_stop_0 on vmc2.ucar.edu (local)
> Feb 2 16:46:51 vmc2 Xen(VM-radnets)[1016]: INFO: Xen domain radnets will be stopped (timeout: 80s)

If that stop operation failed for any reason, fencing would be expected.
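You can check this theory by replaying the transition the policy engine saved just before the fencing (the pe-input file path appears in the log above). A sketch, assuming the standard Pacemaker CLI tools are installed on the node; exact option spellings can vary slightly between Pacemaker versions:

```shell
# Replay the saved transition to see which actions were scheduled
# (recover = stop + start of VM-radnets) without touching the cluster:
crm_simulate --simulate --xml-file /var/lib/pacemaker/pengine/pe-input-317.bz2

# Show current status including per-resource fail counts, once:
crm_mon -1 --failcounts

# After fixing the underlying problem, clear the recorded failure so
# the resource is allowed to run on that node again:
crm_resource --cleanup --resource VM-radnets
```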
> Feb 2 16:46:52 vmc2 kernel: device vif21.0 entered promiscuous mode
> Feb 2 16:46:52 vmc2 kernel: IPv6: ADDRCONF(NETDEV_UP): vif21.0: link is not ready
> Feb 2 16:46:57 vmc2 kernel: xen-blkback:ring-ref 9, event-channel 10, protocol 1 (x86_64-abi)
> Feb 2 16:46:57 vmc2 kernel: vif vif-21-0 vif21.0: Guest Rx ready
> Feb 2 16:46:57 vmc2 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): vif21.0: link becomes ready
> Feb 2 16:46:57 vmc2 kernel: xenbr1: port 2(vif21.0) entered forwarding state
> Feb 2 16:46:57 vmc2 kernel: xenbr1: port 2(vif21.0) entered forwarding state
> Feb 2 16:47:12 vmc2 kernel: xenbr1: port 2(vif21.0) entered forwarding state
>
> (and then there are a bunch of null bytes, and the log resumes with reboot)

Remote logging helps a lot with such issues: the fenced node's last messages survive on another machine instead of being lost to null bytes in the local file.

-- 
Feri
_______________________________________________
Linux-HA mailing list is closing down.
Please subscribe to us...@clusterlabs.org instead.
http://clusterlabs.org/mailman/listinfo/users