Greg Woods <wo...@ucar.edu> writes:

> For the second time in a few weeks, we have had one node of a particular
> cluster getting fenced. It isn't totally clear why this is happening. On
> the surviving node I see:
>
> Feb  2 16:48:52 vmc1 stonith-ng[4331]:   notice: stonith-vm2 can fence 
> (reboot) vmc2.ucar.edu: static-list
> Feb  2 16:48:52 vmc1 stonith-ng[4331]:   notice: stonith-vm2 can fence 
> (reboot) vmc2.ucar.edu: static-list
> Feb  2 16:49:00 vmc1 kernel: igb 0000:03:00.1 eth3: igb: eth3 NIC Link is Down
> Feb  2 16:49:00 vmc1 kernel: xenbr0: port 1(eth3) entered disabled state
> Feb  2 16:49:01 vmc1 corosync[2846]:   [TOTEM ] A processor failed, forming 
> new configuration.
>
> OK, so from this point of view, it looks like the link was lost
> between the two hosts, resulting in fencing.

I'd say the other way around: the fencing resulted in the link loss. The
stonith-ng messages at 16:48:52 show vmc1 already working on fencing vmc2
before the link went down at 16:49:00, so the "processor failed" at 16:49:01
is just vmc2 being rebooted.
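
(If you want to make the ordering explicit, the fencer keeps a history on the
surviving node; the node name below is simply taken from your logs:

  stonith_admin --history vmc2.ucar.edu

should show when the reboot of vmc2 was requested and whether it completed.)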

> Feb  2 16:46:46 vmc2 kernel: xenbr1: port 16(vif17.0) entered disabled state
> Feb  2 16:46:46 vmc2 kernel: xen:balloon: Cannot add additional memory (-17)
> Feb  2 16:46:47 vmc2 kernel: xen:balloon: Cannot add additional memory (-17)
> Feb  2 16:46:48 vmc2 kernel: xenbr1: port 16(vif17.0) entered disabled state
> Feb  2 16:46:48 vmc2 kernel: device vif17.0 left promiscuous mode
> Feb  2 16:46:48 vmc2 kernel: xenbr1: port 16(vif17.0) entered disabled state
> Feb  2 16:46:48 vmc2 kernel: xen:balloon: Cannot add additional memory (-17)
> Feb  2 16:46:49 vmc2 crmd[4191]:   notice: State transition S_IDLE -> 
> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL 
> origin=abort_transition_graph ]
> Feb  2 16:46:49 vmc2 attrd[4189]:   notice: Sending flush op to all hosts 
> for: fail-count-VM-radnets (1)
> Feb  2 16:46:49 vmc2 attrd[4189]:   notice: Sent update 37: 
> fail-count-VM-radnets=1
> Feb  2 16:46:49 vmc2 attrd[4189]:   notice: Sending flush op to all hosts 
> for: last-failure-VM-radnets (1486079209)
> Feb  2 16:46:49 vmc2 attrd[4189]:   notice: Sent update 39: 
> last-failure-VM-radnets=1486079209
> Feb  2 16:46:50 vmc2 pengine[4190]:   notice: On loss of CCM Quorum: Ignore
> Feb  2 16:46:50 vmc2 pengine[4190]:  warning: Processing failed op monitor 
> for VM-radnets on vmc2.ucar.edu: not running (7)

Looks like your VM resource was destroyed (maybe due to the xen balloon
errors above), and the monitor operation noticed this.
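
(To see what the monitor actually reported and whether the domain really
disappeared, something along these lines should do; resource and domain
names are just the ones from your logs:

  # one-shot cluster status including failed actions and fail counts
  crm_mon -1rf
  # is the Xen domain still there? ("xm list radnets" on an older toolstack)
  xl list radnets
  # clear the failure once you are done investigating
  crm_resource --cleanup --resource VM-radnets
)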

> Feb  2 16:46:50 vmc2 pengine[4190]:   notice: Recover VM-radnets#011(Started 
> vmc2.ucar.edu)
> Feb  2 16:46:50 vmc2 pengine[4190]:   notice: Calculated Transition 2914: 
> /var/lib/pacemaker/pengine/pe-input-317.bz2
> Feb  2 16:46:50 vmc2 crmd[4191]:   notice: Initiating action 15: stop 
> VM-radnets_stop_0 on vmc2.ucar.edu (local)
> Feb  2 16:46:51 vmc2 Xen(VM-radnets)[1016]: INFO: Xen domain radnets will be 
> stopped (timeout: 80s)

If that stop operation failed for any reason, fencing could be expected.
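
(You can replay the decision from the pe-input file mentioned in your log,
assuming it survived the reboot:

  crm_simulate -S -x /var/lib/pacemaker/pengine/pe-input-317.bz2

Note that a failed or timed-out stop escalates to fencing by default when
stonith is enabled, so an expired 80s stop timeout would be enough to get
vmc2 shot.)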

> Feb  2 16:46:52 vmc2 kernel: device vif21.0 entered promiscuous mode
> Feb  2 16:46:52 vmc2 kernel: IPv6: ADDRCONF(NETDEV_UP): vif21.0: link is not 
> ready
> Feb  2 16:46:57 vmc2 kernel: xen-blkback:ring-ref 9, event-channel 10, 
> protocol 1 (x86_64-abi)
> Feb  2 16:46:57 vmc2 kernel: vif vif-21-0 vif21.0: Guest Rx ready
> Feb  2 16:46:57 vmc2 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): vif21.0: link 
> becomes ready
> Feb  2 16:46:57 vmc2 kernel: xenbr1: port 2(vif21.0) entered forwarding state
> Feb  2 16:46:57 vmc2 kernel: xenbr1: port 2(vif21.0) entered forwarding state
> Feb  2 16:47:12 vmc2 kernel: xenbr1: port 2(vif21.0) entered forwarding state
>
>  (and then there are a bunch of null bytes, and the log resumes with reboot)

Remote logging helps a lot with such issues.
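
For example, a minimal rsyslog setup (loghost.example.com is only a
placeholder) would have kept the tail end of vmc2's log:

  # on each cluster node, e.g. /etc/rsyslog.d/remote.conf
  # @@ forwards over TCP; a single @ would use UDP
  *.*  @@loghost.example.com:514

  # on the central log host
  module(load="imtcp")
  input(type="imtcp" port="514")
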
-- 
Feri