For the second time in a few weeks, one node of a particular cluster has been fenced, and it isn't entirely clear why. On the surviving node I see:
Feb 2 16:48:52 vmc1 stonith-ng[4331]: notice: stonith-vm2 can fence (reboot) vmc2.ucar.edu: static-list
Feb 2 16:48:52 vmc1 stonith-ng[4331]: notice: stonith-vm2 can fence (reboot) vmc2.ucar.edu: static-list
Feb 2 16:49:00 vmc1 kernel: igb 0000:03:00.1 eth3: igb: eth3 NIC Link is Down
Feb 2 16:49:00 vmc1 kernel: xenbr0: port 1(eth3) entered disabled state
Feb 2 16:49:01 vmc1 corosync[2846]: [TOTEM ] A processor failed, forming new configuration.

OK, so from this point of view it looks like the link between the two hosts was lost, which resulted in the fencing. The link is a crossover cable, so there is no networking hardware involved other than the host NICs and the cable itself. On the other side I see:

Feb 2 16:46:46 vmc2 kernel: xenbr1: port 16(vif17.0) entered disabled state
Feb 2 16:46:46 vmc2 kernel: xen:balloon: Cannot add additional memory (-17)
Feb 2 16:46:47 vmc2 kernel: xen:balloon: Cannot add additional memory (-17)
Feb 2 16:46:48 vmc2 kernel: xenbr1: port 16(vif17.0) entered disabled state
Feb 2 16:46:48 vmc2 kernel: device vif17.0 left promiscuous mode
Feb 2 16:46:48 vmc2 kernel: xenbr1: port 16(vif17.0) entered disabled state
Feb 2 16:46:48 vmc2 kernel: xen:balloon: Cannot add additional memory (-17)
Feb 2 16:46:49 vmc2 crmd[4191]: notice: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Feb 2 16:46:49 vmc2 attrd[4189]: notice: Sending flush op to all hosts for: fail-count-VM-radnets (1)
Feb 2 16:46:49 vmc2 attrd[4189]: notice: Sent update 37: fail-count-VM-radnets=1
Feb 2 16:46:49 vmc2 attrd[4189]: notice: Sending flush op to all hosts for: last-failure-VM-radnets (1486079209)
Feb 2 16:46:49 vmc2 attrd[4189]: notice: Sent update 39: last-failure-VM-radnets=1486079209
Feb 2 16:46:50 vmc2 pengine[4190]: notice: On loss of CCM Quorum: Ignore
Feb 2 16:46:50 vmc2 pengine[4190]: warning: Processing failed op monitor for VM-radnets on vmc2.ucar.edu: not running (7)
Feb 2 16:46:50 vmc2 pengine[4190]: notice: Recover VM-radnets#011(Started vmc2.ucar.edu)
Feb 2 16:46:50 vmc2 pengine[4190]: notice: Calculated Transition 2914: /var/lib/pacemaker/pengine/pe-input-317.bz2
Feb 2 16:46:50 vmc2 crmd[4191]: notice: Initiating action 15: stop VM-radnets_stop_0 on vmc2.ucar.edu (local)
Feb 2 16:46:51 vmc2 Xen(VM-radnets)[1016]: INFO: Xen domain radnets will be stopped (timeout: 80s)
Feb 2 16:46:52 vmc2 kernel: device vif21.0 entered promiscuous mode
Feb 2 16:46:52 vmc2 kernel: IPv6: ADDRCONF(NETDEV_UP): vif21.0: link is not ready
Feb 2 16:46:57 vmc2 kernel: xen-blkback:ring-ref 9, event-channel 10, protocol 1 (x86_64-abi)
Feb 2 16:46:57 vmc2 kernel: vif vif-21-0 vif21.0: Guest Rx ready
Feb 2 16:46:57 vmc2 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): vif21.0: link becomes ready
Feb 2 16:46:57 vmc2 kernel: xenbr1: port 2(vif21.0) entered forwarding state
Feb 2 16:46:57 vmc2 kernel: xenbr1: port 2(vif21.0) entered forwarding state
Feb 2 16:47:12 vmc2 kernel: xenbr1: port 2(vif21.0) entered forwarding state

(and then there are a bunch of null bytes, and the log resumes with the reboot)

There are more messages about networking here, but xenbr1 is not the bridge device associated with the NIC in question. I don't see any reason why the link between the hosts should suddenly stop working, so I suspect a hardware problem that only crops up rarely (but will most likely get worse over time); the checks I'm planning to run are in the P.S. below. Is there anything anyone can see in the logs that would suggest otherwise?

Thank you,
--Greg
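P.S. To chase the hardware theory, I'm planning to watch the error counters on the interconnect NICs on both hosts between incidents (assuming eth3 is the interconnect interface on both sides, as it is on vmc1 above; adjust the name as needed):

    # Driver-level counters; climbing CRC/symbol errors usually point at
    # the cable or the PHY rather than at software
    ethtool -S eth3 | grep -iE 'err|crc|drop'

    # Kernel-level RX/TX statistics for the same interface
    ip -s link show eth3

    # Negotiated speed/duplex and current link state
    ethtool eth3

If those counters climb between incidents, that would back the bad-cable/bad-NIC theory; if they stay at zero right up until a link-down event, I'd start looking elsewhere.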
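P.P.S. In the meantime I'm also thinking of keeping an eye on the totem ring state so I can catch a flap before it escalates to fencing, along the lines of (a rough sketch; the log path is just an example):

    # Record corosync's own view of the ring(s) once a minute
    while sleep 60; do date; corosync-cfgtool -s; done >> /root/ring-status.log

If anyone thinks a redundant second ring over a spare NIC would be a sensible stopgap here, I'd be interested, though I'd rather find the root cause.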