Major issues:
1) Corosync reaching over 100% CPU usage.
2) Corosync unable to stop gracefully.
3) The Virtual IP of a resource being assigned as the primary IP on an interface after a cable disconnect/reconnect on that interface. The static IP on the interface is then shown as a global secondary IP.
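
For issue 3, the primary/secondary ordering of addresses can be checked with something like the sketch below. It assumes the one-line output format of `ip -o -4 addr show`. The helper name `classify_addrs` and the interface name `eth1` are hypothetical and used only for illustration; they are not taken from the affected setup.

```shell
# Hypothetical helper: reads `ip -o -4 addr show` output on stdin and prints
# "<interface> <address/prefix> <primary|secondary>" for each global address.
classify_addrs() {
    awk '/scope global/ {
        # iproute2 marks non-primary addresses with the word "secondary"
        role = ($0 ~ /scope global secondary/) ? "secondary" : "primary"
        print $2, $4, role
    }'
}

# Usage (eth1 is an assumed interface name):
#   ip -o -4 addr show dev eth1 | classify_addrs
```

In the failure described here, the static IP would be expected to show up as "secondary" and the Virtual IP as "primary".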

Use case:
1) Two nodes in a cluster.
2) Two communication paths exist between the two nodes, with “rrp_mode” set to active in corosync.conf:
   a. One path is a back-to-back connection between the nodes.
   b. The second is via the LAN network switch.
3) The network cable was unplugged on one of the nodes (on both interfaces) and reconnected after a short while.

Observations:
1) The Corosync service was taking 100% CPU on the node whose link was down:
   a. In the above scenario the Corosync service could not be stopped gracefully. A SIGKILL had to be issued to stop the service.
   b. On this node, one of the two interfaces configured in corosync.conf was also the preferred interface (eth) for the Virtual IP.
      i. When the link came back up after the disconnection, the primary global IP on that interface was the Virtual IP configured for a resource.
      ii. The static IP assigned to the interface was listed as “scope global secondary” in the output of `ip addr show`.
      iii. Also, the Virtual IPs of the resources configured in Pacemaker were active on both nodes.
      iv. `service network restart` did not fix this either.
   c. The Corosync service was stopped (killed, since it could not be stopped gracefully), the network service was restarted, and then Corosync was started again. All good after this.

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org