Good morning. I am running Corosync 1.4.0 with Pacemaker 1.0.11. I've had a reappearance of the issue that I brought to the list before; this time, however, it was triggered by external stimulus.

We use Corosync in a fairly high-pressure environment. The cluster in question has five nodes, lb6 through lb10 (their hostnames appear in the logs below), with 32 configured resources; it is a symmetric cluster running on top of Xen domUs. This morning, one of the containers was the target of a network attack, which set off a cascade that left the entire cluster unable to communicate and ultimately brought it down.
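In case the configuration matters: the totem section of our corosync.conf is essentially stock. I'm reconstructing it from memory, so treat the timer values and the multicast address below as assumptions; the bind network is the 192.168.255.0/24 ring you'll see in the logs.

    totem {
            version: 2
            secauth: off
            threads: 0
            # timer values are the documented 1.x defaults; we have not tuned them
            token: 1000
            token_retransmits_before_loss_const: 4
            consensus: 1200
            join: 50
            interface {
                    ringnumber: 0
                    bindnetaddr: 192.168.255.0
                    mcastaddr: 226.94.1.1
                    mcastport: 5405
            }
    }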
The only information I have from Corosync is this:

== LB6 ==

> Aug 23 07:00:46 lb6 corosync[10610]: [TOTEM ] FAILED TO RECEIVE

== LB7 ==

> Aug 23 06:58:46 lb7 corosync[10835]: [TOTEM ] Retransmit List: bf0f bf11 bf12 bf13 bf14 bf15 bf16 bf17 bf18 bf19 bf1a bf1b bf1c bf1d bf1e bf1f bf20 bf21 bf22 bf23 bf24 bf25 bf26 bf27 bf28 bf29 bf2a bf2b bf2c bf2d
> Aug 23 06:58:46 lb7 corosync[10835]: [TOTEM ] Retransmit List: bf0f bf11 bf12 bf13 bf14 bf15 bf16 bf17 bf18 bf19 bf1a bf1b bf1c bf1d bf1e bf1f bf20 bf21 bf22 bf23 bf24 bf25 bf26 bf27 bf28 bf29 bf2a bf2b bf2c bf2d
> Aug 23 06:59:47 lb7 corosync[10835]: last message repeated 158 times
> Aug 23 06:59:48 lb7 corosync[10835]: last message repeated 2 times
> Aug 23 06:59:48 lb7 corosync[10835]: [TOTEM ] Retransmit List: bf0f bf11 bf12 bf13 bf14 bf15 bf16 bf17 bf18 bf19 bf1a bf1b bf1c bf1d bf1e bf1f bf20 bf21 bf22 bf23 bf24 bf25 bf26 bf27 bf28 bf29 bf2a bf2b bf2c bf2d
> Aug 23 07:00:46 lb7 corosync[10835]: last message repeated 152 times
> Aug 23 07:00:46 lb7 corosync[10835]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 2224: memb=4, new=0, lost=1
> Aug 23 07:00:46 lb7 corosync[10835]: [pcmk ] info: pcmk_peer_update: memb: lb7 7
> Aug 23 07:00:46 lb7 corosync[10835]: [pcmk ] info: pcmk_peer_update: memb: lb8 8
> Aug 23 07:00:46 lb7 corosync[10835]: [pcmk ] info: pcmk_peer_update: memb: lb9 9
> Aug 23 07:00:46 lb7 corosync[10835]: [pcmk ] info: pcmk_peer_update: memb: lb10 10
> Aug 23 07:00:46 lb7 corosync[10835]: [pcmk ] info: pcmk_peer_update: lost: lb6 6
> Aug 23 07:00:46 lb7 corosync[10835]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 2224: memb=4, new=0, lost=0
> Aug 23 07:00:46 lb7 corosync[10835]: [pcmk ] info: pcmk_peer_update: MEMB: lb7 7
> Aug 23 07:00:46 lb7 corosync[10835]: [pcmk ] info: pcmk_peer_update: MEMB: lb8 8
> Aug 23 07:00:46 lb7 corosync[10835]: [pcmk ] info: pcmk_peer_update: MEMB: lb9 9
> Aug 23 07:00:46 lb7 corosync[10835]: [pcmk ] info: pcmk_peer_update: MEMB: lb10 10
> Aug 23 07:00:46 lb7 corosync[10835]: [pcmk ] info: ais_mark_unseen_peer_dead: Node lb6 was not seen in the previous transition
> Aug 23 07:00:46 lb7 corosync[10835]: [pcmk ] info: update_member: Node 6/lb6 is now: lost

== LB8 ==

> Aug 23 06:57:58 lb8 corosync[2014]: [TOTEM ] Retransmit List: bf21 bf22 bf23 bf24 bf25 bf26 bf27 bf28 bf29 bf2a bf2b bf2c bf2d
> Aug 23 06:58:58 lb8 corosync[2014]: last message repeated 157 times
> Aug 23 06:58:58 lb8 corosync[2014]: [TOTEM ] Retransmit List: bf21 bf22 bf23 bf24 bf25 bf26 bf27 bf28 bf29 bf2a bf2b bf2c bf2d
> Aug 23 06:59:59 lb8 corosync[2014]: last message repeated 159 times
> Aug 23 06:59:59 lb8 corosync[2014]: [TOTEM ] Retransmit List: bf21 bf22 bf23 bf24 bf25 bf26 bf27 bf28 bf29 bf2a bf2b bf2c bf2d
> Aug 23 07:00:46 lb8 corosync[2014]: last message repeated 123 times
> Aug 23 07:00:46 lb8 corosync[2014]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 2224: memb=4, new=0, lost=1
> Aug 23 07:00:46 lb8 corosync[2014]: [pcmk ] info: pcmk_peer_update: memb: lb7 7
> Aug 23 07:00:46 lb8 corosync[2014]: [pcmk ] info: pcmk_peer_update: memb: lb8 8
> Aug 23 07:00:46 lb8 corosync[2014]: [pcmk ] info: pcmk_peer_update: memb: lb9 9
> Aug 23 07:00:46 lb8 corosync[2014]: [pcmk ] info: pcmk_peer_update: memb: lb10 10
> Aug 23 07:00:46 lb8 corosync[2014]: [pcmk ] info: pcmk_peer_update: lost: lb6 6
> Aug 23 07:00:46 lb8 corosync[2014]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 2224: memb=4, new=0, lost=0
> Aug 23 07:00:46 lb8 corosync[2014]: [pcmk ] info: pcmk_peer_update: MEMB: lb7 7
> Aug 23 07:00:46 lb8 corosync[2014]: [pcmk ] info: pcmk_peer_update: MEMB: lb8 8
> Aug 23 07:00:46 lb8 corosync[2014]: [pcmk ] info: pcmk_peer_update: MEMB: lb9 9
> Aug 23 07:00:46 lb8 corosync[2014]: [pcmk ] info: pcmk_peer_update: MEMB: lb10 10
> Aug 23 07:00:46 lb8 corosync[2014]: [pcmk ] info: ais_mark_unseen_peer_dead: Node lb6 was not seen in the previous transition
> Aug 23 07:00:46 lb8 corosync[2014]: [pcmk ] info: update_member: Node 6/lb6 is now: lost

== LB9 ==

> Aug 23 07:00:46 lb9 corosync[13255]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 2224: memb=4, new=0, lost=1
> Aug 23 07:00:46 lb9 corosync[13255]: [pcmk ] info: pcmk_peer_update: memb: lb7 7
> Aug 23 07:00:46 lb9 corosync[13255]: [pcmk ] info: pcmk_peer_update: memb: lb8 8
> Aug 23 07:00:46 lb9 corosync[13255]: [pcmk ] info: pcmk_peer_update: memb: lb9 9
> Aug 23 07:00:46 lb9 corosync[13255]: [pcmk ] info: pcmk_peer_update: memb: lb10 10
> Aug 23 07:00:46 lb9 corosync[13255]: [pcmk ] info: pcmk_peer_update: lost: lb6 6
> Aug 23 07:00:46 lb9 corosync[13255]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 2224: memb=4, new=0, lost=0
> Aug 23 07:00:46 lb9 corosync[13255]: [pcmk ] info: pcmk_peer_update: MEMB: lb7 7
> Aug 23 07:00:46 lb9 corosync[13255]: [pcmk ] info: pcmk_peer_update: MEMB: lb8 8
> Aug 23 07:00:46 lb9 corosync[13255]: [pcmk ] info: pcmk_peer_update: MEMB: lb9 9
> Aug 23 07:00:46 lb9 corosync[13255]: [pcmk ] info: pcmk_peer_update: MEMB: lb10 10
> Aug 23 07:00:46 lb9 corosync[13255]: [pcmk ] info: ais_mark_unseen_peer_dead: Node lb6 was not seen in the previous transition
> Aug 23 07:00:46 lb9 corosync[13255]: [pcmk ] info: update_member: Node 6/lb6 is now: lost

== LB10 ==

> Aug 23 07:00:46 lb10 corosync[1994]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 2224: memb=4, new=0, lost=1
> Aug 23 07:00:46 lb10 corosync[1994]: [pcmk ] info: pcmk_peer_update: memb: lb7 7
> Aug 23 07:00:46 lb10 corosync[1994]: [pcmk ] info: pcmk_peer_update: memb: lb8 8
> Aug 23 07:00:46 lb10 corosync[1994]: [pcmk ] info: pcmk_peer_update: memb: lb9 9
> Aug 23 07:00:46 lb10 corosync[1994]: [pcmk ] info: pcmk_peer_update: memb: lb10 10
> Aug 23 07:00:46 lb10 corosync[1994]: [pcmk ] info: pcmk_peer_update: lost: lb6 6
> Aug 23 07:00:46 lb10 corosync[1994]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 2224: memb=4, new=0, lost=0
> Aug 23 07:00:46 lb10 corosync[1994]: [pcmk ] info: pcmk_peer_update: MEMB: lb7 7
> Aug 23 07:00:46 lb10 corosync[1994]: [pcmk ] info: pcmk_peer_update: MEMB: lb8 8
> Aug 23 07:00:46 lb10 corosync[1994]: [pcmk ] info: pcmk_peer_update: MEMB: lb9 9
> Aug 23 07:00:46 lb10 corosync[1994]: [pcmk ] info: pcmk_peer_update: MEMB: lb10 10
> Aug 23 07:00:46 lb10 corosync[1994]: [pcmk ] info: ais_mark_unseen_peer_dead: Node lb6 was not seen in the previous transition
> Aug 23 07:00:46 lb10 corosync[1994]: [pcmk ] info: update_member: Node 6/lb6 is now: lost

So far, that looks fine: LB6 failed to receive for whatever reason, and the other load balancers isolated it. However, things got much worse from there. By the time I logged in to look at the cluster, it had split into three partitions: LB10 by itself, LB6 and LB7 together, and LB8 and LB9 together. None of them had quorum, obviously (a majority here is three of five). I restarted Corosync on all five load balancers at once, and instead of rejoining, each node partitioned by itself and considered the other four offline. On LB8, only lrmd was running; Corosync and the other daemons had disappeared entirely.
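For the record, the restart was roughly the following on each node; treat the exact paths as an assumption, since our init scripts are LSB-style and yours may differ:

    # run on lb6 through lb10 at roughly the same time
    /etc/init.d/corosync restart

    # afterwards, on lb8, checking which daemons had survived
    ps aux | egrep 'corosync|crmd|cib|stonithd|attrd|pengine|lrmd'

Only lrmd showed up in that list on LB8.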
I have a lot of this in the logs:

> Aug 23 08:35:33 lb9 corosync[2050]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 5888: memb=1, new=0, lost=0
> Aug 23 08:35:33 lb9 corosync[2050]: [pcmk ] info: pcmk_peer_update: memb: lb9 9
> Aug 23 08:35:33 lb9 corosync[2050]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 5888: memb=1, new=0, lost=0
> Aug 23 08:35:33 lb9 corosync[2050]: [pcmk ] info: pcmk_peer_update: MEMB: lb9 9
> Aug 23 08:35:33 lb9 corosync[2050]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Aug 23 08:35:33 lb9 corosync[2050]: [CPG ] chosen downlist: sender r(0) ip(192.168.255.4) ; members(old:1 left:0)
> Aug 23 08:35:33 lb9 corosync[2050]: [MAIN ] Completed service synchronization, ready to provide service.
> Aug 23 08:35:36 lb9 corosync[2050]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 5892: memb=1, new=0, lost=0
> Aug 23 08:35:36 lb9 corosync[2050]: [pcmk ] info: pcmk_peer_update: memb: lb9 9
> Aug 23 08:35:36 lb9 corosync[2050]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 5892: memb=1, new=0, lost=0
> Aug 23 08:35:36 lb9 corosync[2050]: [pcmk ] info: pcmk_peer_update: MEMB: lb9 9
> Aug 23 08:35:36 lb9 corosync[2050]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Aug 23 08:35:36 lb9 corosync[2050]: [CPG ] chosen downlist: sender r(0) ip(192.168.255.4) ; members(old:1 left:0)
> Aug 23 08:35:36 lb9 corosync[2050]: [MAIN ] Completed service synchronization, ready to provide service.
> Aug 23 08:35:38 lb9 corosync[2050]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 5896: memb=1, new=0, lost=0
> Aug 23 08:35:38 lb9 corosync[2050]: [pcmk ] info: pcmk_peer_update: memb: lb9 9
> Aug 23 08:35:38 lb9 corosync[2050]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 5896: memb=1, new=0, lost=0
> Aug 23 08:35:38 lb9 corosync[2050]: [pcmk ] info: pcmk_peer_update: MEMB: lb9 9
> Aug 23 08:35:38 lb9 corosync[2050]: [TOTEM ] A processor joined or left the membership and a new membership was formed.

That's obviously from when LB9 was partitioned by itself, but I also have this:

> Aug 23 08:31:40 lb9 corosync[2041]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 5532: memb=3, new=0, lost=0
> Aug 23 08:31:40 lb9 corosync[2041]: [pcmk ] info: pcmk_peer_update: memb: lb7 7
> Aug 23 08:31:40 lb9 corosync[2041]: [pcmk ] info: pcmk_peer_update: memb: lb8 8
> Aug 23 08:31:40 lb9 corosync[2041]: [pcmk ] info: pcmk_peer_update: memb: lb9 9
> Aug 23 08:31:40 lb9 corosync[2041]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 5532: memb=3, new=0, lost=0
> Aug 23 08:31:40 lb9 corosync[2041]: [pcmk ] info: pcmk_peer_update: MEMB: lb7 7
> Aug 23 08:31:40 lb9 corosync[2041]: [pcmk ] info: pcmk_peer_update: MEMB: lb8 8
> Aug 23 08:31:40 lb9 corosync[2041]: [pcmk ] info: pcmk_peer_update: MEMB: lb9 9
> Aug 23 08:31:40 lb9 corosync[2041]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Aug 23 08:31:43 lb9 corosync[2041]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 5536: memb=3, new=0, lost=0
> Aug 23 08:31:43 lb9 corosync[2041]: [pcmk ] info: pcmk_peer_update: memb: lb7 7
> Aug 23 08:31:43 lb9 corosync[2041]: [pcmk ] info: pcmk_peer_update: memb: lb8 8
> Aug 23 08:31:43 lb9 corosync[2041]: [pcmk ] info: pcmk_peer_update: memb: lb9 9
> Aug 23 08:31:43 lb9 corosync[2041]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 5536: memb=3, new=0, lost=0
> Aug 23 08:31:43 lb9 corosync[2041]: [pcmk ] info: pcmk_peer_update: MEMB: lb7 7
> Aug 23 08:31:43 lb9 corosync[2041]: [pcmk ] info: pcmk_peer_update: MEMB: lb8 8
> Aug 23 08:31:43 lb9 corosync[2041]: [pcmk ] info: pcmk_peer_update: MEMB: lb9 9
> Aug 23 08:31:43 lb9 corosync[2041]: [TOTEM ] A processor joined or left the membership and a new membership was formed.

To fix this at all, I had to reboot all five instances.

I realize this is a long message, but I have some questions. Why does Corosync get into a state where it cannot see the other nodes, even after a restart, until the machine is rebooted? Is this triggering something in Linux itself? Why does Corosync flap like that, announcing new membership events when the membership hasn't changed?

The bigger question, though, is why a failing container causes a cascading failure of the entire cluster. I understand if one of the Corosync instances asserts out, but I've never seen that happen without it taking the other four offline as well. Instability like this worries us. That isn't meant to sound demanding, as I understand that this is open-source software, but we're just confused. All of the downtime I have experienced with this product thus far has been directly attributable to Corosync, I'm afraid. Is there a way to make a failing instance -- say, an instance that suddenly starts speaking Japanese, even -- not hose the rest of the cluster? Why are the other Corosync instances so sensitive to a failing node? It doesn't seem very highly available.

Guidance on where to go from here is certainly welcome, and I appreciate any and all help. Again, we're just asking questions, and I don't mean to sound insulting or ungrateful for an otherwise great piece of software.

--
Jed Smith
[email protected]
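P.S. In case it helps anyone answer: the next time this happens, before rebooting, I plan to capture the totem and multicast state on each domU with something like the following. The interface name and port are assumptions on my part (whichever interface carries the 192.168.255.0/24 ring; 5405 is only the default mcastport):

    corosync-cfgtool -s                        # ring status as corosync sees it
    netstat -gn                                # kernel multicast group memberships
    tcpdump -c 50 -ni eth0 udp and port 5405   # is totem traffic reaching the wire at all?

If the kernel's multicast/IGMP state on the Xen bridge is what gets wedged, I would expect the tcpdump to go quiet even while Corosync keeps forming single-node rings.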
