Good morning. I am running Corosync 1.4.0 with Pacemaker 1.0.11. I've had a reappearance of the issue that I brought to the list before; this time, however, it was triggered by external stimulus.

We use Corosync in a fairly high-pressure environment. The cluster in question has five nodes, lb6 through lb10 (their hostnames appear in the logs below), with 32 configured resources; it is a symmetric cluster running on top of Xen domUs. This morning, one of the containers was the target of a network attack, which set off a cascade that left the entire cluster unable to communicate and ultimately brought it down.
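In case the configuration matters: the totem section of our corosync.conf is essentially stock. I'm reconstructing it from memory, so treat the timer values and the multicast address below as assumptions; the bind network is the 192.168.255.0/24 ring you'll see in the logs.

    totem {
            version: 2
            secauth: off
            threads: 0
            # timer values are the documented 1.x defaults; we have not tuned them
            token: 1000
            token_retransmits_before_loss_const: 4
            consensus: 1200
            join: 50
            interface {
                    ringnumber: 0
                    bindnetaddr: 192.168.255.0
                    mcastaddr: 226.94.1.1
                    mcastport: 5405
            }
    }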
The only information I have from Corosync is this:

== LB6 ==

> Aug 23 07:00:46 lb6 corosync[10610]: [TOTEM ] FAILED TO RECEIVE

== LB7 ==

> Aug 23 06:58:46 lb7 corosync[10835]: [TOTEM ] Retransmit List: bf0f bf11 bf12 bf13 bf14 bf15 bf16 bf17 bf18 bf19 bf1a bf1b bf1c bf1d bf1e bf1f bf20 bf21 bf22 bf23 bf24 bf25 bf26 bf27 bf28 bf29 bf2a bf2b bf2c bf2d
> Aug 23 06:58:46 lb7 corosync[10835]: [TOTEM ] Retransmit List: bf0f bf11 bf12 bf13 bf14 bf15 bf16 bf17 bf18 bf19 bf1a bf1b bf1c bf1d bf1e bf1f bf20 bf21 bf22 bf23 bf24 bf25 bf26 bf27 bf28 bf29 bf2a bf2b bf2c bf2d
> Aug 23 06:59:47 lb7 corosync[10835]: last message repeated 158 times
> Aug 23 06:59:48 lb7 corosync[10835]: last message repeated 2 times
> Aug 23 06:59:48 lb7 corosync[10835]: [TOTEM ] Retransmit List: bf0f bf11 bf12 bf13 bf14 bf15 bf16 bf17 bf18 bf19 bf1a bf1b bf1c bf1d bf1e bf1f bf20 bf21 bf22 bf23 bf24 bf25 bf26 bf27 bf28 bf29 bf2a bf2b bf2c bf2d
> Aug 23 07:00:46 lb7 corosync[10835]: last message repeated 152 times
> Aug 23 07:00:46 lb7 corosync[10835]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 2224: memb=4, new=0, lost=1
> Aug 23 07:00:46 lb7 corosync[10835]: [pcmk ] info: pcmk_peer_update: memb: lb7 7
> Aug 23 07:00:46 lb7 corosync[10835]: [pcmk ] info: pcmk_peer_update: memb: lb8 8
> Aug 23 07:00:46 lb7 corosync[10835]: [pcmk ] info: pcmk_peer_update: memb: lb9 9
> Aug 23 07:00:46 lb7 corosync[10835]: [pcmk ] info: pcmk_peer_update: memb: lb10 10
> Aug 23 07:00:46 lb7 corosync[10835]: [pcmk ] info: pcmk_peer_update: lost: lb6 6
> Aug 23 07:00:46 lb7 corosync[10835]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 2224: memb=4, new=0, lost=0
> Aug 23 07:00:46 lb7 corosync[10835]: [pcmk ] info: pcmk_peer_update: MEMB: lb7 7
> Aug 23 07:00:46 lb7 corosync[10835]: [pcmk ] info: pcmk_peer_update: MEMB: lb8 8
> Aug 23 07:00:46 lb7 corosync[10835]: [pcmk ] info: pcmk_peer_update: MEMB: lb9 9
> Aug 23 07:00:46 lb7 corosync[10835]: [pcmk ] info: pcmk_peer_update: MEMB: lb10 10
> Aug 23 07:00:46 lb7 corosync[10835]: [pcmk ] info: ais_mark_unseen_peer_dead: Node lb6 was not seen in the previous transition
> Aug 23 07:00:46 lb7 corosync[10835]: [pcmk ] info: update_member: Node 6/lb6 is now: lost

== LB8 ==

> Aug 23 06:57:58 lb8 corosync[2014]: [TOTEM ] Retransmit List: bf21 bf22 bf23 bf24 bf25 bf26 bf27 bf28 bf29 bf2a bf2b bf2c bf2d
> Aug 23 06:58:58 lb8 corosync[2014]: last message repeated 157 times
> Aug 23 06:58:58 lb8 corosync[2014]: [TOTEM ] Retransmit List: bf21 bf22 bf23 bf24 bf25 bf26 bf27 bf28 bf29 bf2a bf2b bf2c bf2d
> Aug 23 06:59:59 lb8 corosync[2014]: last message repeated 159 times
> Aug 23 06:59:59 lb8 corosync[2014]: [TOTEM ] Retransmit List: bf21 bf22 bf23 bf24 bf25 bf26 bf27 bf28 bf29 bf2a bf2b bf2c bf2d
> Aug 23 07:00:46 lb8 corosync[2014]: last message repeated 123 times
> Aug 23 07:00:46 lb8 corosync[2014]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 2224: memb=4, new=0, lost=1
> Aug 23 07:00:46 lb8 corosync[2014]: [pcmk ] info: pcmk_peer_update: memb: lb7 7
> Aug 23 07:00:46 lb8 corosync[2014]: [pcmk ] info: pcmk_peer_update: memb: lb8 8
> Aug 23 07:00:46 lb8 corosync[2014]: [pcmk ] info: pcmk_peer_update: memb: lb9 9
> Aug 23 07:00:46 lb8 corosync[2014]: [pcmk ] info: pcmk_peer_update: memb: lb10 10
> Aug 23 07:00:46 lb8 corosync[2014]: [pcmk ] info: pcmk_peer_update: lost: lb6 6
> Aug 23 07:00:46 lb8 corosync[2014]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 2224: memb=4, new=0, lost=0
> Aug 23 07:00:46 lb8 corosync[2014]: [pcmk ] info: pcmk_peer_update: MEMB: lb7 7
> Aug 23 07:00:46 lb8 corosync[2014]: [pcmk ] info: pcmk_peer_update: MEMB: lb8 8
> Aug 23 07:00:46 lb8 corosync[2014]: [pcmk ] info: pcmk_peer_update: MEMB: lb9 9
> Aug 23 07:00:46 lb8 corosync[2014]: [pcmk ] info: pcmk_peer_update: MEMB: lb10 10
> Aug 23 07:00:46 lb8 corosync[2014]: [pcmk ] info: ais_mark_unseen_peer_dead: Node lb6 was not seen in the previous transition
> Aug 23 07:00:46 lb8 corosync[2014]: [pcmk ] info: update_member: Node 6/lb6 is now: lost

== LB9 ==

> Aug 23 07:00:46 lb9 corosync[13255]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 2224: memb=4, new=0, lost=1
> Aug 23 07:00:46 lb9 corosync[13255]: [pcmk ] info: pcmk_peer_update: memb: lb7 7
> Aug 23 07:00:46 lb9 corosync[13255]: [pcmk ] info: pcmk_peer_update: memb: lb8 8
> Aug 23 07:00:46 lb9 corosync[13255]: [pcmk ] info: pcmk_peer_update: memb: lb9 9
> Aug 23 07:00:46 lb9 corosync[13255]: [pcmk ] info: pcmk_peer_update: memb: lb10 10
> Aug 23 07:00:46 lb9 corosync[13255]: [pcmk ] info: pcmk_peer_update: lost: lb6 6
> Aug 23 07:00:46 lb9 corosync[13255]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 2224: memb=4, new=0, lost=0
> Aug 23 07:00:46 lb9 corosync[13255]: [pcmk ] info: pcmk_peer_update: MEMB: lb7 7
> Aug 23 07:00:46 lb9 corosync[13255]: [pcmk ] info: pcmk_peer_update: MEMB: lb8 8
> Aug 23 07:00:46 lb9 corosync[13255]: [pcmk ] info: pcmk_peer_update: MEMB: lb9 9
> Aug 23 07:00:46 lb9 corosync[13255]: [pcmk ] info: pcmk_peer_update: MEMB: lb10 10
> Aug 23 07:00:46 lb9 corosync[13255]: [pcmk ] info: ais_mark_unseen_peer_dead: Node lb6 was not seen in the previous transition
> Aug 23 07:00:46 lb9 corosync[13255]: [pcmk ] info: update_member: Node 6/lb6 is now: lost

== LB10 ==

> Aug 23 07:00:46 lb10 corosync[1994]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 2224: memb=4, new=0, lost=1
> Aug 23 07:00:46 lb10 corosync[1994]: [pcmk ] info: pcmk_peer_update: memb: lb7 7
> Aug 23 07:00:46 lb10 corosync[1994]: [pcmk ] info: pcmk_peer_update: memb: lb8 8
> Aug 23 07:00:46 lb10 corosync[1994]: [pcmk ] info: pcmk_peer_update: memb: lb9 9
> Aug 23 07:00:46 lb10 corosync[1994]: [pcmk ] info: pcmk_peer_update: memb: lb10 10
> Aug 23 07:00:46 lb10 corosync[1994]: [pcmk ] info: pcmk_peer_update: lost: lb6 6
> Aug 23 07:00:46 lb10 corosync[1994]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 2224: memb=4, new=0, lost=0
> Aug 23 07:00:46 lb10 corosync[1994]: [pcmk ] info: pcmk_peer_update: MEMB: lb7 7
> Aug 23 07:00:46 lb10 corosync[1994]: [pcmk ] info: pcmk_peer_update: MEMB: lb8 8
> Aug 23 07:00:46 lb10 corosync[1994]: [pcmk ] info: pcmk_peer_update: MEMB: lb9 9
> Aug 23 07:00:46 lb10 corosync[1994]: [pcmk ] info: pcmk_peer_update: MEMB: lb10 10
> Aug 23 07:00:46 lb10 corosync[1994]: [pcmk ] info: ais_mark_unseen_peer_dead: Node lb6 was not seen in the previous transition
> Aug 23 07:00:46 lb10 corosync[1994]: [pcmk ] info: update_member: Node 6/lb6 is now: lost

So far, that looks fine: LB6 failed to receive for whatever reason, and the other load balancers isolated it. However, things got much worse from there. By the time I logged in to look at the cluster, it had split into three partitions: LB10 by itself, LB6 and LB7 together, and LB8 and LB9 together. None of them had quorum, obviously (a majority here is three of five). I restarted Corosync on all five load balancers at once, and instead of rejoining, each node partitioned by itself and considered the other four offline. On LB8, only lrmd was running; Corosync and the other daemons had disappeared entirely.
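For the record, the restart was roughly the following on each node; treat the exact paths as an assumption, since our init scripts are LSB-style and yours may differ:

    # run on lb6 through lb10 at roughly the same time
    /etc/init.d/corosync restart

    # afterwards, on lb8, checking which daemons had survived
    ps aux | egrep 'corosync|crmd|cib|stonithd|attrd|pengine|lrmd'

Only lrmd showed up in that list on LB8.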
I have a lot of this in the logs:

> Aug 23 08:35:33 lb9 corosync[2050]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 5888: memb=1, new=0, lost=0
> Aug 23 08:35:33 lb9 corosync[2050]: [pcmk ] info: pcmk_peer_update: memb: lb9 9
> Aug 23 08:35:33 lb9 corosync[2050]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 5888: memb=1, new=0, lost=0
> Aug 23 08:35:33 lb9 corosync[2050]: [pcmk ] info: pcmk_peer_update: MEMB: lb9 9
> Aug 23 08:35:33 lb9 corosync[2050]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Aug 23 08:35:33 lb9 corosync[2050]: [CPG ] chosen downlist: sender r(0) ip(192.168.255.4) ; members(old:1 left:0)
> Aug 23 08:35:33 lb9 corosync[2050]: [MAIN ] Completed service synchronization, ready to provide service.
> Aug 23 08:35:36 lb9 corosync[2050]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 5892: memb=1, new=0, lost=0
> Aug 23 08:35:36 lb9 corosync[2050]: [pcmk ] info: pcmk_peer_update: memb: lb9 9
> Aug 23 08:35:36 lb9 corosync[2050]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 5892: memb=1, new=0, lost=0
> Aug 23 08:35:36 lb9 corosync[2050]: [pcmk ] info: pcmk_peer_update: MEMB: lb9 9
> Aug 23 08:35:36 lb9 corosync[2050]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Aug 23 08:35:36 lb9 corosync[2050]: [CPG ] chosen downlist: sender r(0) ip(192.168.255.4) ; members(old:1 left:0)
> Aug 23 08:35:36 lb9 corosync[2050]: [MAIN ] Completed service synchronization, ready to provide service.
> Aug 23 08:35:38 lb9 corosync[2050]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 5896: memb=1, new=0, lost=0
> Aug 23 08:35:38 lb9 corosync[2050]: [pcmk ] info: pcmk_peer_update: memb: lb9 9
> Aug 23 08:35:38 lb9 corosync[2050]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 5896: memb=1, new=0, lost=0
> Aug 23 08:35:38 lb9 corosync[2050]: [pcmk ] info: pcmk_peer_update: MEMB: lb9 9
> Aug 23 08:35:38 lb9 corosync[2050]: [TOTEM ] A processor joined or left the membership and a new membership was formed.

That's obviously from when LB9 was partitioned by itself, but I also have this:

> Aug 23 08:31:40 lb9 corosync[2041]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 5532: memb=3, new=0, lost=0
> Aug 23 08:31:40 lb9 corosync[2041]: [pcmk ] info: pcmk_peer_update: memb: lb7 7
> Aug 23 08:31:40 lb9 corosync[2041]: [pcmk ] info: pcmk_peer_update: memb: lb8 8
> Aug 23 08:31:40 lb9 corosync[2041]: [pcmk ] info: pcmk_peer_update: memb: lb9 9
> Aug 23 08:31:40 lb9 corosync[2041]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 5532: memb=3, new=0, lost=0
> Aug 23 08:31:40 lb9 corosync[2041]: [pcmk ] info: pcmk_peer_update: MEMB: lb7 7
> Aug 23 08:31:40 lb9 corosync[2041]: [pcmk ] info: pcmk_peer_update: MEMB: lb8 8
> Aug 23 08:31:40 lb9 corosync[2041]: [pcmk ] info: pcmk_peer_update: MEMB: lb9 9
> Aug 23 08:31:40 lb9 corosync[2041]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Aug 23 08:31:43 lb9 corosync[2041]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 5536: memb=3, new=0, lost=0
> Aug 23 08:31:43 lb9 corosync[2041]: [pcmk ] info: pcmk_peer_update: memb: lb7 7
> Aug 23 08:31:43 lb9 corosync[2041]: [pcmk ] info: pcmk_peer_update: memb: lb8 8
> Aug 23 08:31:43 lb9 corosync[2041]: [pcmk ] info: pcmk_peer_update: memb: lb9 9
> Aug 23 08:31:43 lb9 corosync[2041]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 5536: memb=3, new=0, lost=0
> Aug 23 08:31:43 lb9 corosync[2041]: [pcmk ] info: pcmk_peer_update: MEMB: lb7 7
> Aug 23 08:31:43 lb9 corosync[2041]: [pcmk ] info: pcmk_peer_update: MEMB: lb8 8
> Aug 23 08:31:43 lb9 corosync[2041]: [pcmk ] info: pcmk_peer_update: MEMB: lb9 9
> Aug 23 08:31:43 lb9 corosync[2041]: [TOTEM ] A processor joined or left the membership and a new membership was formed.

To fix this at all, I had to reboot all five instances.

I realize this is a long message, but I have some questions. Why does Corosync get into a state where it cannot see the other nodes, even after a restart, until the machine is rebooted? Is this triggering something in Linux itself? Why does Corosync flap like that, announcing new membership events when the membership hasn't changed?

The bigger question, though, is why a failing container causes a cascading failure of the entire cluster. I understand if one of the Corosync instances asserts out, but I've never seen that happen without it taking the other four offline as well. Instability like this worries us. That isn't meant to sound demanding, as I understand that this is open-source software, but we're just confused. All of the downtime I have experienced with this product thus far has been directly attributable to Corosync, I'm afraid. Is there a way to make a failing instance -- say, an instance that suddenly starts speaking Japanese, even -- not hose the rest of the cluster? Why are the other Corosync instances so sensitive to a failing node? It doesn't seem very highly available.

Guidance on where to go from here is certainly welcome, and I appreciate any and all help. Again, we're just asking questions, and I don't mean to sound insulting or ungrateful for an otherwise great piece of software.

--
Jed Smith
[email protected]
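P.S. In case it helps anyone answer: the next time this happens, before rebooting, I plan to capture the totem and multicast state on each domU with something like the following. The interface name and port are assumptions on my part (whichever interface carries the 192.168.255.0/24 ring; 5405 is only the default mcastport):

    corosync-cfgtool -s                        # ring status as corosync sees it
    netstat -gn                                # kernel multicast group memberships
    tcpdump -c 50 -ni eth0 udp and port 5405   # is totem traffic reaching the wire at all?

If the kernel's multicast/IGMP state on the Xen bridge is what gets wedged, I would expect the tcpdump to go quiet even while Corosync keeps forming single-node rings.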
