Hi,
We have a strange case of openais (whitetank) failing in a
12-node cluster. After stopping openais on node 11, the other
nodes recover and form an 11-node cluster without any problems.
However, after stopping openais on node 7 as well, the cluster
falls apart into ten 1-node partitions.
No node enters the RECOVERY state. There are numerous messages
about tokens lost in the COMMIT state. All nodes just loop
without making any progress.
Interestingly, the nodes fall into two groups that behave
differently: the first consists of nodes 1-6, the second of nodes
8-10 and 12. Nodes in the first group send tokens to the multicast
address once every token timeout (5s), then report that the
token was lost, then reenter the COMMIT state:
May 19 17:14:46 node02 openais[11281]: [TOTEM] entering GATHER state from 12
May 19 17:14:52 node02 openais[11281]: [TOTEM] entering GATHER state from 11
May 19 17:14:53 node02 openais[11281]: [TOTEM] Saving state aru bf high seq received bf
May 19 17:14:53 node02 openais[11281]: [TOTEM] Storing new sequence id for ring 374e00
May 19 17:14:53 node02 openais[11281]: [TOTEM] entering COMMIT state.
May 19 17:14:58 node02 openais[11281]: [TOTEM] The token was lost in the COMMIT state.
May 19 17:14:58 node02 openais[11281]: [TOTEM] entering GATHER state from 4.
May 19 17:14:58 node02 openais[11281]: [TOTEM] Storing new sequence id for ring 374e04
May 19 17:14:58 node02 openais[11281]: [TOTEM] entering COMMIT state.
May 19 17:15:03 node02 openais[11281]: [TOTEM] The token was lost in the COMMIT state.
May 19 17:15:03 node02 openais[11281]: [TOTEM] entering GATHER state from 4.
...
Nodes in the second group send tokens more often (once a second),
enter only the GATHER state, and report that the consensus
timeout expired once every 11s:
May 19 17:14:46 node10 openais[20326]: [TOTEM] entering GATHER state from 12
May 19 17:14:52 node10 openais[20326]: [TOTEM] entering GATHER state from 11
May 19 17:15:03 node10 openais[20326]: [TOTEM] The consensus timeout expired
May 19 17:15:03 node10 openais[20326]: [TOTEM] entering GATHER state from 3.
May 19 17:15:14 node10 openais[20326]: [TOTEM] The consensus timeout expired
...
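For anyone who wants to check the 5s and 11s cadence against their own
logs, here is a rough Python sketch that measures the gaps between
repeated TOTEM messages per node. The sample lines are copied from the
excerpts above (with wrapped lines joined). The helper names parse()
and gaps() and the placeholder year are made up for illustration, since
syslog timestamps carry no year.

```python
import re
from datetime import datetime

# Sample lines taken from the excerpts above, wrapped lines joined.
LOG = """\
May 19 17:14:53 node02 openais[11281]: [TOTEM] entering COMMIT state.
May 19 17:14:58 node02 openais[11281]: [TOTEM] The token was lost in the COMMIT state.
May 19 17:14:58 node02 openais[11281]: [TOTEM] entering COMMIT state.
May 19 17:15:03 node02 openais[11281]: [TOTEM] The token was lost in the COMMIT state.
May 19 17:14:52 node10 openais[20326]: [TOTEM] entering GATHER state from 11
May 19 17:15:03 node10 openais[20326]: [TOTEM] The consensus timeout expired
May 19 17:15:14 node10 openais[20326]: [TOTEM] The consensus timeout expired
"""

# timestamp, host name, and the message following the [TOTEM] tag
LINE = re.compile(
    r"^(\w{3} +\d+ \d\d:\d\d:\d\d) (\S+) openais\[\d+\]: \[TOTEM\] (.*)$"
)


def parse(log, year=2000):
    """Return (timestamp, node, message) tuples for every TOTEM line.

    The year is a placeholder (assumption): it is only needed so that
    strptime can build a full datetime. Month names are parsed with the
    current locale, so this expects an English/C locale.
    """
    events = []
    for line in log.splitlines():
        m = LINE.match(line)
        if m:
            ts = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")
            events.append((ts, m.group(2), m.group(3)))
    return events


def gaps(events, node, needle):
    """Seconds between consecutive messages on `node` containing `needle`."""
    times = [ts for ts, n, msg in events if n == node and needle in msg]
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


events = parse(LOG)
print("node02 token lost gaps:", gaps(events, "node02", "token was lost"))
print("node10 consensus gaps:", gaps(events, "node10", "consensus timeout expired"))
```

Feeding it a full /var/log/messages instead of LOG should show quickly
whether every node in a group loops with the same period.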
This is on an IBM BladeCenter with 8-way fast nodes. There are no
errors on the network interface and, judging by tcpdump, all
packets appear to be delivered.
The totem configuration:
totem {
    version: 2
    token: 5000
    token_retransmits_before_loss_const: 10
    join: 1000
    consensus: 6000
    vsftype: none
    max_messages: 20
    send_join: 45
    clear_node_high_bit: yes
    secauth: off
    threads: 8
    interface {
        ringnumber: 0
        bindnetaddr: 192.168.58.0
        mcastaddr: 226.94.1.1
        mcastport: 5405
    }
}
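To compare these values with the observed timing, a small Python
sketch like the following can read the block and sum the timeouts.
The parser is a simplified stand-in written for this illustration
(parse_block() is a hypothetical helper, not the openais config
reader), and CONF holds only the relevant keys from the block above.
The 11s loop seen on the second group happens to equal token +
consensus; whether the totem timers are really armed that way is an
assumption, not something confirmed here.

```python
# Relevant part of the totem block above.
CONF = """\
totem {
    token: 5000
    join: 1000
    consensus: 6000
    interface {
        ringnumber: 0
        mcastaddr: 226.94.1.1
        mcastport: 5405
    }
}
"""


def parse_block(text):
    """Parse 'key: value' lines and nested 'name { ... }' blocks into dicts.

    Simplified on purpose: no comments, no repeated block names.
    """
    root = {}
    stack = [root]
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.endswith("{"):
            child = {}
            stack[-1][line[:-1].strip()] = child
            stack.append(child)
        elif line == "}":
            stack.pop()
        else:
            key, _, value = line.partition(":")
            stack[-1][key.strip()] = value.strip()
    return root


totem = parse_block(CONF)["totem"]
token_ms = int(totem["token"])
consensus_ms = int(totem["consensus"])

# Observed: first group loops every 5s, second group every 11s.
print("token timeout (s):", token_ms / 1000)
print("token + consensus (s):", (token_ms + consensus_ms) / 1000)
```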
We tried fiddling a bit with the token, join, and consensus
parameters, but nothing helped.
If anybody can shed some light on the matter I'd really
appreciate it. Please let me know if you need more information.
Cheers,
Dejan
_______________________________________________
Openais mailing list
https://lists.linux-foundation.org/mailman/listinfo/openais