Hi all,
I came back from my weekend to find my 2-node cluster in a state I
didn't expect. I'm hoping someone has insight into what might have
happened.
Situation:
A 2-node cluster (node1 and node2) connected through two corosync rings,
one over the internal network 192.168.11.0 and one over the external
network 77.xxx.xxx.xxx, using rrp_mode: active. The full corosync.conf
is at the end of this mail for reference:
node1:~# corosync-cfgtool -s
Printing ring status.
Local node ID 508223053
RING ID 0
id = 77.xxx.xxx.xxx
status = Incrementing problem counter for seqid 19241103 iface
77.xxx.xxx.xxx to [1 of 10]
RING ID 1
id = 192.168.11.8
status = Marking seqid 19241103 ringid 1 interface 192.168.11.8 FAULTY
- adminisrtative intervention required.
On node2 however all still seems well:
node2:~# corosync-cfgtool -s
Printing ring status.
Local node ID 525000269
RING ID 0
id = 77.xxx.xxx.xxx
status = ring 0 active with no faults
RING ID 1
id = 192.168.11.9
status = ring 1 active with no faults
I then tried to re-enable the rings on node1:
node1:~# corosync-cfgtool -r
This 'hangs' and produces no output for a couple of minutes, after which
I killed it.
From /var/log/syslog on node1 at the time it all went wrong:
Jan 11 02:00:28 node1 corosync[5712]: [TOTEM ] Incrementing problem
counter for seqid 19241075 iface 192.168.11.8 to [1 of 10]
Jan 11 02:00:29 node1 corosync[5712]: [TOTEM ] Incrementing problem
counter for seqid 19241077 iface 192.168.11.8 to [2 of 10]
Jan 11 02:00:29 node1 corosync[5712]: [TOTEM ] Incrementing problem
counter for seqid 19241079 iface 192.168.11.8 to [3 of 10]
Jan 11 02:00:30 node1 corosync[5712]: [TOTEM ] Incrementing problem
counter for seqid 19241081 iface 192.168.11.8 to [4 of 10]
Jan 11 02:00:30 node1 corosync[5712]: [TOTEM ] Decrementing problem
counter for iface 192.168.11.8 to [3 of 10]
Jan 11 02:00:30 node1 corosync[5712]: [TOTEM ] Incrementing problem
counter for seqid 19241083 iface 192.168.11.8 to [4 of 10]
Jan 11 02:00:31 node1 corosync[5712]: [TOTEM ] Incrementing problem
counter for seqid 19241085 iface 192.168.11.8 to [5 of 10]
Jan 11 02:00:32 node1 corosync[5712]: [TOTEM ] Incrementing problem
counter for seqid 19241087 iface 192.168.11.8 to [6 of 10]
Jan 11 02:00:32 node1 corosync[5712]: [TOTEM ] Incrementing problem
counter for seqid 19241089 iface 192.168.11.8 to [7 of 10]
Jan 11 02:00:32 node1 corosync[5712]: [TOTEM ] Decrementing problem
counter for iface 192.168.11.8 to [6 of 10]
Jan 11 02:00:33 node1 corosync[5712]: [TOTEM ] Incrementing problem
counter for seqid 19241091 iface 192.168.11.8 to [7 of 10]
Jan 11 02:00:33 node1 corosync[5712]: [TOTEM ] Incrementing problem
counter for seqid 19241093 iface 192.168.11.8 to [8 of 10]
Jan 11 02:00:34 node1 corosync[5712]: [TOTEM ] Incrementing problem
counter for seqid 19241095 iface 192.168.11.8 to [9 of 10]
Jan 11 02:00:34 node1 corosync[5712]: [TOTEM ] Decrementing problem
counter for iface 192.168.11.8 to [8 of 10]
Jan 11 02:00:35 node1 corosync[5712]: [TOTEM ] Incrementing problem
counter for seqid 19241097 iface 192.168.11.8 to [9 of 10]
Jan 11 02:00:36 node1 corosync[5712]: [TOTEM ] Incrementing problem
counter for seqid 19241099 iface 192.168.11.8 to [10 of 10]
Jan 11 02:00:36 node1 corosync[5712]: [TOTEM ] Marking seqid 19241099
ringid 1 interface 192.168.11.8 FAULTY - adminisrtative intervention
required.
Jan 11 02:00:39 node1 corosync[5712]: [TOTEM ] Incrementing problem
counter for seqid 19241103 iface 77.xxx.xxx.xxx to [1 of 10]
Jan 11 02:00:39 node1 corosync[5712]: [TOTEM ] Marking seqid 19241103
ringid 1 interface 192.168.11.8 FAULTY - adminisrtative intervention
required.
So this logging tells me corosync incremented the problem counter on
ring 1 to 10/10 and marked it FAULTY (probably correctly, due to some
kind of network issue), but it also incremented the problem counter on
ring 0 to 1/10, and after that all logging from corosync stops.
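For reference, the 10/10 threshold and the occasional decrements visible in the logging above should correspond to the rrp tuning options in the totem section, as documented in corosync.conf(5). I haven't set any of these myself, so the values below are my understanding of the defaults my cluster would be using, not something from my actual config:

```
totem {
        # ... existing options ...

        # How many problems (e.g. missed packets) a ring may accumulate
        # before being marked FAULTY; the "[N of 10]" in the logs counts
        # against this (default: 10)
        rrp_problem_count_threshold: 10

        # Interval (ms) after which an accumulated problem count is
        # decremented again, which would explain the "Decrementing
        # problem counter" lines in between (default: 2000)
        rrp_problem_count_timeout: 2000
}
```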
From /var/log/syslog on node2:
Jan 11 02:00:34 node2 corosync[11210]: [TOTEM ] Incrementing problem
counter for seqid 19241096 iface 77.xxx.xxx.xxx to [1 of 10]
Jan 11 02:00:36 node2 corosync[11210]: [TOTEM ] Incrementing problem
counter for seqid 19241100 iface 192.168.11.9 to [1 of 10]
Jan 11 02:00:36 node2 corosync[11210]: [TOTEM ] ring 0 active with no
faults
Jan 11 02:00:36 node2 corosync[11210]: [TOTEM ] ring 1 active with no
faults
Jan 11 02:00:39 node2 corosync[11210]: [TOTEM ] A processor failed,
forming new configuration.
Jan 11 02:00:39 node2 corosync[11210]: [TOTEM ] Incrementing problem
counter for seqid 19241102 iface 192.168.11.9 to [1 of 10]
Jan 11 02:00:41 node2 corosync[11210]: [TOTEM ] ring 1 active with no
faults
So node2 also sees some strange things happening, but doesn't mark
anything as FAULTY. The one strange line I see in this log is:
Jan 11 02:00:39 node2 corosync[11210]: [TOTEM ] A processor failed,
forming new configuration.
But I don't know exactly what to make of this.
If anyone can shed some light on what happened, and how I could
possibly prevent this in the future, I would be very grateful. Thanks
in advance.
Best regards,
Eelco Jepkema
corosync.conf:
# Please read the openais.conf.5 manual page
totem {
version: 2
# How long before declaring a token lost (ms)
token: 3000
# How many token retransmits before forming a new configuration
token_retransmits_before_loss_const: 10
# How long to wait for join messages in the membership protocol (ms)
join: 60
# How long to wait for consensus to be achieved before starting a new
# round of membership configuration (ms)
consensus: 1500
# Turn off the virtual synchrony filter
vsftype: none
# Number of messages that may be sent by one processor on receipt of
# the token
max_messages: 20
# Limit generated nodeids to 31-bits (positive signed integers)
clear_node_high_bit: yes
# Disable encryption
secauth: off
# How many threads to use for encryption/decryption
threads: 0
# Optionally assign a fixed node id (integer)
# nodeid: 1234
# This specifies the mode of redundant ring, which may be none, active,
# or passive.
rrp_mode: active
interface {
# The following values need to be set based on your environment
ringnumber: 0
bindnetaddr: 77.xxx.xxx.0
mcastaddr: 226.94.1.3
mcastport: 5405
}
interface {
# The following values need to be set based on your environment
ringnumber: 1
bindnetaddr: 192.168.11.0
mcastaddr: 239.255.1.2
mcastport: 5406
}
}
amf {
mode: disabled
}
service {
# Load the Pacemaker Cluster Resource Manager
ver: 0
name: pacemaker
}
aisexec {
user: root
group: root
}
logging {
fileline: off
to_stderr: yes
to_logfile: no
to_syslog: yes
syslog_facility: daemon
debug: off
timestamp: on
logger_subsys {
subsys: AMF
debug: off
tags: enter|leave|trace1|trace2|trace3|trace4|trace6
}
}
_______________________________________________
Openais mailing list
[email protected]
https://lists.linux-foundation.org/mailman/listinfo/openais