Hi all,
I came back from my weekend to find my 2-node cluster in a state I
didn't expect. I'm hoping someone has insight into what might have
happened.
Situation:
A 2-node cluster (node1 and node2) connected through two corosync rings,
one over the internal network 192.168.11.0 and one over the external
network 77.xxx.xxx.xxx, using rrp_mode: active. The full corosync.conf
is at the end of this mail for reference:
node1:~# corosync-cfgtool -s
Printing ring status.
Local node ID 508223053
RING ID 0
id = 77.xxx.xxx.xxx
status = Incrementing problem counter for seqid 19241103 iface
77.xxx.xxx.xxx to [1 of 10]
RING ID 1
id = 192.168.11.8
status = Marking seqid 19241103 ringid 1 interface 192.168.11.8 FAULTY
- adminisrtative intervention required.
On node2 however all still seems well:
node2:~# corosync-cfgtool -s
Printing ring status.
Local node ID 525000269
RING ID 0
id = 77.xxx.xxx.xxx
status = ring 0 active with no faults
RING ID 1
id = 192.168.11.9
status = ring 1 active with no faults
I then tried to re-enable the rings on node1:
node1:~# corosync-cfgtool -r
This 'hangs' and produces no output for a couple of minutes, after which
I killed it.
From /var/log/syslog on node1 at the time it all went wrong:
Jan 11 02:00:28 node1 corosync[5712]: [TOTEM ] Incrementing problem
counter for seqid 19241075 iface 192.168.11.8 to [1 of 10]
Jan 11 02:00:29 node1 corosync[5712]: [TOTEM ] Incrementing problem
counter for seqid 19241077 iface 192.168.11.8 to [2 of 10]
Jan 11 02:00:29 node1 corosync[5712]: [TOTEM ] Incrementing problem
counter for seqid 19241079 iface 192.168.11.8 to [3 of 10]
Jan 11 02:00:30 node1 corosync[5712]: [TOTEM ] Incrementing problem
counter for seqid 19241081 iface 192.168.11.8 to [4 of 10]
Jan 11 02:00:30 node1 corosync[5712]: [TOTEM ] Decrementing problem
counter for iface 192.168.11.8 to [3 of 10]
Jan 11 02:00:30 node1 corosync[5712]: [TOTEM ] Incrementing problem
counter for seqid 19241083 iface 192.168.11.8 to [4 of 10]
Jan 11 02:00:31 node1 corosync[5712]: [TOTEM ] Incrementing problem
counter for seqid 19241085 iface 192.168.11.8 to [5 of 10]
Jan 11 02:00:32 node1 corosync[5712]: [TOTEM ] Incrementing problem
counter for seqid 19241087 iface 192.168.11.8 to [6 of 10]
Jan 11 02:00:32 node1 corosync[5712]: [TOTEM ] Incrementing problem
counter for seqid 19241089 iface 192.168.11.8 to [7 of 10]
Jan 11 02:00:32 node1 corosync[5712]: [TOTEM ] Decrementing problem
counter for iface 192.168.11.8 to [6 of 10]
Jan 11 02:00:33 node1 corosync[5712]: [TOTEM ] Incrementing problem
counter for seqid 19241091 iface 192.168.11.8 to [7 of 10]
Jan 11 02:00:33 node1 corosync[5712]: [TOTEM ] Incrementing problem
counter for seqid 19241093 iface 192.168.11.8 to [8 of 10]
Jan 11 02:00:34 node1 corosync[5712]: [TOTEM ] Incrementing problem
counter for seqid 19241095 iface 192.168.11.8 to [9 of 10]
Jan 11 02:00:34 node1 corosync[5712]: [TOTEM ] Decrementing problem
counter for iface 192.168.11.8 to [8 of 10]
Jan 11 02:00:35 node1 corosync[5712]: [TOTEM ] Incrementing problem
counter for seqid 19241097 iface 192.168.11.8 to [9 of 10]
Jan 11 02:00:36 node1 corosync[5712]: [TOTEM ] Incrementing problem
counter for seqid 19241099 iface 192.168.11.8 to [10 of 10]
Jan 11 02:00:36 node1 corosync[5712]: [TOTEM ] Marking seqid 19241099
ringid 1 interface 192.168.11.8 FAULTY - adminisrtative intervention
required.
Jan 11 02:00:39 node1 corosync[5712]: [TOTEM ] Incrementing problem
counter for seqid 19241103 iface 77.xxx.xxx.xxx to [1 of 10]
Jan 11 02:00:39 node1 corosync[5712]: [TOTEM ] Marking seqid 19241103
ringid 1 interface 192.168.11.8 FAULTY - adminisrtative intervention
required.
So this logging tells me corosync incremented the problem counter on
ring 1 to 10/10 and marked it FAULTY (probably correctly, due to some
kind of network issue), but it also incremented the problem counter on
ring 0 to 1/10, and after that all logging from corosync stops.
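For reference, the 10/10 threshold and the occasional decrements visible in the logging above should correspond to the rrp tuning options in the totem section, as documented in corosync.conf(5). I haven't set any of these myself, so the values below are my understanding of the defaults my cluster would be using, not something from my actual config:

```
totem {
        # ... existing options ...

        # How many problems (e.g. missed packets) a ring may accumulate
        # before being marked FAULTY; the "[N of 10]" in the logs counts
        # against this (default: 10)
        rrp_problem_count_threshold: 10

        # Interval (ms) after which an accumulated problem count is
        # decremented again, which would explain the "Decrementing
        # problem counter" lines in between (default: 2000)
        rrp_problem_count_timeout: 2000
}
```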
From /var/log/syslog on node2:
Jan 11 02:00:34 node2 corosync[11210]: [TOTEM ] Incrementing problem
counter for seqid 19241096 iface 77.xxx.xxx.xxx to [1 of 10]
Jan 11 02:00:36 node2 corosync[11210]: [TOTEM ] Incrementing problem
counter for seqid 19241100 iface 192.168.11.9 to [1 of 10]
Jan 11 02:00:36 node2 corosync[11210]: [TOTEM ] ring 0 active with no
faults
Jan 11 02:00:36 node2 corosync[11210]: [TOTEM ] ring 1 active with no
faults
Jan 11 02:00:39 node2 corosync[11210]: [TOTEM ] A processor failed,
forming new configuration.
Jan 11 02:00:39 node2 corosync[11210]: [TOTEM ] Incrementing problem
counter for seqid 19241102 iface 192.168.11.9 to [1 of 10]
Jan 11 02:00:41 node2 corosync[11210]: [TOTEM ] ring 1 active with no
faults
So node2 also sees some strange things happening, but doesn't mark
anything as FAULTY. The one strange line I see in this log is:
Jan 11 02:00:39 node2 corosync[11210]: [TOTEM ] A processor failed,
forming new configuration.
But I don't know exactly what to make of this.
If anyone can shed some light on what happened, and how I could
possibly prevent this in the future, I would be very grateful. Thanks
in advance.
Best regards,
Eelco Jepkema
corosync.conf:
# Please read the openais.conf.5 manual page
totem {
version: 2
# How long before declaring a token lost (ms)
token: 3000
# How many token retransmits before forming a new configuration
token_retransmits_before_loss_const: 10
# How long to wait for join messages in the membership protocol (ms)
join: 60
# How long to wait for consensus to be achieved before starting a new
# round of membership configuration (ms)
consensus: 1500
# Turn off the virtual synchrony filter
vsftype: none
# Number of messages that may be sent by one processor on receipt of
# the token
max_messages: 20
# Limit generated nodeids to 31-bits (positive signed integers)
clear_node_high_bit: yes
# Disable encryption
secauth: off
# How many threads to use for encryption/decryption
threads: 0
# Optionally assign a fixed node id (integer)
# nodeid: 1234
# This specifies the mode of redundant ring, which may be none, active,
# or passive.
rrp_mode: active
interface {
# The following values need to be set based on your environment
ringnumber: 0
bindnetaddr: 77.xxx.xxx.0
mcastaddr: 226.94.1.3
mcastport: 5405
}
interface {
# The following values need to be set based on your environment
ringnumber: 1
bindnetaddr: 192.168.11.0
mcastaddr: 239.255.1.2
mcastport: 5406
}
}
amf {
mode: disabled
}
service {
# Load the Pacemaker Cluster Resource Manager
ver: 0
name: pacemaker
}
aisexec {
user: root
group: root
}
logging {
fileline: off
to_stderr: yes
to_logfile: no
to_syslog: yes
syslog_facility: daemon
debug: off
timestamp: on
logger_subsys {
subsys: AMF
debug: off
tags: enter|leave|trace1|trace2|trace3|trace4|trace6
}
}
_______________________________________________
Openais mailing list
[email protected]
https://lists.linux-foundation.org/mailman/listinfo/openais