Are you using network manager?
If corosync is still running, you can run killall -SEGV corosync and an
event trace will be generated. This event trace can be printed with
corosync-fplay. What I would be interested to see is whether one of
your network interfaces became bound to 127.0.0.1. This might show up
in your syslog as well.
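
To illustrate the loopback check, here is a minimal sketch. The sample
text below is hypothetical, standing in for `corosync-cfgtool -s` output
on an affected node; the grep pattern is the part that matters:

```shell
# Hypothetical sample standing in for `corosync-cfgtool -s` output
# on a node whose ring interface rebound to the loopback address.
sample_status='RING ID 1
	id	= 127.0.0.1
	status	= ring 1 active with no faults'

# Flag the ring if any interface id is 127.0.0.1.
if printf '%s\n' "$sample_status" | grep -q 'id[[:space:]]*=[[:space:]]*127\.0\.0\.1'; then
    echo "interface rebound to loopback"
fi
```

The same pattern can be grepped out of syslog if the rebind was logged
there.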
Regards
-steve
On Mon, 2010-01-11 at 10:32 +0100, Eelco Jepkema wrote:
> Hi all,
>
> I came back from my weekend to find my 2-node cluster in a state I
> didn't expect. I'm hoping someone will have insight into what might
> have happened.
>
>
> Situation:
> 2-node cluster with node1 and node2 connected through 2 corosync rings
> over internal network 192.168.11.0 and external network 77.xxx.xxx.xxx
> using rrp_mode active. The full corosync.conf is included at the end
> of this mail for reference.
>
>
> node1:~# corosync-cfgtool -s
> Printing ring status.
> Local node ID 508223053
> RING ID 0
> id = 77.xxx.xxx.xxx
> status = Incrementing problem counter for seqid 19241103 iface
> 77.xxx.xxx.xxx to [1 of 10]
> RING ID 1
> id = 192.168.11.8
> status = Marking seqid 19241103 ringid 1 interface 192.168.11.8 FAULTY
> - adminisrtative intervention required.
>
>
>
> On node2 however all still seems well:
>
> node2:~# corosync-cfgtool -s
> Printing ring status.
> Local node ID 525000269
> RING ID 0
> id = 77.xxx.xxx.xxx
> status = ring 0 active with no faults
> RING ID 1
> id = 192.168.11.9
> status = ring 1 active with no faults
>
>
>
> Now I've tried to re-enable the rings on node1
>
> node1:~# corosync-cfgtool -r
>
> It 'hangs' and does nothing for a couple of minutes (after which I
> killed it).
>
>
>
> From /var/log/syslog on node1 at the time everything failed:
>
> Jan 11 02:00:28 node1 corosync[5712]: [TOTEM ] Incrementing problem
> counter for seqid 19241075 iface 192.168.11.8 to [1 of 10]
> Jan 11 02:00:29 node1 corosync[5712]: [TOTEM ] Incrementing problem
> counter for seqid 19241077 iface 192.168.11.8 to [2 of 10]
> Jan 11 02:00:29 node1 corosync[5712]: [TOTEM ] Incrementing problem
> counter for seqid 19241079 iface 192.168.11.8 to [3 of 10]
> Jan 11 02:00:30 node1 corosync[5712]: [TOTEM ] Incrementing problem
> counter for seqid 19241081 iface 192.168.11.8 to [4 of 10]
> Jan 11 02:00:30 node1 corosync[5712]: [TOTEM ] Decrementing problem
> counter for iface 192.168.11.8 to [3 of 10]
> Jan 11 02:00:30 node1 corosync[5712]: [TOTEM ] Incrementing problem
> counter for seqid 19241083 iface 192.168.11.8 to [4 of 10]
> Jan 11 02:00:31 node1 corosync[5712]: [TOTEM ] Incrementing problem
> counter for seqid 19241085 iface 192.168.11.8 to [5 of 10]
> Jan 11 02:00:32 node1 corosync[5712]: [TOTEM ] Incrementing problem
> counter for seqid 19241087 iface 192.168.11.8 to [6 of 10]
> Jan 11 02:00:32 node1 corosync[5712]: [TOTEM ] Incrementing problem
> counter for seqid 19241089 iface 192.168.11.8 to [7 of 10]
> Jan 11 02:00:32 node1 corosync[5712]: [TOTEM ] Decrementing problem
> counter for iface 192.168.11.8 to [6 of 10]
> Jan 11 02:00:33 node1 corosync[5712]: [TOTEM ] Incrementing problem
> counter for seqid 19241091 iface 192.168.11.8 to [7 of 10]
> Jan 11 02:00:33 node1 corosync[5712]: [TOTEM ] Incrementing problem
> counter for seqid 19241093 iface 192.168.11.8 to [8 of 10]
> Jan 11 02:00:34 node1 corosync[5712]: [TOTEM ] Incrementing problem
> counter for seqid 19241095 iface 192.168.11.8 to [9 of 10]
> Jan 11 02:00:34 node1 corosync[5712]: [TOTEM ] Decrementing problem
> counter for iface 192.168.11.8 to [8 of 10]
> Jan 11 02:00:35 node1 corosync[5712]: [TOTEM ] Incrementing problem
> counter for seqid 19241097 iface 192.168.11.8 to [9 of 10]
> Jan 11 02:00:36 node1 corosync[5712]: [TOTEM ] Incrementing problem
> counter for seqid 19241099 iface 192.168.11.8 to [10 of 10]
> Jan 11 02:00:36 node1 corosync[5712]: [TOTEM ] Marking seqid 19241099
> ringid 1 interface 192.168.11.8 FAULTY - adminisrtative intervention
> required.
> Jan 11 02:00:39 node1 corosync[5712]: [TOTEM ] Incrementing problem
> counter for seqid 19241103 iface 77.xxx.xxx.xxx to [1 of 10]
> Jan 11 02:00:39 node1 corosync[5712]: [TOTEM ] Marking seqid 19241103
> ringid 1 interface 192.168.11.8 FAULTY - adminisrtative intervention
> required.
>
>
> So this logging tells me corosync incremented the problem counter on
> ring 1 to 10/10 and marked it FAULTY (probably correctly, due to some
> kind of network issue), but it also incremented the problem counter on
> ring 0 to 1/10, after which all logging from corosync stops.
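
The up/down counting in that log matches the RRP problem-counter scheme:
each problem seen on a ring increments the counter, quiet periods
decrement it, and the ring is marked FAULTY when the counter reaches the
threshold (10 here). A toy sketch of that bookkeeping; the threshold and
decrement behaviour are assumptions based on the documented defaults,
not read out of corosync's source:

```shell
# Toy model of the RRP problem counter: increments on observed
# problems, decrements during quiet periods, FAULTY at the threshold.
threshold=10
counter=0
# This event sequence mirrors the 02:00:28-02:00:30 log lines above.
for event in inc inc inc inc dec inc; do
    case $event in
        inc) counter=$((counter + 1)) ;;
        dec) [ "$counter" -gt 0 ] && counter=$((counter - 1)) ;;
    esac
done
echo "counter=$counter of $threshold"   # -> counter=4 of 10
# The ring would be marked FAULTY once counter == threshold.
```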
>
>
>
> From /var/log/syslog on node2:
> Jan 11 02:00:34 node2 corosync[11210]: [TOTEM ] Incrementing problem
> counter for seqid 19241096 iface 77.xxx.xxx.xxx to [1 of 10]
> Jan 11 02:00:36 node2 corosync[11210]: [TOTEM ] Incrementing problem
> counter for seqid 19241100 iface 192.168.11.9 to [1 of 10]
> Jan 11 02:00:36 node2 corosync[11210]: [TOTEM ] ring 0 active with no
> faults
> Jan 11 02:00:36 node2 corosync[11210]: [TOTEM ] ring 1 active with no
> faults
> Jan 11 02:00:39 node2 corosync[11210]: [TOTEM ] A processor failed,
> forming new configuration.
> Jan 11 02:00:39 node2 corosync[11210]: [TOTEM ] Incrementing problem
> counter for seqid 19241102 iface 192.168.11.9 to [1 of 10]
> Jan 11 02:00:41 node2 corosync[11210]: [TOTEM ] ring 1 active with no
> faults
>
>
>
> So node2 also sees some strange things happening but doesn't mark
> anything as FAULTY. The one strange line I see in this log is:
>
> Jan 11 02:00:39 node2 corosync[11210]: [TOTEM ] A processor failed,
> forming new configuration.
>
> But I don't know exactly what to make of this.
>
> If anyone can shed some light on what happened and how I could
> prevent this in the future, I would be very grateful. Thanks in
> advance.
>
> Best regards,
> Eelco Jepkema
>
>
>
>
> corosync.conf:
>
> # Please read the openais.conf.5 manual page
>
> totem {
> version: 2
>
> # How long before declaring a token lost (ms)
> token: 3000
>
> # How many token retransmits before forming a new configuration
> token_retransmits_before_loss_const: 10
>
> # How long to wait for join messages in the membership protocol (ms)
> join: 60
>
> # How long to wait for consensus to be achieved before starting a new
> round of membership configuration (ms)
> consensus: 1500
>
> # Turn off the virtual synchrony filter
> vsftype: none
>
> # Number of messages that may be sent by one processor on receipt of
> the token
> max_messages: 20
>
> # Limit generated nodeids to 31-bits (positive signed integers)
> clear_node_high_bit: yes
>
> # Disable encryption
> secauth: off
>
> # How many threads to use for encryption/decryption
> threads: 0
>
> # Optionally assign a fixed node id (integer)
> # nodeid: 1234
>
> # This specifies the mode of redundant ring, which may be none, active,
> or passive.
> rrp_mode: active
>
> interface {
> # The following values need to be set based on your environment
> ringnumber: 0
> bindnetaddr: 77.xxx.xxx.0
> mcastaddr: 226.94.1.3
> mcastport: 5405
> }
>
> interface {
> # The following values need to be set based on your environment
> ringnumber: 1
> bindnetaddr: 192.168.11.0
> mcastaddr: 239.255.1.2
> mcastport: 5406
> }
> }
>
> amf {
> mode: disabled
> }
>
> service {
> # Load the Pacemaker Cluster Resource Manager
> ver: 0
> name: pacemaker
> }
>
> aisexec {
> user: root
> group: root
> }
>
> logging {
> fileline: off
> to_stderr: yes
> to_logfile: no
> to_syslog: yes
> syslog_facility: daemon
> debug: off
> timestamp: on
> logger_subsys {
> subsys: AMF
> debug: off
> tags: enter|leave|trace1|trace2|trace3|trace4|trace6
> }
> }
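
For what it's worth, the totem timing values in this config interact:
when a retransmit timeout is not set explicitly, corosync derives one
from the token timeout and the retransmit count, roughly
token / (retransmits + 4) if I read the totem configuration code
correctly; treat that formula as an assumption. A quick
back-of-the-envelope with the values above:

```shell
# Back-of-the-envelope totem timing from the config above.
# The derivation formula is an assumption about corosync's defaults.
token=3000                                   # token: 3000 (ms)
retransmits=10                               # token_retransmits_before_loss_const: 10
interval=$((token / (retransmits + 4)))      # assumed derived retransmit interval
echo "retransmit interval ~ ${interval} ms"  # ~214 ms with these values
```

So with these settings a lost token should be retransmitted on the order
of every 200 ms before the 3-second token timeout expires.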
> _______________________________________________
> Openais mailing list
> [email protected]
> https://lists.linux-foundation.org/mailman/listinfo/openais