Are you using network manager?
If corosync is still running, you can run killall -SEGV corosync and an
event trace will be generated. This event trace can be printed with
corosync-fplay. What I would be interested to see is whether one of
your network interfaces became bound to 127.0.0.1. This might show up
in your syslog as well.
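
To illustrate the loopback check, here is a minimal sketch. The sample
text below is hypothetical, standing in for `corosync-cfgtool -s` output
on an affected node; the grep pattern is the part that matters:

```shell
# Hypothetical sample standing in for `corosync-cfgtool -s` output
# on a node whose ring interface rebound to the loopback address.
sample_status='RING ID 1
	id	= 127.0.0.1
	status	= ring 1 active with no faults'

# Flag the ring if any interface id is 127.0.0.1.
if printf '%s\n' "$sample_status" | grep -q 'id[[:space:]]*=[[:space:]]*127\.0\.0\.1'; then
    echo "interface rebound to loopback"
fi
```

The same pattern can be grepped out of syslog if the rebind was logged
there.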
Regards
-steve
On Mon, 2010-01-11 at 10:32 +0100, Eelco Jepkema wrote:
> Hi all,
>
> I came back from my weekend to find my 2-node cluster in a state I
> didn't expect. I'm hoping someone will have insight into what might
> have happened.
>
>
> Situation:
> 2-node cluster with node1 and node2 connected through 2 corosync rings
> over internal network 192.168.11.0 and external network 77.xxx.xxx.xxx
> using rrp_mode active. The full corosync.conf is included at the end
> of this mail for reference.
>
>
> node1:~# corosync-cfgtool -s
> Printing ring status.
> Local node ID 508223053
> RING ID 0
> id = 77.xxx.xxx.xxx
> status = Incrementing problem counter for seqid 19241103 iface
> 77.xxx.xxx.xxx to [1 of 10]
> RING ID 1
> id = 192.168.11.8
> status = Marking seqid 19241103 ringid 1 interface 192.168.11.8 FAULTY
> - adminisrtative intervention required.
>
>
>
> On node2 however all still seems well:
>
> node2:~# corosync-cfgtool -s
> Printing ring status.
> Local node ID 525000269
> RING ID 0
> id = 77.xxx.xxx.xxx
> status = ring 0 active with no faults
> RING ID 1
> id = 192.168.11.9
> status = ring 1 active with no faults
>
>
>
> Now I've tried to re-enable the rings on node1
>
> node1:~# corosync-cfgtool -r
>
> It 'hangs' and does nothing for a couple of minutes (after which I
> killed it).
>
>
>
> From /var/log/syslog on node1 at the time everything failed:
>
> Jan 11 02:00:28 node1 corosync[5712]: [TOTEM ] Incrementing problem
> counter for seqid 19241075 iface 192.168.11.8 to [1 of 10]
> Jan 11 02:00:29 node1 corosync[5712]: [TOTEM ] Incrementing problem
> counter for seqid 19241077 iface 192.168.11.8 to [2 of 10]
> Jan 11 02:00:29 node1 corosync[5712]: [TOTEM ] Incrementing problem
> counter for seqid 19241079 iface 192.168.11.8 to [3 of 10]
> Jan 11 02:00:30 node1 corosync[5712]: [TOTEM ] Incrementing problem
> counter for seqid 19241081 iface 192.168.11.8 to [4 of 10]
> Jan 11 02:00:30 node1 corosync[5712]: [TOTEM ] Decrementing problem
> counter for iface 192.168.11.8 to [3 of 10]
> Jan 11 02:00:30 node1 corosync[5712]: [TOTEM ] Incrementing problem
> counter for seqid 19241083 iface 192.168.11.8 to [4 of 10]
> Jan 11 02:00:31 node1 corosync[5712]: [TOTEM ] Incrementing problem
> counter for seqid 19241085 iface 192.168.11.8 to [5 of 10]
> Jan 11 02:00:32 node1 corosync[5712]: [TOTEM ] Incrementing problem
> counter for seqid 19241087 iface 192.168.11.8 to [6 of 10]
> Jan 11 02:00:32 node1 corosync[5712]: [TOTEM ] Incrementing problem
> counter for seqid 19241089 iface 192.168.11.8 to [7 of 10]
> Jan 11 02:00:32 node1 corosync[5712]: [TOTEM ] Decrementing problem
> counter for iface 192.168.11.8 to [6 of 10]
> Jan 11 02:00:33 node1 corosync[5712]: [TOTEM ] Incrementing problem
> counter for seqid 19241091 iface 192.168.11.8 to [7 of 10]
> Jan 11 02:00:33 node1 corosync[5712]: [TOTEM ] Incrementing problem
> counter for seqid 19241093 iface 192.168.11.8 to [8 of 10]
> Jan 11 02:00:34 node1 corosync[5712]: [TOTEM ] Incrementing problem
> counter for seqid 19241095 iface 192.168.11.8 to [9 of 10]
> Jan 11 02:00:34 node1 corosync[5712]: [TOTEM ] Decrementing problem
> counter for iface 192.168.11.8 to [8 of 10]
> Jan 11 02:00:35 node1 corosync[5712]: [TOTEM ] Incrementing problem
> counter for seqid 19241097 iface 192.168.11.8 to [9 of 10]
> Jan 11 02:00:36 node1 corosync[5712]: [TOTEM ] Incrementing problem
> counter for seqid 19241099 iface 192.168.11.8 to [10 of 10]
> Jan 11 02:00:36 node1 corosync[5712]: [TOTEM ] Marking seqid 19241099
> ringid 1 interface 192.168.11.8 FAULTY - adminisrtative intervention
> required.
> Jan 11 02:00:39 node1 corosync[5712]: [TOTEM ] Incrementing problem
> counter for seqid 19241103 iface 77.xxx.xxx.xxx to [1 of 10]
> Jan 11 02:00:39 node1 corosync[5712]: [TOTEM ] Marking seqid 19241103
> ringid 1 interface 192.168.11.8 FAULTY - adminisrtative intervention
> required.
>
>
> So this logging tells me corosync incremented the problem counter on
> ring 1 to 10/10 and marked it FAULTY (probably correctly, due to some
> kind of network issue), but it also incremented the problem counter on
> ring 0 to 1/10, after which all logging from corosync stops.
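
The up/down counting in that log matches the RRP problem-counter scheme:
each problem seen on a ring increments the counter, quiet periods
decrement it, and the ring is marked FAULTY when the counter reaches the
threshold (10 here). A toy sketch of that bookkeeping; the threshold and
decrement behaviour are assumptions based on the documented defaults,
not read out of corosync's source:

```shell
# Toy model of the RRP problem counter: increments on observed
# problems, decrements during quiet periods, FAULTY at the threshold.
threshold=10
counter=0
# This event sequence mirrors the 02:00:28-02:00:30 log lines above.
for event in inc inc inc inc dec inc; do
    case $event in
        inc) counter=$((counter + 1)) ;;
        dec) [ "$counter" -gt 0 ] && counter=$((counter - 1)) ;;
    esac
done
echo "counter=$counter of $threshold"   # -> counter=4 of 10
# The ring would be marked FAULTY once counter == threshold.
```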
>
>
>
> From /var/log/syslog on node2:
> Jan 11 02:00:34 node2 corosync[11210]: [TOTEM ] Incrementing problem
> counter for seqid 19241096 iface 77.xxx.xxx.xxx to [1 of 10]
> Jan 11 02:00:36 node2 corosync[11210]: [TOTEM ] Incrementing problem
> counter for seqid 19241100 iface 192.168.11.9 to [1 of 10]
> Jan 11 02:00:36 node2 corosync[11210]: [TOTEM ] ring 0 active with no
> faults
> Jan 11 02:00:36 node2 corosync[11210]: [TOTEM ] ring 1 active with no
> faults
> Jan 11 02:00:39 node2 corosync[11210]: [TOTEM ] A processor failed,
> forming new configuration.
> Jan 11 02:00:39 node2 corosync[11210]: [TOTEM ] Incrementing problem
> counter for seqid 19241102 iface 192.168.11.9 to [1 of 10]
> Jan 11 02:00:41 node2 corosync[11210]: [TOTEM ] ring 1 active with no
> faults
>
>
>
> So node2 also sees some strange things happening but doesn't mark
> anything as FAULTY. The one strange line I see in this log is:
>
> Jan 11 02:00:39 node2 corosync[11210]: [TOTEM ] A processor failed,
> forming new configuration.
>
> But I don't know exactly what to make of this.
>
> If anyone can shed some light on what happened and how I could
> prevent this in the future, I would be very grateful. Thanks in
> advance.
>
> Best regards,
> Eelco Jepkema
>
>
>
>
> corosync.conf:
>
> # Please read the openais.conf.5 manual page
>
> totem {
> version: 2
>
> # How long before declaring a token lost (ms)
> token: 3000
>
> # How many token retransmits before forming a new configuration
> token_retransmits_before_loss_const: 10
>
> # How long to wait for join messages in the membership protocol (ms)
> join: 60
>
> # How long to wait for consensus to be achieved before starting a new
> round of membership configuration (ms)
> consensus: 1500
>
> # Turn off the virtual synchrony filter
> vsftype: none
>
> # Number of messages that may be sent by one processor on receipt of
> the token
> max_messages: 20
>
> # Limit generated nodeids to 31-bits (positive signed integers)
> clear_node_high_bit: yes
>
> # Disable encryption
> secauth: off
>
> # How many threads to use for encryption/decryption
> threads: 0
>
> # Optionally assign a fixed node id (integer)
> # nodeid: 1234
>
> # This specifies the mode of redundant ring, which may be none, active,
> or passive.
> rrp_mode: active
>
> interface {
> # The following values need to be set based on your environment
> ringnumber: 0
> bindnetaddr: 77.xxx.xxx.0
> mcastaddr: 226.94.1.3
> mcastport: 5405
> }
>
> interface {
> # The following values need to be set based on your environment
> ringnumber: 1
> bindnetaddr: 192.168.11.0
> mcastaddr: 239.255.1.2
> mcastport: 5406
> }
> }
>
> amf {
> mode: disabled
> }
>
> service {
> # Load the Pacemaker Cluster Resource Manager
> ver: 0
> name: pacemaker
> }
>
> aisexec {
> user: root
> group: root
> }
>
> logging {
> fileline: off
> to_stderr: yes
> to_logfile: no
> to_syslog: yes
> syslog_facility: daemon
> debug: off
> timestamp: on
> logger_subsys {
> subsys: AMF
> debug: off
> tags: enter|leave|trace1|trace2|trace3|trace4|trace6
> }
> }
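
For what it's worth, the totem timing values in this config interact:
when a retransmit timeout is not set explicitly, corosync derives one
from the token timeout and the retransmit count, roughly
token / (retransmits + 4) if I read the totem configuration code
correctly; treat that formula as an assumption. A quick
back-of-the-envelope with the values above:

```shell
# Back-of-the-envelope totem timing from the config above.
# The derivation formula is an assumption about corosync's defaults.
token=3000                                   # token: 3000 (ms)
retransmits=10                               # token_retransmits_before_loss_const: 10
interval=$((token / (retransmits + 4)))      # assumed derived retransmit interval
echo "retransmit interval ~ ${interval} ms"  # ~214 ms with these values
```

So with these settings a lost token should be retransmitted on the order
of every 200 ms before the 3-second token timeout expires.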
> _______________________________________________
> Openais mailing list
> [email protected]
> https://lists.linux-foundation.org/mailman/listinfo/openais