Hi! I think I understand corosync/pacemaker a bit, but occasionally I still wonder about what I see: today one of the nodes rebooted (I'm still investigating why), and I examined the syslog.
Here's an interesting example:

Dec 20 10:34:57 o1 corosync[12690]: [CLM ] New Configuration:
Dec 20 10:34:57 o1 corosync[12690]: [CLM ] r(0) ip(172.20.3.31) r(1) ip(192.168.0.61)
Dec 20 10:34:57 o1 corosync[12690]: [CLM ] r(0) ip(172.20.3.35) r(1) ip(192.168.0.65)
Dec 20 10:34:57 o1 corosync[12690]: [CLM ] Members Left:
Dec 20 10:34:57 o1 corosync[12690]: [CLM ] r(0) ip(172.20.3.33) r(1) ip(192.168.0.63)
Dec 20 10:34:57 o1 corosync[12690]: [CLM ] r(0) ip(172.20.3.34) r(1) ip(192.168.0.64)
Dec 20 10:34:57 o1 corosync[12690]: [CLM ] Members Joined:
Dec 20 10:34:57 o1 corosync[12690]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 2496: memb=2, new=0, lost=2
Dec 20 10:34:57 o1 corosync[12690]: [pcmk ] info: pcmk_peer_update: memb: o1 520295596
Dec 20 10:34:57 o1 corosync[12690]: [pcmk ] info: pcmk_peer_update: memb: o5 587404460
Dec 20 10:34:57 o1 corosync[12690]: [pcmk ] info: pcmk_peer_update: lost: o3 553850028
Dec 20 10:34:57 o1 corosync[12690]: [pcmk ] info: pcmk_peer_update: lost: o4 570627244
Dec 20 10:34:57 o1 corosync[12690]: [CLM ] CLM CONFIGURATION CHANGE
Dec 20 10:34:57 o1 corosync[12690]: [CLM ] New Configuration:
Dec 20 10:34:57 o1 corosync[12690]: [CLM ] r(0) ip(172.20.3.31) r(1) ip(192.168.0.61)
Dec 20 10:34:57 o1 corosync[12690]: [CLM ] r(0) ip(172.20.3.33) r(1) ip(192.168.0.63)
Dec 20 10:34:57 o1 corosync[12690]: [CLM ] r(0) ip(172.20.3.34) r(1) ip(192.168.0.64)
Dec 20 10:34:57 o1 corosync[12690]: [CLM ] r(0) ip(172.20.3.35) r(1) ip(192.168.0.65)
Dec 20 10:34:57 o1 corosync[12690]: [CLM ] Members Left:
Dec 20 10:34:57 o1 corosync[12690]: [CLM ] Members Joined:
Dec 20 10:34:57 o1 corosync[12690]: [CLM ] r(0) ip(172.20.3.33) r(1) ip(192.168.0.63)
Dec 20 10:34:57 o1 corosync[12690]: [CLM ] r(0) ip(172.20.3.34) r(1) ip(192.168.0.64)
Dec 20 10:34:57 o1 corosync[12690]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 2496: memb=4, new=2, lost=0

Within one second two nodes left the cluster/ring, then joined the
cluster/ring. Shouldn't the ring number increase on every change? In the very same second, three nodes left the cluster and joined again:

Dec 20 10:34:57 o1 corosync[12690]: [CLM ] CLM CONFIGURATION CHANGE
Dec 20 10:34:57 o1 corosync[12690]: [CLM ] New Configuration:
Dec 20 10:34:57 o1 corosync[12690]: [CLM ] r(0) ip(172.20.3.31) r(1) ip(192.168.0.61)
Dec 20 10:34:57 o1 corosync[12690]: [CLM ] Members Left:
Dec 20 10:34:57 o1 corosync[12690]: [CLM ] r(0) ip(172.20.3.33) r(1) ip(192.168.0.63)
Dec 20 10:34:57 o1 corosync[12690]: [CLM ] r(0) ip(172.20.3.34) r(1) ip(192.168.0.64)
Dec 20 10:34:57 o1 corosync[12690]: [CLM ] r(0) ip(172.20.3.35) r(1) ip(192.168.0.65)
Dec 20 10:34:57 o1 corosync[12690]: [CLM ] Members Joined:
Dec 20 10:34:57 o1 corosync[12690]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 2504: memb=1, new=0, lost=3
Dec 20 10:34:57 o1 corosync[12690]: [pcmk ] info: pcmk_peer_update: memb: o1 520295596
Dec 20 10:34:57 o1 corosync[12690]: [pcmk ] info: pcmk_peer_update: lost: o3 553850028
Dec 20 10:34:57 o1 corosync[12690]: [pcmk ] info: pcmk_peer_update: lost: o4 570627244
Dec 20 10:34:57 o1 corosync[12690]: [pcmk ] info: pcmk_peer_update: lost: o5 587404460
Dec 20 10:34:57 o1 corosync[12690]: [CLM ] CLM CONFIGURATION CHANGE
Dec 20 10:34:57 o1 corosync[12690]: [CLM ] New Configuration:
Dec 20 10:34:57 o1 corosync[12690]: [CLM ] r(0) ip(172.20.3.31) r(1) ip(192.168.0.61)
Dec 20 10:34:57 o1 corosync[12690]: [CLM ] r(0) ip(172.20.3.33) r(1) ip(192.168.0.63)
Dec 20 10:34:57 o1 corosync[12690]: [CLM ] r(0) ip(172.20.3.34) r(1) ip(192.168.0.64)
Dec 20 10:34:57 o1 corosync[12690]: [CLM ] r(0) ip(172.20.3.35) r(1) ip(192.168.0.65)
Dec 20 10:34:57 o1 corosync[12690]: [CLM ] Members Left:
Dec 20 10:34:57 o1 corosync[12690]: [CLM ] Members Joined:
Dec 20 10:34:57 o1 corosync[12690]: [CLM ] r(0) ip(172.20.3.33) r(1) ip(192.168.0.63)
Dec 20 10:34:57 o1 corosync[12690]: [CLM ] r(0) ip(172.20.3.34) r(1) ip(192.168.0.64)
Dec 20 10:34:57 o1 corosync[12690]: [CLM ] r(0) ip(172.20.3.35) r(1) ip(192.168.0.65)
Dec 20 10:34:57 o1 corosync[12690]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 2504: memb=4, new=3, lost=0

A moment later I saw this:

Dec 20 10:34:57 rkdvmso1 kernel: [185601.044523] kernel BUG at /usr/src/packages/BUILD/ocfs2-1.6/xen/ocfs2/heartbeat.c:67!
[...]
Dec 20 10:34:58 o1 kernel: [185601.044674] Supported: Yes
Dec 20 10:34:58 o1 kernel: [185601.044678]
Dec 20 10:34:59 o1 kernel: [185601.044682] Pid: 14239, comm: ocfs2_controld. Not tainted 3.0.42-0.7-xen #1 Sun Microsystems Sun Fire X4100 Server/Sun Fire X4100 Server
Dec 20 10:34:59 o1 kernel: [185601.044692] RIP: e030:[<ffffffffa06818f5>] [<ffffffffa06818f5>] ocfs2_do_node_down+0x65/0x70 [ocfs2]
Dec 20 10:35:00 o1 kernel: [185601.044745] RSP: e02b:ffff880032331e18 EFLAGS: 00010246
Dec 20 10:35:00 o1 kernel: [185601.044749] RAX: 0000000000000000 RBX: ffff880032960da0 RCX: 000000000000001f
Dec 20 10:35:00 o1 kernel: [185601.044753] RDX: 0000000000000000 RSI: ffff8800314b5000 RDI: 000000001f0314ac
[???]

(The bug messages were interleaved with cluster messages (cLVM and OCFS2 are quite chatty). Before completion, SBD kicked in:)

Dec 20 10:34:59 o1 sbd: [12635]: info: Received command off from o3 on disk /dev/disk/by-id/dm-name-Shared-E1_part1
Dec 20 10:34:59 o1 sbd: [12636]: info: Received command off from o3 on disk /dev/disk/by-id/dm-name-Shared-E2_part1
Dec 20 10:34:59 o1 cluster-dlm: check_fencing_done: 0192F256F87A4E5CA69BCF2BDF7659FA check_fencing 520295596 wait add 1355810586 fail 1355996098 last 0
Dec 20 10:34:59 o1 sbd: [12635]: info: sysrq-trigger: o
Dec 20 10:34:59 o1 sbd: [12636]: info: sysrq-trigger: o
Dec 20 10:34:59 o1 sbd: [12635]: EMERG: Rebooting system. Reason: sbd is self-fencing (power-off)
Dec 20 10:34:59 o1 sbd: [12636]: EMERG: Rebooting system.
Reason: sbd is self-fencing (power-off)

The following reboot also replaced the kernel 3.0.42-0.7-xen with 3.0.51-0.7.9-xen (a reboot was intended anyway, but manually ;-)

(The reboot also fenced the DC, and another DC was elected.)

After a short while I saw messages like these:

Dec 20 10:42:39 o1 cmirrord[16392]: [35cRf7c2] Retry #2 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
Dec 20 10:42:39 o1 cmirrord[16392]: [35cRf7c2] Retry #3 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
Dec 20 10:42:39 o1 cmirrord[16392]: [35cRf7c2] Retry #4 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
Dec 20 10:42:39 o1 cmirrord[16392]: [35cRf7c2] Retry #5 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
Dec 20 10:42:39 o1 cmirrord[16392]: [35cRf7c2] Retry #6 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
Dec 20 10:42:39 o1 cmirrord[16392]: [35cRf7c2] Retry #7 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
Dec 20 10:42:39 o1 cmirrord[16392]: [35cRf7c2] Retry #8 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
Dec 20 10:42:39 o1 cmirrord[16392]: [35cRf7c2] Retry #9 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
Dec 20 10:42:39 o1 cmirrord[16392]: [35cRf7c2] Retry #10 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
Dec 20 10:42:39 o1 cmirrord[16392]: [35cRf7c2] Retry #20 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
Dec 20 10:42:39 o1 cmirrord[16392]: [35cRf7c2] Retry #30 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
Dec 20 10:42:39 o1 cmirrord[16392]: [35cRf7c2] Retry #40 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
Dec 20 10:42:39 o1 cmirrord[16392]: [35cRf7c2] Retry #50 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
Dec 20 10:42:39 o1 cmirrord[16392]: [35cRf7c2] Retry #60 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
Dec 20 10:42:39 o1 cmirrord[16392]: [35cRf7c2] Retry #70 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
Dec 20 10:42:39 o1 cmirrord[16392]: [35cRf7c2] Retry #80 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
Dec 20 10:42:39 o1 cmirrord[16392]: [35cRf7c2] Retry #90 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
Dec 20 10:42:39 o1 cmirrord[16392]: [35cRf7c2] Retry #100 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
Dec 20 10:42:39 o1 cmirrord[16392]: [35cRf7c2] Retry #200 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
Dec 20 10:42:39 o1 cmirrord[16392]: [35cRf7c2] Retry #300 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
Dec 20 10:42:39 o1 cmirrord[16392]: [35cRf7c2] Retry #400 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
Dec 20 10:42:39 o1 cmirrord[16392]: [35cRf7c2] Retry #500 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
Dec 20 10:42:39 o1 cmirrord[16392]: [35cRf7c2] Retry #600 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
Dec 20 10:42:39 o1 cmirrord[16392]: [35cRf7c2] Retry #700 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
Dec 20 10:42:40 o1 cmirrord[16392]: [35cRf7c2] Retry #800 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
Dec 20 10:42:40 o1 cmirrord[16392]: [35cRf7c2] Retry #900 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
Dec 20 10:42:40 o1 cmirrord[16392]: [35cRf7c2] Retry #1000 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN - OpenAIS not handling the load?
Dec 20 10:42:41 o1 cmirrord[16392]: [35cRf7c2] Retry #2000 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN - OpenAIS not handling the load?
Dec 20 10:42:42 o1 cmirrord[16392]: [35cRf7c2] Retry #3000 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN - OpenAIS not handling the load?
Dec 20 10:42:43 o1 cmirrord[16392]: [35cRf7c2] Retry #4000 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN - OpenAIS not handling the load?
Dec 20 10:42:44 o1 cluster-dlm: update_cluster: Processing membership 2536
Dec 20 10:42:44 o1 corosync[12829]: [CLM ] CLM CONFIGURATION CHANGE
Dec 20 10:42:44 o1 corosync[12829]: [CLM ] New Configuration:
Dec 20 10:42:44 o1 cluster-dlm: dlm_process_node: Skipped active node 520295596: born-on=2520, last-seen=2536, this-event=2536, last-event=2524
Dec 20 10:42:44 o1 corosync[12829]: [CLM ] r(0) ip(172.20.3.31) r(1) ip(192.168.0.61)
Dec 20 10:42:44 o1 corosync[12829]: [CLM ] r(0) ip(172.20.3.32) r(1) ip(192.168.0.62)
Dec 20 10:42:44 o1 cluster-dlm: dlm_process_node: Skipped active node 537072812: born-on=2512, last-seen=2536, this-event=2536, last-event=2524
Dec 20 10:42:44 o1 corosync[12829]: [CLM ] r(0) ip(172.20.3.34) r(1) ip(192.168.0.64)
Dec 20 10:42:44 o1 corosync[12829]: [CLM ] r(0) ip(172.20.3.35) r(1) ip(192.168.0.65)
Dec 20 10:42:44 o1 corosync[12829]: [CLM ] Members Left:
Dec 20 10:42:44 o1 corosync[12829]: [CLM ] Members Joined:
Dec 20 10:42:44 o1 corosync[12829]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 2536: memb=4, new=0, lost=0

At some later time I saw the start of what I call a "retransmit pyramid":

Dec 20 11:07:38 o1 corosync[12829]: [TOTEM ] Retransmit List: d2d
Dec 20 11:07:38 o1 corosync[12829]: [TOTEM ] Retransmit List: d2f
Dec 20 11:07:38 o1 corosync[12829]: [TOTEM ] Retransmit List: d31
Dec 20 11:07:38 o1 corosync[12829]: [TOTEM ] Retransmit List: d33
Dec 20 11:07:38 o1 corosync[12829]: [TOTEM ] Retransmit List: d35
Dec 20 11:07:38 o1 corosync[12829]: [TOTEM ] Retransmit List: d37
Dec 20 11:07:38 o1 corosync[12829]: [TOTEM ] Retransmit List: d39
[...]
Dec 20 11:07:38 o1 corosync[12829]: [TOTEM ] Retransmit List: d5f Dec 20 11:07:38 o1 corosync[12829]: [TOTEM ] Retransmit List: d60 Dec 20 11:07:38 o1 corosync[12829]: [TOTEM ] Retransmit List: d61 Dec 20 11:07:38 o1 corosync[12829]: [TOTEM ] Retransmit List: d62 Dec 20 11:07:41 o1 corosync[12829]: [TOTEM ] Retransmit List: d68 Dec 20 11:07:41 o1 corosync[12829]: [TOTEM ] Retransmit List: d69 Dec 20 11:07:41 o1 corosync[12829]: [TOTEM ] Retransmit List: d6a Dec 20 11:07:41 o1 corosync[12829]: [TOTEM ] Retransmit List: d6a Dec 20 11:07:41 o1 corosync[12829]: [TOTEM ] Retransmit List: d6c Dec 20 11:07:41 o1 corosync[12829]: [TOTEM ] Marking ringid 0 interface 172.20.3.31 FAULTY Dec 20 11:07:41 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 Dec 20 11:07:41 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a Dec 20 11:07:41 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a d7b Dec 20 11:07:41 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a d7b d7c Dec 20 11:07:41 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a d7b d7c d7d Dec 20 11:07:41 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a d7b d7c d7d d7e Dec 20 11:07:41 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a d7b d7c d7d d7e Dec 20 11:07:41 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a d7b d7c d7d d7e Dec 20 11:07:41 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a d7b d7c d7d d7e Dec 20 11:07:41 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a d7b d7c d7d d7e Dec 20 11:07:41 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a d7b d7c d7d d7e Dec 20 11:07:41 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a d7b d7c d7d d7e Dec 20 11:07:41 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a d7b d7c d7d d7e Dec 20 11:07:41 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a d7b d7c d7d d7e d7f Dec 20 11:07:41 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a d7b d7c d7d d7e d7f d80 Dec 20 11:07:41 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a 
d7b d7c d7d d7e d7f d80 d81 Dec 20 11:07:41 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a d7b d7c d7d d7e d7f d80 d81 d82 Dec 20 11:07:41 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a d7b d7c d7d d7e d7f d80 d81 d82 Dec 20 11:07:41 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a d7b d7c d7d d7e d7f d80 d81 d82 [...] Dec 20 11:07:42 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a d7b d7c d7d d7e d7f d80 d81 d82 Dec 20 11:07:42 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a d7b d7c d7d d7e d7f d80 d81 d82 Dec 20 11:07:42 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a d7b d7c d7d d7e d7f d80 d81 d82 Dec 20 11:07:42 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a d7b d7c d7d d7e d7f d80 d81 d82 Dec 20 11:07:42 o1 corosync[12829]: [TOTEM ] Automatically recovered ring 0 Dec 20 11:07:42 o1 corosync[12829]: [TOTEM ] Automatically recovered ring 0 Dec 20 11:07:42 o1 corosync[12829]: [TOTEM ] Automatically recovered ring 0 Dec 20 11:07:42 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a d7b d7c d7d d7e d7f d80 d81 d82 Dec 20 11:07:42 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a d7b d7c d7d d7e d7f d80 d81 d82 Dec 20 11:07:42 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a d7b d7c d7d d7e d7f d80 d81 d82 Dec 20 11:07:42 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a d7b d7c d7d d7e d7f d80 d81 d82 Dec 20 11:07:43 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a d7b d7c d7d d7e d7f d80 d81 d82 Dec 20 11:07:43 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a d7b d7c d7d d7e d7f d80 d81 d82 Dec 20 11:07:43 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a d7b d7c d7d d7e d7f d80 d81 d82 Dec 20 11:07:43 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a d7b d7c d7d d7e d7f d80 d81 d82 Dec 20 11:07:43 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a d7b d7c d7d d7e d7f d80 d81 d82 Dec 20 11:07:44 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a d7b d7c d7d d7e d7f d80 d81 d82 Dec 
20 11:07:44 o1 corosync[12829]: [TOTEM ] Marking ringid 0 interface 172.20.3.31 FAULTY Dec 20 11:07:44 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a d7b d7c d7d d7e d7f d80 d81 d82 Dec 20 11:07:44 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a d7b d7c d7d d7e d7f d80 d81 d82 Dec 20 11:07:44 o1 corosync[12829]: [TOTEM ] Automatically recovered ring 0 Dec 20 11:07:44 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a d7b d7c d7d d7e d7f d80 d81 d82 Dec 20 11:07:44 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a d7b d7c d7d d7e d7f d80 d81 d82 Dec 20 11:07:45 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a d7b d7c d7d d7e d7f d80 d81 d82 Dec 20 11:07:45 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a d7b d7c d7d d7e d7f d80 d81 d82 Dec 20 11:08:12 o1 corosync[12829]: [TOTEM ] Retransmit List: d78 d79 d7a d7b d7c d7d d7e d7f d80 d81 Dec 20 11:08:12 o1 corosync[12829]: [TOTEM ] Retransmit List: d78 d79 d7a d7b d7c d7d d7e d7f d80 d81 Dec 20 11:08:12 o1 corosync[12829]: [TOTEM ] Retransmit List: d78 d79 d7a d7b d7c d7d d7e d7f d80 d81 Dec 20 11:08:12 o1 corosync[12829]: [TOTEM ] Automatically recovered ring 1 Dec 20 11:08:12 o1 corosync[12829]: [TOTEM ] Automatically recovered ring 1 Dec 20 11:08:12 o1 corosync[12829]: [TOTEM ] Retransmit List: d78 d79 d7a d7b d7c d7d d7e d7f d80 d81 Dec 20 11:08:13 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a d7b d7c d7d d7e d7f d80 d81 d82 Dec 20 11:08:13 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a d7b d7c d7d d7e d7f d80 d81 d82 Dec 20 11:08:13 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a d7b d7c d7d d7e d7f d80 d81 d82 Dec 20 11:08:13 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a d7b d7c d7d d7e d7f d80 d81 d82 Dec 20 11:08:13 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a d7b d7c d7d d7e d7f d80 d81 d82 d96 Dec 20 11:08:13 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a d7b d7c d7d d7e d7f d80 d81 d82 Dec 20 11:08:13 o1 corosync[12829]: [TOTEM ] 
Retransmit List: d79 d7a d7b d7c d7d d7e d7f d80 d81 d82 d98 Dec 20 11:08:13 o1 corosync[12829]: [TOTEM ] Marking ringid 0 interface 172.20.3.31 FAULTY Dec 20 11:08:13 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a d7b d7c d7d d7e d7f d80 d81 d82 d98 Dec 20 11:08:13 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a d7b d7c d7d d7e d7f d80 d81 d82 d98 Dec 20 11:08:13 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a d7b d7c d7d d7e d7f d80 d81 d82 d98 d9a Dec 20 11:08:13 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a d7b d7c d7d d7e d7f d80 d81 d82 d98 d9a Dec 20 11:08:13 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a d7b d7c d7d d7e d7f d80 d81 d82 d98 d9a Dec 20 11:08:13 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a d7b d7c d7d d7e d7f d80 d81 d82 d98 d9a Dec 20 11:08:13 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a d7b d7c d7d d7e d7f d80 d81 d82 d98 Dec 20 11:08:13 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a d7b d7c d7d d7e d7f d80 d81 d82 d98 Dec 20 11:08:13 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a d7b d7c d7d d7e d7f d80 d81 d82 d98 Dec 20 11:08:13 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a d7b d7c d7d d7e d7f d80 d81 d82 d98 d9c Dec 20 11:08:13 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a d7b d7c d7d d7e d7f d80 d81 d82 d98 d9c d9d Dec 20 11:08:13 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a d7b d7c d7d d7e d7f d80 d81 d82 d98 d9c d9d d9e Dec 20 11:08:13 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a d7b d7c d7d d7e d7f d80 d81 d82 d98 d9c d9d d9e Dec 20 11:08:13 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a d7b d7c d7d d7e d7f d80 d81 d82 d98 d9c d9d d9e Dec 20 11:08:13 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a d7b d7c d7d d7e d7f d80 d81 d82 d98 d9c d9d d9e Dec 20 11:08:13 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a d7b d7c d7d d7e d7f d80 d81 d82 d98 d9c d9d d9e Dec 20 11:08:13 o1 corosync[12829]: [TOTEM ] Retransmit List: 
d79 d7a d7b d7c d7d d7e d7f d80 d81 d82 d98 d9c d9d d9e Dec 20 11:08:13 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a d7b d7c d7d d7e d7f d80 d81 d82 d98 d9c d9d d9e Dec 20 11:08:13 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a d7b d7c d7d d7e d7f d80 d81 d82 d98 d9c d9d d9e da0 Dec 20 11:08:13 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a d7b d7c d7d d7e d7f d80 d81 d82 d98 d9c d9d d9e da0 Dec 20 11:08:13 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a d7b d7c d7d d7e d7f d80 d81 d82 d98 d9c d9d d9e da0 Dec 20 11:08:13 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a d7b d7c d7d d7e d7f d80 d81 d82 d98 d9c d9d d9e da0 da1 Dec 20 11:08:13 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a d7b d7c d7d d7e d7f d80 d81 d82 d98 d9c d9d d9e da0 da1 da2 Dec 20 11:08:13 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a d7b d7c d7d d7e d7f d80 d81 d82 d98 d9c d9d d9e da0 da1 da2 da3 Dec 20 11:08:13 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a d7b d7c d7d d7e d7f d80 d81 d82 d98 d9c d9d d9e da0 da1 da2 da3 Dec 20 11:08:13 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a d7b d7c d7d d7e d7f d80 d81 d82 d98 d9c d9d d9e da0 da1 da2 da3 da4 Dec 20 11:08:13 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a d7b d7c d7d d7e d7f d80 d81 d82 d98 d9c d9d d9e da0 da1 da2 da3 da4 da5 Dec 20 11:08:13 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a d7b d7c d7d d7e d7f d80 d81 d82 d98 d9c d9d d9e da0 da1 da2 da3 da4 da5 da6 Dec 20 11:08:13 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a d7b d7c d7d d7e d7f d80 d81 d82 d98 d9c d9d d9e da0 da1 da2 da3 da4 da5 da6 Dec 20 11:08:13 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a d7b d7c d7d d7e d7f d80 d81 d82 d98 d9c d9d d9e da0 da1 da2 da3 da4 da5 da6 da7 Dec 20 11:08:13 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a d7b d7c d7d d7e d7f d80 d81 d82 d98 d9c d9d d9e da0 da1 da2 da3 da4 da5 da6 da7 da8 Dec 20 11:08:13 o1 corosync[12829]: [TOTEM ] Retransmit 
List: da8 d79 d7a d7b d7c d7d d7e d7f d80 d81 d98 d9c d9d d9e da0 da1 da2 da3 da4 da5 da6 da7 da9 Dec 20 11:08:13 o1 corosync[12829]: [TOTEM ] Retransmit List: da9 d79 d7a d7b d7c d7d d7e d7f d80 d81 d98 d9c d9d d9e da0 da1 da2 da3 da4 da5 da6 da7 da8 Dec 20 11:08:13 o1 corosync[12829]: [TOTEM ] Retransmit List: da8 d79 d7a d7b d7c d7d d7e d7f d80 d81 d98 d9c d9d d9e da0 da1 da2 da3 da4 da5 da6 da7 da9 [...] Dec 20 11:08:13 o1 corosync[12829]: [TOTEM ] Retransmit List: da8 d79 d7a d7b d7c d7d d7e d7f d80 d81 d98 d9c d9d d9e da0 da1 da2 da3 da4 da5 da6 da7 da9 Dec 20 11:08:13 o1 corosync[12829]: [TOTEM ] Retransmit List: da9 d79 d7a d7b d7c d7d d7e d7f d80 d81 d98 d9c d9d d9e da0 da1 da2 da3 da4 da5 da6 da7 da8 Dec 20 11:08:13 o1 corosync[12829]: [TOTEM ] Retransmit List: da8 d79 d7a d7b d7c d7d d7e d7f d80 d81 d98 d9c d9d d9e da0 da1 da2 da3 da4 da5 da6 da7 da9 Dec 20 11:08:13 o1 corosync[12829]: [TOTEM ] Retransmit List: da9 d79 d82 d98 d9c d9d d9e da0 da1 da2 da3 da4 da5 da6 da7 da8 Dec 20 11:08:14 o1 corosync[12829]: [TOTEM ] Retransmit List: d98 d9c d9d d9e da0 da1 da2 da3 da4 da5 da6 da7 da8 da9 Dec 20 11:08:14 o1 corosync[12829]: [TOTEM ] Retransmit List: d98 d9c d9d d9e da0 da1 da2 da3 da4 da5 da6 da7 da8 da9 Dec 20 11:08:14 o1 corosync[12829]: [TOTEM ] Retransmit List: d98 d9c d9d d9e da0 da1 da2 da3 da4 da5 da6 da7 da8 da9 Dec 20 11:08:14 o1 corosync[12829]: [TOTEM ] Automatically recovered ring 0 Dec 20 11:08:14 o1 corosync[12829]: [TOTEM ] Retransmit List: d9b d9f d98 d9c d9d d9e da0 da1 da2 da3 da4 da5 da6 da7 da8 da9 Dec 20 11:08:14 o1 corosync[12829]: [TOTEM ] Retransmit List: d9c d9e da1 da3 da5 da7 da9 Dec 20 11:08:14 o1 corosync[12829]: [TOTEM ] Retransmit List: d9e da3 da7 Dec 20 11:08:14 o1 corosync[12829]: [TOTEM ] Retransmit List: da3 Dec 20 11:08:14 o1 corosync[12829]: [TOTEM ] Retransmit List: da3 dcc dcd Dec 20 11:08:14 o1 corosync[12829]: [TOTEM ] Retransmit List: dcc Dec 20 11:08:14 o1 corosync[12829]: [TOTEM ] Retransmit List: dcf Dec 20 
11:08:14 o1 corosync[12829]: [TOTEM ] Retransmit List: dcf
Dec 20 11:08:14 o1 corosync[12829]: [TOTEM ] Retransmit List: dcf dd0
Dec 20 11:08:14 o1 corosync[12829]: [TOTEM ] Retransmit List: dd0 dd1
Dec 20 11:08:14 o1 corosync[12829]: [TOTEM ] Retransmit List: dd0
Dec 20 11:08:14 o1 corosync[12829]: [TOTEM ] Retransmit List: dd4
Dec 20 11:08:14 o1 corosync[12829]: [TOTEM ] Retransmit List: dd7
Dec 20 11:08:14 o1 corosync[12829]: [TOTEM ] Retransmit List: dd9
Dec 20 11:08:14 o1 corosync[12829]: [TOTEM ] Retransmit List: dda
[...]
Dec 20 11:08:15 o1 corosync[12829]: [TOTEM ] Retransmit List: e74
Dec 20 11:08:15 o1 corosync[12829]: [TOTEM ] Retransmit List: e76
Dec 20 11:08:15 o1 corosync[12829]: [TOTEM ] Retransmit List: e76 e77
Dec 20 11:08:15 o1 corosync[12829]: [TOTEM ] Retransmit List: e76 e78
Dec 20 11:08:15 o1 corosync[12829]: [TOTEM ] Retransmit List: e76
Dec 20 11:08:15 o1 corosync[12829]: [TOTEM ] Retransmit List: e76
Dec 20 11:08:15 o1 corosync[12829]: [TOTEM ] Marking ringid 1 interface 192.168.0.61 FAULTY
Dec 20 11:08:16 o1 corosync[12829]: [TOTEM ] Automatically recovered ring 1
Dec 20 11:08:16 o1 corosync[12829]: [TOTEM ] Automatically recovered ring 1
Dec 20 11:08:16 o1 corosync[12829]: [TOTEM ] Automatically recovered ring 1
Dec 20 11:08:19 o1 corosync[12829]: [TOTEM ] Retransmit List: e82
Dec 20 11:08:19 o1 corosync[12829]: [TOTEM ] Retransmit List: e85
Dec 20 11:08:24 o1 cib: [12867]: info: cib_stats: Processed 61 operations (0.00us average, 0% utilization) in the last 10min

So there was a significant "blackout" of communications. I have always wondered whether this is purely a software problem. At the same time I had an even longer retransmit list on another node, while some nodes showed no problem at all:

Dec 20 11:08:11 o5 corosync[12677]: [TOTEM ] Retransmit List: d65 d66 d67 d68 d69 d6a d6b d6c d6d d6e d6f d70 d71 d72 d73 d74 d75 d76 d77 d78 d79 d7a d7b d7c d7d d7e d7f d80 d81 d82

Does anybody know what causes these messages?
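To compare nodes, it may help to condense the "Retransmit List" lines into one figure per host and second: how many such lines were logged and how long the longest list was. The following is only a sketch under assumptions: the regular expression is modeled on the syslog lines quoted in this post, and the names `summarize_retransmits`, `RETRANSMIT` and `SAMPLE` are made up for illustration. It does not explain the cause of the messages; it only makes the size of a burst easier to see.

```python
import re
from collections import defaultdict

# Pattern modeled on the corosync syslog lines quoted in this post, e.g.
#   Dec 20 11:07:41 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a
# Other syslog layouts would need a different expression (assumption).
RETRANSMIT = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+(?P<host>\S+)\s+"
    r"corosync\[\d+\]:\s+\[TOTEM\s*\]\s+Retransmit List:(?P<seqs>[0-9a-f\s]*)$"
)


def summarize_retransmits(lines):
    """Map (host, timestamp) to (number of list lines, longest list)."""
    summary = defaultdict(lambda: [0, 0])
    for line in lines:
        match = RETRANSMIT.match(line.strip())
        if match is None:
            continue  # not a "Retransmit List" line
        entry = summary[(match["host"], match["stamp"])]
        entry[0] += 1
        entry[1] = max(entry[1], len(match["seqs"].split()))
    return {key: tuple(value) for key, value in summary.items()}


# A few lines copied from the logs in this post (an arbitrary selection).
SAMPLE = [
    "Dec 20 11:07:41 o1 corosync[12829]: [TOTEM ] Retransmit List: d79",
    "Dec 20 11:07:41 o1 corosync[12829]: [TOTEM ] Retransmit List: d79 d7a",
    "Dec 20 11:07:41 o1 corosync[12829]: [TOTEM ] Marking ringid 0 interface 172.20.3.31 FAULTY",
    "Dec 20 11:08:14 o1 corosync[12829]: [TOTEM ] Retransmit List: d9e da3 da7",
    "Dec 20 11:08:11 o5 corosync[12677]: [TOTEM ] Retransmit List: d65 d66 d67 d68 d69 "
    "d6a d6b d6c d6d d6e d6f d70 d71 d72 d73 d74 d75 d76 d77 d78 d79 d7a d7b d7c d7d "
    "d7e d7f d80 d81 d82",
]

if __name__ == "__main__":
    for (host, stamp), (count, longest) in sorted(summarize_retransmits(SAMPLE).items()):
        print(f"{stamp} {host}: {count} list line(s), longest has {longest} entries")
```

Fed with the complete syslog of each node (one log entry per line) instead of `SAMPLE`, the same function would show which nodes were affected during which seconds.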
Regards,
Ulrich
