Good morning,

I am not subscribed to the list yet (waiting on confirmation), so please CC me on all replies.
My employer has several deployments of Pacemaker on top of Corosync, and we have recently been hitting this:

Jul 18 12:01:05 xxxx corosync[6065]: [TOTEM ] FAILED TO RECEIVE
Jul 18 12:01:15 xxxx corosync[6065]: last message repeated 15 times
Jul 18 12:01:15 xxxx corosync[6065]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 268: memb=1, new=0, lost=4

Every node in the cluster, every single one (not just the failed node), then considers itself DC in Pacemaker, and all of them have a no-quorum event. I have found no explanation for this anywhere aside from the source, which I'm afraid I haven't devoted enough time to understanding. I see the failure at exec/totemsrp.c:3548, but I have no idea why we're hitting it.

I did see this, which sounds promising:

http://marc.info/?l=openais&m=131074804115507&w=2

I noticed Corosync 1.4.0 just shipped, too. Is this something that would resolve this issue?

I can provide details of our switch hardware off-list if necessary. I'm just looking for a little bit of guidance here (and some light shed on why this occurs).

I appreciate the assistance,

--
Jed Smith
[email protected]

_______________________________________________
Openais mailing list
[email protected]
https://lists.linux-foundation.org/mailman/listinfo/openais
