The switch was our first thought, but it has been swapped, and while nodes are no longer being fenced (they were daily), this anomaly remains.
I will ask for those logs and the conf on Monday.

I think it might be worth reinstalling corosync on this box anyway? It can't be healthy if it is exiting uncleanly.

I have had reports of rgmanager dying on this box (pid file present, but the process not running). Could that be related?

Thanks :)

On Saturday, December 10, 2011, Digimer <li...@alteeve.com> wrote:
> On 12/10/2011 03:32 PM, Matthew Painter wrote:
>> Hi all,
>>
>> We are trying to get to the bottom of some odd intermittent behavior on
>> a cluster. We are intermittently seeing nodes leave and rejoin clusters,
>> without being fenced. Further, the gap between leaving and re-joining is 8
>> minutes. We are monitoring the latency between boxes, and it is
>> acceptable (<5ms).
>>
>> How can nodes exhibit this behavior? There seems to be no impact on the
>> services running on the box, just this leaving and re-joining. The SNMP
>> messages are below.
>>
>> All help decoding this gratefully received! :)
>>
>> Thanks,
>>
>> Matt
>>
>>
>> Sat Dec 10 15:22:00 GMT 2011: cluster3.localdomain
>> DISMAN-EVENT-MIB::sysUpTimeInstance = 3:2:52:23.35,
>> SNMPv2-MIB::snmpTrapOID.0 = COROSYNC-MIB::corosyncNoticesNodeStatus,
>> COROSYNC-MIB::corosyncObjectsNodeName.0 = "cluster1.localdomain",
>> COROSYNC-MIB::corosyncObjectsNodeID.0 = 1,
>> COROSYNC-MIB::corosyncObjectsNodeAddress.0 = "10.79.202.1",
>> COROSYNC-MIB::corosyncObjectsNodeStatus.0 = "left"
>>
>> Sat Dec 10 15:30:25 GMT 2011: cluster3.localdomain
>> DISMAN-EVENT-MIB::sysUpTimeInstance = 3:3:00:48.75,
>> SNMPv2-MIB::snmpTrapOID.0 = COROSYNC-MIB::corosyncNoticesNodeStatus,
>> COROSYNC-MIB::corosyncObjectsNodeName.0 = "cluster1.localdomain",
>> COROSYNC-MIB::corosyncObjectsNodeID.0 = 1,
>> COROSYNC-MIB::corosyncObjectsNodeAddress.0 = "10.79.202.1",
>> COROSYNC-MIB::corosyncObjectsNodeStatus.0 = "joined"
>
> My first instinct is to point to multicast issues in your switch, but
> then, I'd expect the node to get fenced. That said, any unexpected
> disconnect should fire a fence, so it would seem like the node is
> cleanly stopping/restarting corosync.
>
> Can you share your configuration and, ideally, anything in syslog from
> all involved nodes starting from just before the disconnect and
> continuing through to after the node rejoins?
>
> --
> Digimer
> E-Mail: digi...@alteeve.com
> Freenode handle: digimer
> Papers and Projects: http://alteeve.com
> Node Assassin: http://nodeassassin.org
> "omg my singularity battery is dead again.
> stupid hawking radiation." - epitron
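
As a sanity check on the 8-minute figure, the gap between the "left" and "joined" traps quoted above can be computed from their timestamps. This is only a minimal sketch: it assumes the timestamp layout seen in those two traps, an English/C locale for the day and month names, and the helper names (`parse_trap_time`, `gap_seconds`) are made up for illustration.

```python
from datetime import datetime

# Timestamp layout assumed from the two traps quoted in this thread,
# e.g. "Sat Dec 10 15:22:00 GMT 2011". "GMT" is matched as literal text.
TRAP_TIME_FORMAT = "%a %b %d %H:%M:%S GMT %Y"


def parse_trap_time(stamp):
    """Parse the leading timestamp of a trap line (hypothetical helper)."""
    return datetime.strptime(stamp, TRAP_TIME_FORMAT)


def gap_seconds(left_stamp, joined_stamp):
    """Seconds elapsed between a 'left' trap and the following 'joined' trap."""
    delta = parse_trap_time(joined_stamp) - parse_trap_time(left_stamp)
    return delta.total_seconds()


# Timestamps copied from the two traps in the original message.
left = "Sat Dec 10 15:22:00 GMT 2011"
joined = "Sat Dec 10 15:30:25 GMT 2011"

gap = gap_seconds(left, joined)
minutes, seconds = divmod(int(gap), 60)
print("gap: %d min %d s" % (minutes, seconds))
```

For the two traps above this gives 8 minutes 25 seconds, which agrees with the difference between the two sysUpTimeInstance values reported by cluster3 (3:2:52:23.35 and 3:3:00:48.75).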
--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster