On 11/11/2016 at 19:43, Dietmar Maurer wrote:
> On November 11, 2016 at 6:41 PM Dhaussy Alexandre
> <[email protected]> wrote:
>>> you lost quorum, and the watchdog expired - that is how the watchdog
>>> based fencing works.
>> I don't expect to lose quorum when _one_ node joins or leaves the cluster.
> This was probably a long time before - but I have not read through the whole
> logs ...

That makes no sense to me. The fact is: everything has been working fine for weeks.
What I can see in the logs is several cluster nodes rebooting suddenly, exactly one minute after a node joined and/or left the cluster. I see no problems with corosync/lrm/crm before that. This points to a probable network (multicast) malfunction.

I did a bit of homework, reading the wiki about the HA manager. What I understand so far is that every state/service change from the LRM must be acknowledged (cluster-wide) by the CRM master. So if a multicast disruption occurs, I assume the LRM would not be able to talk to the CRM master, and then it also could not reset the watchdog. Am I right?

Another thing: I have checked my network configuration, and the cluster IP is set on a Linux bridge. By default multicast_snooping is set to 1 on a Linux bridge, so I think there is a good chance this is the source of my problems. Note that we don't use IGMP snooping; it is disabled on almost all network switches.

Plus, I found a post by A. Derumier (yes, 3 years old). He had similar issues with bridges and multicast.
http://pve.proxmox.com/pipermail/pve-devel/2013-March/006678.html
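
For anyone who wants to check the same setting on their own nodes, here is a minimal sketch in Python. It only reads the bridge's multicast_snooping flag from sysfs and changes nothing. The bridge name vmbr0 and the helper names are assumptions for illustration (use whatever bridge carries your cluster IP), and the sysfs_root parameter exists only so the function can be pointed at a test directory.

```python
from pathlib import Path


def multicast_snooping_enabled(bridge, sysfs_root="/sys/class/net"):
    """Return True or False for the bridge's multicast_snooping flag.

    Returns None when the interface is missing or is not a bridge
    (the sysfs file cannot be read in either case).
    """
    flag = Path(sysfs_root) / bridge / "bridge" / "multicast_snooping"
    try:
        return flag.read_text().strip() == "1"
    except OSError:
        return None


def describe(bridge, sysfs_root="/sys/class/net"):
    """Build a one-line, human-readable summary of the flag."""
    state = multicast_snooping_enabled(bridge, sysfs_root)
    if state is None:
        return f"{bridge}: not present, or not a bridge"
    if state:
        return f"{bridge}: multicast_snooping=1 (enabled)"
    return f"{bridge}: multicast_snooping=0 (disabled)"


if __name__ == "__main__":
    # "vmbr0" is an assumed bridge name, not taken from my configuration.
    print(describe("vmbr0"))
```

If the flag turns out to be 1, writing 0 to the same sysfs file as root should turn snooping off on that bridge until the next reboot. Whether that actually cures the reboots described above is still only my guess.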
