> I observe strange problems with fencing when a cluster loses quorum for a
> short time.
>
> After regaining quorum, fenced reports 'wait state messages', and the whole
> cluster is blocked waiting for fenced.

Just found the following in fenced/cpg.c:

    /* This is how we deal with cpg's that are partitioned and
       then merge back together.  When the merge happens, the
       cpg on each side will see nodes from the other side being
       added, and neither side will have zero started_count.  So,
       both sides will ignore start messages from the other side.
       This causes the domain on each side to continue waiting
       for the missing start messages indefinitely.  To unblock
       things, all nodes from one side of the former partition
       need to fail. */

So the observed behavior is expected?
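
If I read that comment correctly, the check amounts to something like the sketch below. This is only my own simplification based on the comment text: struct node_state, accept_start_message() and the sender_was_added flag are made-up names for illustration, not the actual fenced code or data structures.

```c
#include <stdbool.h>

/* Hypothetical, simplified per-node state (not the real fenced structs). */
struct node_state {
    int nodeid;
    int started_count;  /* nonzero once the node has been running in the domain */
};

/*
 * Decide whether the receiver should act on a start message from a sender.
 * sender_was_added: the sender appeared as a new member in the cpg change
 * that this start message belongs to.
 *
 * Assumption taken from the comment: a genuinely new node joins with
 * started_count == 0.  A node that shows up as "added" but already has a
 * nonzero started_count must have been running on the other side of a
 * partition, so its start message is ignored.
 */
static bool accept_start_message(const struct node_state *receiver,
                                 const struct node_state *sender,
                                 bool sender_was_added)
{
    if (sender_was_added &&
        receiver->started_count > 0 &&
        sender->started_count > 0)
        return false;   /* merge after partition: ignore */

    return true;        /* fresh join or existing member: process normally */
}
```

With two nodes that were both already running before the split, the check fails in both directions, so each side keeps waiting for start messages it will never accept. That would match the hang I see after quorum is regained.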