A process pause is a problem in the operating system, not in Corosync. Totem is not designed to handle long scheduling delays by the operating system, and the roughly 3,464 seconds reported in your logs (nearly an hour) is an extraordinarily long process pause. My only recommendation is to shut down the cluster prior to the snapshot operation; since the system is out of service during that period anyway (nothing is being scheduled by the operating system), there is no harm. When I actively contributed to corosync development, I often ran totem with a token of 200 msec, and I recommend lower token timer values rather than higher ones (see the example stanza after the quoted configuration below). The fact that Hyper-V blocks all operating system activity for close to an hour seems like a very serious problem, likely to blow up all kinds of fault-detection timers in various software, not just Corosync.
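To sketch that shutdown-before-snapshot workflow, something along these lines, run against one node at a time, would keep the cluster consistent. This is only an outline, assuming pcs manages pacemaker/corosync on the guests and the checkpoint is taken from the Hyper-V host; the node and VM names are made up:

    # On node1 (guest): move resources away and leave the cluster cleanly
    pcs cluster standby node1      # drain resources from this node
    pcs cluster stop               # stop pacemaker and corosync on this node

    # On the Hyper-V host (PowerShell): snapshot while the guest holds no resources
    Checkpoint-VM -Name "node1-vm" -SnapshotName "nightly-backup"

    # On node1 (guest): rejoin the cluster and take resources back,
    # then repeat the whole sequence for node2
    pcs cluster start
    pcs cluster unstandby node1

While a node sits in standby, the surviving node runs all services, so there is no window in which both nodes can mount the DRBD partition.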
Regards, -steve

On Tue, Dec 22, 2015 at 8:32 AM, Fabio M. Di Nitto <[email protected]> wrote:
>
> On 12/21/2015 10:32 PM, Ludovic Zammit wrote:
> > Hello,
> >
> > I'm running a CentOS 6.7 cluster of 2 nodes on a Hyper-V hypervisor.
> > Every day at 11 PM a snapshot job saves both servers.
> > The snapshotting process seems to cause a loss of connectivity between
> > the two nodes, which results in the cluster partitioning and pacemaker
> > starting services on both nodes.
> > Then, once the snapshotting is done, the two halves of the cluster are
> > able to see each other again and pacemaker chooses one on which to run
> > the services.
> > Unfortunately that means our DRBD partition has been mounted on both,
> > so it now goes into « split brain mode ».
> >
> > When I was running corosync 1.4, I used to adjust the « token » variable
> > in the configuration file so that both nodes would wait longer before
> > detecting a loss of the other.
> >
> > Now that I have upgraded to corosync 2 (2.3.5 to be more precise), the
> > problem is back with a vengeance.
> >
> > I have tried the configuration below, with a very high totem value,
> > and that resulted in the following errors (I have since reverted that
> > change):
>
> Bad idea to increase the totem timeout very high: it means that any fault
> detection between nodes will take forever.
>
> > Dec 21 08:59:13 [16696] node1 corosync notice [TOTEM ] totemsrp.c:783
> > Process pause detected for 3464149 ms, flushing membership messages.
> > Dec 21 08:59:13 [16696] node1 corosync notice [TOTEM ] totemsrp.c:783
> > Process pause detected for 3464149 ms, flushing membership messages.
> > Dec 21 08:59:13 [16696] node1 corosync notice [TOTEM ] totemsrp.c:783
> > Process pause detected for 3464199 ms, flushing membership messages.
> >
> > What can I do to prevent the cluster splitting apart during those
> > nightly snapshots?
>
> Either use another backup method, or you need to stop the cluster on the
> VM you are about to snapshot, take the snapshot, start the cluster
> again, and move on to the next.
>
> > How do I manually set a long totem timeout without breaking everything
> > else?
>
> The problem has nothing to do with just the totem timeout; the problem is
> that the VM was frozen for at least '3464199 ms' without being scheduled
> by the hypervisor. So even a very high token timeout would not solve
> the problem of services running on that specific VM NOT being available
> during the snapshot.
> Fabio
>
> > ======================================================================
> >
> > Software versions:
> > 2.6.32-573.7.1.el6.x86_64
> >
> > corosync-2.3.5-1.el6.x86_64
> > corosynclib-2.3.5-1.el6.x86_64
> >
> > pacemaker-cluster-libs-1.1.13-1.el6.x86_64
> > pacemaker-cli-1.1.13-1.el6.x86_64
> >
> > kmod-microsoft-hyper-v-4.0.11-20150728.x86_64
> > microsoft-hyper-v-4.0.11-20150728.x86_64
> >
> > Configuration:
> >
> > totem {
> >     version: 2
> >
> >     crypto_cipher: none
> >     crypto_hash: none
> >     clear_node_high_bit: yes
> >     cluster_name: cluster
> >     transport: udpu
> >     token: 150000
> >
> >     interface {
> >         ringnumber: 0
> >         bindnetaddr: 10.200.0.2
> >         mcastport: 5405
> >         ttl: 1
> >     }
> > }
> >
> > nodelist {
> >     node {
> >         ring0_addr: 10.200.0.2
> >     }
> >
> >     node {
> >         ring0_addr: 10.200.0.3
> >     }
> > }
> >
> > logging {
> >     fileline: on
> >     to_stderr: no
> >     to_logfile: yes
> >     logfile: /var/log/cluster/corosync.log
> >     to_syslog: yes
> >     debug: off
> >     timestamp: on
> >     logger_subsys {
> >         subsys: QUORUM
> >         debug: off
> >     }
> > }
> >
> > quorum {
> >     provider: corosync_votequorum
> >     two_node: 1
> > }
> >
> > Thank you for your help,
> > —
> >
> > Ludovic Zammit
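For contrast with the token: 150000 in the quoted configuration, the short-token approach I describe above would look something like this. The values are illustrative only; a 200 msec token assumes a fast, lightly loaded local network and should be tested before production use:

    totem {
        version: 2
        cluster_name: cluster
        transport: udpu
        # Declare token loss (and start reforming membership) after 200 msec
        # instead of waiting 150 seconds for a dead or frozen peer
        token: 200
    }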
