A process pause like this is an operating-system-level problem.  Totem is not
designed to handle long scheduling delays, and the roughly 3464 seconds
reported in your logs (nearly an hour) is an extremely long pause.  My only
recommendation is to shut down the cluster prior to the snapshot operation.
Since the system is effectively out of service during that period anyway
(nothing is being scheduled while the VM is frozen), there is no harm in
doing so.  When I actively contributed to corosync development, I often ran
totem with a token of 200 msec, and I generally recommend lower token timer
values, not higher ones.  The fact that Hyper-V blocks all operating-system
activity for close to an hour seems like a very serious problem, likely to
trip all kinds of fault-detection timers in various software, not just
corosync.
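
For illustration, a totem section along those lines might look like the
sketch below.  The 200 msec figure is just what I used to run with, not a
universal recommendation; the right value depends on your network, and note
that token is given in milliseconds:

totem {
    version: 2
    cluster_name: cluster
    transport: udpu
    # a short token timeout detects real failures within ~200 ms,
    # at the price of being sensitive to scheduling and network delays
    token: 200
}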

Regards,
-steve

On Tue, Dec 22, 2015 at 8:32 AM, Fabio M. Di Nitto <[email protected]>
wrote:

>
>
> On 12/21/2015 10:32 PM, Ludovic Zammit wrote:
> > Hello,
> >
> > I'm running a centos 6.7 cluster of 2 nodes on a Hyper-V hypervisor.
> > Every day at 11PM a snapshot job save both servers.
> > The snapshotting process seems to cause a loss of connectivity between
> > the two nodes, which results in the cluster partitioning and pacemaker
> > starting services on both nodes.
> > Then once the snapshotting is done, the two halves of the cluster are
> > able to see each other again and pacemaker chooses one on which to run
> > the services.
> > Unfortunately that means that our DRBD partition has been mounted on
> > both nodes, so it now goes into "split brain" mode.
> >
> >
> > When I was running corosync 1.4, I used to adjust the "token" variable
> > in the configuration file so that both nodes would wait longer before
> > detecting a loss of the other.
> >
> > Now that I have upgraded to corosync 2 (2.3.5 to be more precise) the
> > problem is back with a vengeance.
> >
> > I have tried the configuration below, with a very high token value,
> > and that resulted in the following errors (I have since reverted that
> > change):
>
> It is a bad idea to set the token timeout very high.  It means that any
> fault detection between nodes will take forever.
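>
> For a sense of scale: the token value is in milliseconds, so the
> token: 150000 in the configuration below means a genuinely dead node
> would go undetected for roughly 150 seconds before recovery could even
> begin.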
>
> >
> > Dec 21 08:59:13 [16696] node1 corosync notice  [TOTEM ] totemsrp.c:783
> > Process pause detected for 3464149 ms, flushing membership messages.
> > Dec 21 08:59:13 [16696] node1 corosync notice  [TOTEM ] totemsrp.c:783
> > Process pause detected for 3464149 ms, flushing membership messages.
> > Dec 21 08:59:13 [16696] node1 corosync notice  [TOTEM ] totemsrp.c:783
> > Process pause detected for 3464199 ms, flushing membership messages.
> >
> >
> > What can I do to prevent the cluster splitting apart during those
> > nightly snapshots?
>
> Either use another backup method, or stop the cluster on the VM you are
> about to snapshot, take the snapshot, start the cluster again, and then
> move on to the next node, as outlined below.
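>
> In practice (a rough outline, using the CentOS 6 init scripts) that means,
> for each node in turn: run "service pacemaker stop" and then
> "service corosync stop" on the guest, take the Hyper-V snapshot, run
> "service corosync start" and then "service pacemaker start", and wait for
> the node to rejoin the cluster before snapshotting the other one.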
>
> > How do I manually set a long totem timeout without breaking everything
> > else?
> >
>
> The problem has nothing to do with just the token timeout; the problem is
> that the VM was frozen for at least 3464199 ms without being scheduled
> by the hypervisor.  So even a very high token timeout would not solve
> the problem of services running on that specific VM NOT being available
> during the snapshot.
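>
> (3464199 ms is roughly 58 minutes, so the token timeout would have to be
> set to over an hour to ride out a pause like that, and for that entire
> hour a real node failure would also go undetected.)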
>
> Fabio
>
>
> >
> >
> >
> > ======================================================================
> >
> > Software version:
> > 2.6.32-573.7.1.el6.x86_64
> >
> > corosync-2.3.5-1.el6.x86_64
> > corosynclib-2.3.5-1.el6.x86_64
> >
> > pacemaker-cluster-libs-1.1.13-1.el6.x86_64
> > pacemaker-cli-1.1.13-1.el6.x86_64
> >
> > kmod-microsoft-hyper-v-4.0.11-20150728.x86_64
> > microsoft-hyper-v-4.0.11-20150728.x86_64
> >
> > Configuration:
> >
> > totem {
> >     version: 2
> >
> >     crypto_cipher: none
> >     crypto_hash: none
> >     clear_node_high_bit: yes
> >     cluster_name: cluster
> >     transport: udpu
> >     token: 150000
> >
> >     interface {
> >         ringnumber: 0
> >         bindnetaddr: 10.200.0.2
> >         mcastport: 5405
> >         ttl: 1
> >     }
> > }
> >
> > nodelist {
> >     node {
> >         ring0_addr:  10.200.0.2
> >     }
> >
> >     node {
> >         ring0_addr:  10.200.0.3
> >     }
> > }
> >
> > logging {
> >     fileline: on
> >     to_stderr: no
> >     to_logfile: yes
> >     logfile: /var/log/cluster/corosync.log
> >     to_syslog: yes
> >     debug: off
> >     timestamp: on
> >     logger_subsys {
> >         subsys: QUORUM
> >         debug: off
> >     }
> > }
> >
> >
> > quorum {
> >     provider: corosync_votequorum
> >     two_node: 1
> > }
> >
> >
> >
> > Thank you for your help,
> > --
> >
> > Ludovic Zammit
> >
_______________________________________________
discuss mailing list
[email protected]
http://lists.corosync.org/mailman/listinfo/discuss
