Hello,
I'm running a CentOS 6.7 cluster of 2 nodes on a Hyper-V hypervisor.
Every day at 11PM a snapshot job saves both servers.
The snapshotting process seems to cause a loss of connectivity between the two
nodes, which results in the cluster partitioning and pacemaker starting services
on both nodes.
Then once the snapshotting is done, the two halves of the cluster are able to
see each other again and pacemaker chooses one on which to run the services.
Unfortunately, that means our DRBD partition has been mounted on both nodes, so
it now goes into "split brain" mode.
When I was running corosync 1.4, I used to increase the "token" variable in the
configuration file so that both nodes would wait longer before declaring the
other node lost.
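From memory, the relevant part of the old corosync 1.x totem section looked
roughly like this (the exact value is a guess on my part, only meant as an
illustration):

totem {
    version: 2
    # raised well above the 1000 ms default so a short VM pause
    # would not be treated as a node failure
    token: 60000
}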
Now that I have upgraded to corosync 2 (2.3.5 to be more precise), the problem
is back with a vengeance.
I have tried the configuration below, with a very high token value, and that
resulted in the following errors (I have since reverted that change):
Dec 21 08:59:13 [16696] node1 corosync notice [TOTEM ] totemsrp.c:783 Process pause detected for 3464149 ms, flushing membership messages.
Dec 21 08:59:13 [16696] node1 corosync notice [TOTEM ] totemsrp.c:783 Process pause detected for 3464149 ms, flushing membership messages.
Dec 21 08:59:13 [16696] node1 corosync notice [TOTEM ] totemsrp.c:783 Process pause detected for 3464199 ms, flushing membership messages.
What can I do to prevent the cluster from splitting apart during those nightly
snapshots?
How do I manually set a long totem timeout without breaking everything else?
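For example, would something along these lines be sane? This is just a guess on
my part; I'm assuming from the corosync.conf(5) man page that consensus must be
at least 1.2 * token, so it would have to be raised along with it. Only the
relevant lines of the totem section are shown:

totem {
    # guess: long enough to ride out the snapshot pause
    token: 300000
    # corosync.conf(5) says consensus must be at least 1.2 * token
    consensus: 360000
}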
======================================================================
Software version:
2.6.32-573.7.1.el6.x86_64
corosync-2.3.5-1.el6.x86_64
corosynclib-2.3.5-1.el6.x86_64
pacemaker-cluster-libs-1.1.13-1.el6.x86_64
pacemaker-cli-1.1.13-1.el6.x86_64
kmod-microsoft-hyper-v-4.0.11-20150728.x86_64
microsoft-hyper-v-4.0.11-20150728.x86_64
Configuration:
totem {
    version: 2
    crypto_cipher: none
    crypto_hash: none
    clear_node_high_bit: yes
    cluster_name: cluster
    transport: udpu
    token: 150000
    interface {
        ringnumber: 0
        bindnetaddr: 10.200.0.2
        mcastport: 5405
        ttl: 1
    }
}

nodelist {
    node {
        ring0_addr: 10.200.0.2
    }
    node {
        ring0_addr: 10.200.0.3
    }
}

logging {
    fileline: on
    to_stderr: no
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
    to_syslog: yes
    debug: off
    timestamp: on
    logger_subsys {
        subsys: QUORUM
        debug: off
    }
}

quorum {
    provider: corosync_votequorum
    two_node: 1
}
Thank you for your help,
—
Ludovic Zammit