Hello,

I'm running a CentOS 6.7 cluster of 2 nodes on a Hyper-V hypervisor.
Every day at 11 PM a snapshot job saves both servers.
The snapshotting process seems to cause a loss of connectivity between the two 
nodes, which results in the cluster partitioning and Pacemaker starting 
services on both nodes.
Then, once the snapshotting is done, the two halves of the cluster can see 
each other again and Pacemaker chooses one node on which to run the services.
Unfortunately, that means our DRBD partition has been mounted on both nodes, 
so it now goes into "split brain" mode.
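
For what it's worth, the manual recovery I end up running each morning looks 
roughly like this. It is only a sketch: the resource name "r0" stands in for 
ours, and the syntax is the DRBD 8.4 form (8.3 wants 
"drbdadm -- --discard-my-data connect r0" instead):

# On the node whose changes we throw away:
drbdadm disconnect r0
drbdadm secondary r0
drbdadm connect --discard-my-data r0

# On the surviving node (only needed if it dropped to StandAlone):
drbdadm disconnect r0
drbdadm connect r0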


When I was running corosync 1.4, I used to adjust the "token" variable in the 
configuration file so that both nodes would wait longer before declaring the 
other one lost.
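
From memory, the corosync 1.4 fragment looked roughly like this (the exact 
value varied while I tuned it):

totem {
    version: 2
    # 150 s without a token before the other node is declared lost
    token: 150000
}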

Now that I have upgraded to corosync 2 (2.3.5, to be precise), the problem is 
back with a vengeance.

I have tried the configuration below, with a very high token value, and that 
resulted in the following errors (I have since reverted that change):

Dec 21 08:59:13 [16696] node1 corosync notice  [TOTEM ] totemsrp.c:783 Process pause detected for 3464149 ms, flushing membership messages.
Dec 21 08:59:13 [16696] node1 corosync notice  [TOTEM ] totemsrp.c:783 Process pause detected for 3464149 ms, flushing membership messages.
Dec 21 08:59:13 [16696] node1 corosync notice  [TOTEM ] totemsrp.c:783 Process pause detected for 3464199 ms, flushing membership messages.


What can I do to prevent the cluster from splitting apart during those nightly 
snapshots?
How do I manually set a long token timeout without breaking everything else?
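
For reference, what I have in mind is something like the fragment below. The 
consensus line is my assumption: as I understand it, consensus defaults to 
1.2 * token and must stay larger than token, so I suppose it needs to be 
raised along with it:

totem {
    # 30 s without the token before the other node is declared lost
    token: 30000
    # my assumption: keep consensus at about 1.2 * token
    consensus: 36000
}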




======================================================================

Software versions:
kernel-2.6.32-573.7.1.el6.x86_64

corosync-2.3.5-1.el6.x86_64
corosynclib-2.3.5-1.el6.x86_64

pacemaker-cluster-libs-1.1.13-1.el6.x86_64
pacemaker-cli-1.1.13-1.el6.x86_64

kmod-microsoft-hyper-v-4.0.11-20150728.x86_64
microsoft-hyper-v-4.0.11-20150728.x86_64

Configuration:

totem {
    version: 2

    crypto_cipher: none
    crypto_hash: none
    clear_node_high_bit: yes
    cluster_name: cluster
    transport: udpu
    token: 150000

    interface {
        ringnumber: 0
        bindnetaddr: 10.200.0.2
        mcastport: 5405
        ttl: 1
    }
}

nodelist {
    node {
        ring0_addr:  10.200.0.2
    }

    node {
        ring0_addr:  10.200.0.3
    }
}

logging {
    fileline: on
    to_stderr: no
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
    to_syslog: yes
    debug: off
    timestamp: on
    logger_subsys {
        subsys: QUORUM
        debug: off
    }
}


quorum {
    provider: corosync_votequorum
    two_node: 1
}



Thank you for your help,
--
Ludovic Zammit