Rodrigo Borges Pereira wrote:
Hello,
I have a two-node cluster that occasionally behaves strangely. The
cluster runs a number of Xen VMs with virtual disk files on top of a DRBD
device. Every night each of the VMs is backed up via rsync/ssh.
Sometimes the load this generates causes hb to try to fail over. Then, for
some reason, it fails to do so and stays on the primary node. So all the
VMs shut down and then boot again on the same node.
OK, for what it's worth...
I've been doing a fair bit of work with DRBD under Xen, i.e. the domU is
running drbd.
I found that under high I/O load the DRBD subsystem would log errors
such as "PingAck did not arrive in time". Sometimes the nodes would lose
contact with one another and not automatically re-establish their link.
I tried about a bazillion different things to fix the problem:
low-level network configuration, various drbd configuration
options, timeouts, etc. Nothing worked.
There was one single thing which worked.
In the domU config you can set a rate limit on the virtual network
interface.
Setting this to 20MB/s fixed the problem. Yes, that is 20 megabytes per
second (MB/s), not megabits (Mb/s).
The config looks like this:
vif = [ 'rate=20MB/s, bridge=xenbr0' ]
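Since the byte/bit distinction is easy to get wrong, here is a minimal sketch of the arithmetic and of building the same vif entry. The helper names mbytes_to_mbits and vif_entry are made up for illustration; only the 'rate=..., bridge=...' string itself comes from the config line above.

```python
# Hypothetical helpers for illustration; not part of Xen or DRBD.

def mbytes_to_mbits(mbytes_per_s):
    """Convert megabytes per second to megabits per second (1 byte = 8 bits)."""
    return mbytes_per_s * 8


def vif_entry(rate_mbytes_per_s, bridge):
    """Build a domU vif string with a rate limit expressed in MB/s."""
    return "rate=%dMB/s, bridge=%s" % (rate_mbytes_per_s, bridge)


# Same setting as in the config line above.
vif = [vif_entry(20, "xenbr0")]

print(vif)                  # ['rate=20MB/s, bridge=xenbr0']
print(mbytes_to_mbits(20))  # 160
```

So the limit that worked here is roughly 160 Mb/s, which is eight times what you would get by writing 20Mb/s by mistake.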
Since I introduced this and rolled back all of my other optimisations
and tweaking, everything has been *fine* with drbd.
I'm pretty sure this has to do with timeout definitions, but what would be
the best locations to tune that?
Yeah, I thought my problem was a timeout issue too. I spent a lot of time
gradually increasing the drbd timeout values to insane levels, with no luck.
TIA,
Rodrigo
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems