Hi Steven
In fact, I first posted this question on the Pacemaker ML,
but there is no way in Pacemaker to increase this time, and
I think that is normal, since the "cluster manager" part is
provided by corosync, which manages the heartbeat.
My goal is to increase this time significantly, even to values
as high as 45s; that is not a problem for the applications I have
to manage, but 10s is a real problem for me in the case of a
network problem that leads to silence on the heartbeat for a while.
So, based on your experience, which parameters do you
think I can try to increase to get this 45s timeout?
Thanks a lot.
Regards
Alain
On Mon, 2010-05-17 at 08:25 +0200, Alain.Moulle wrote:
> Hi again,
>
> I've checked the corosync.conf man page and seen many parameters
> around the token timers etc., but I can't see how to increase the
> heartbeat timeout. When testing, it turns out the timeout is between
> 10s and 12s before a node decides to fence another node in the cluster
> (when, for example, I force an "ifdown eth0" on that node to simulate
> a heartbeat failure). But I can't see which parameter(s) to tune in
> corosync.conf to increase these 10 or 12s ...
>
> Any tip would be appreciated...
> Thanks
> Alain
Alain,
I don't have a direct answer to your question. Corosync detects a
failure of any node in "token" msec. I have not measured how long
qpid/fencing/pacemaker/rgmanager/gfs/ocfs/etc. take to act on this
notification; the delta between failure detection and recovery would
be a good question to ask on the Pacemaker ML.
In my test environments I run with token = 1000 msec. Totem can be
tuned to lower values, but under heavy network load it may falsely
detect a node failure.
Most products that use Corosync ship with a 10000 msec (10 sec) or
larger token value to minimize the chance of a false failure detection.
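For reference, the token timer lives in the totem section of
/etc/corosync/corosync.conf. A minimal sketch of such a shipped
default might look like the following (the interface values below
are just placeholders for illustration, not taken from your setup):

    totem {
        version: 2
        # Token timeout in milliseconds; a silent node is declared
        # failed roughly this long after it stops responding.
        token: 10000
        interface {
            ringnumber: 0
            bindnetaddr: 192.168.1.0
            mcastaddr: 226.94.1.1
            mcastport: 5405
        }
    }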
The token timer is just one consideration, however. The
"token_retransmits_before_loss_const" parameter defaults to 4, which
may be too low on lossy or heavily loaded networks. A higher value
produces a bit more protocol load but more resilient behavior.
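To aim at the ~45 second detection you mention, something along these
lines in the totem section would be a starting point (just a sketch;
the exact values are assumptions you would want to validate under your
own network conditions):

    totem {
        version: 2
        # ~45 seconds of heartbeat silence before a node is declared failed
        token: 45000
        # allow more retransmit attempts before the token is considered
        # lost, to better tolerate lossy or heavily loaded networks
        token_retransmits_before_loss_const: 10
    }

Keep in mind that fencing and resource recovery only start after this
detection, so the total time to recovery will be somewhat longer than
the token value.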
Regards
-steve
_______________________________________________
Openais mailing list
[email protected]
https://lists.linux-foundation.org/mailman/listinfo/openais