Hi Steven,
I've given it a try:
setting token=45000 and token_retransmits_before_loss_const=45 also
requires setting consensus=54000 (at least 1.2 * token); otherwise corosync fails to start. With these values, when I run ifdown eth0 on one node, it actually takes around 98 s
for that node to appear OFFLINE in crm_mon on the healthy node, so I don't
quite understand what the formula is.

Thanks
Regards
Alain

token: 45000
token_retransmits_before_loss_const: 45
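A small sketch of the arithmetic in the message above, assuming only what is stated there: consensus must be at least 1.2 * token or corosync refuses to start. The helper names are hypothetical, and treating token + consensus as the detection delay is a guess to compare against the observed ~98 s, not a documented corosync formula.

```python
# Hypothetical helpers for reasoning about the totem timers above.
# Only the "consensus >= 1.2 * token" rule comes from the thread;
# summing token + consensus is an unconfirmed estimate.

def min_consensus_ms(token_ms: int) -> int:
    """Smallest consensus allowed for a given token (1.2 * token, rounded up)."""
    return (token_ms * 6 + 4) // 5  # integer math avoids float rounding

def check_totem(token_ms: int, consensus_ms: int) -> None:
    """Raise if consensus is below 1.2 * token, mirroring the startup failure."""
    if consensus_ms < min_consensus_ms(token_ms):
        raise ValueError(
            f"consensus={consensus_ms} must be >= 1.2 * token = {min_consensus_ms(token_ms)}"
        )

def estimated_detection_ms(token_ms: int, consensus_ms: int) -> int:
    """Rough estimate: token loss followed by a full consensus round."""
    check_totem(token_ms, consensus_ms)
    return token_ms + consensus_ms

if __name__ == "__main__":
    token, consensus = 45000, 54000
    print(f"{estimated_detection_ms(token, consensus) / 1000:.1f} s")
```

With token=45000 and consensus=54000 this gives 99 s, close to the ~98 s observed with crm_mon, which suggests (but does not prove) that both timers contribute to the delay.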

 On Wed, 2010-05-19 at 08:39 +0200, Alain.Moulle wrote:
Hi Steven,
In fact, I first posted this question on the Pacemaker ML,
but there is no way in Pacemaker to increase this time, and
I think that is normal, since the "cluster manager" part, which
manages the heartbeat, is provided by corosync. My goal is to increase this time considerably, even to values
as high as 45 s; that is not a problem for the applications I have to manage,
but 10 s is a real problem for me in case of a network problem that silences the heartbeat for a while. So, based on your experience, which parameters do you think I could increase to get this 45 s timeout?
Thanks a lot.
Regards
Alain
On Mon, 2010-05-17 at 08:25 +0200, Alain.Moulle wrote:
Hi again,

I've checked man corosync.conf and seen many parameters
around token timers etc., but I can't see how to increase the heartbeat
timeout. In testing, the timeout is between 10 s and 12 s
before a node decides to fence another one in the cluster (for
example, when I force an ifdown eth0 on that node to simulate a heartbeat failure).
But I can't see which parameter(s) to tune in corosync.conf to increase
these 10 or 12 s...

Any tip would be appreciated...
Thanks
Alain
Alain,

I don't have a direct answer to your question.  Corosync detects a
failure of any node in "token" msec.  I have not measured how long
qpid/fencing/pacemaker/rgmanager/gfs/ocfs/etc take to operate on this
notification.  This delta between failure detection and recovery would
be a good question to potentially ask on the pacemaker ml.

In my test environments I run with token = 1000 msec.  Totem can be tuned
to lower values, but under heavy network load it may falsely detect a
node failure.

Most products that use Corosync ship with a token value of 10000 msec (10 sec)
or larger to minimize the chance of falsely detecting a node failure.

The token timer is just one consideration, however.  The
"token_retransmits_before_loss_const" defaults to 4.  This may be too
low in lossy or heavy load networks.  A higher value for this
configuration produces a bit more load but more resilient behavior.
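To make the load trade-off concrete, here is a sketch assuming the corosync 1.x corosync.conf(5) behavior as I recall it: when token_retransmits_before_loss_const is set, token_retransmit is derived as token / (token_retransmits_before_loss_const + 0.2). That formula is an assumption to verify against your version's man page; it is consistent with the documented 238 ms default for token=1000 and a constant of 4.

```python
# Assumed derivation (verify against corosync.conf(5) for your version):
# token_retransmit = token / (token_retransmits_before_loss_const + 0.2)

def token_retransmit_ms(token_ms: int, retransmits_before_loss: int) -> float:
    """Interval between token retransmissions within one token timeout."""
    return token_ms / (retransmits_before_loss + 0.2)

if __name__ == "__main__":
    # Defaults, a typical product token, and the values tried in this thread.
    for token, const in [(1000, 4), (10000, 4), (45000, 45)]:
        interval = token_retransmit_ms(token, const)
        print(f"token={token} const={const} -> retransmit every {interval:.1f} ms")
```

A higher constant packs more retransmission attempts into the same token window (shorter interval, slightly more traffic), which is why it tolerates lossy or loaded networks better without lengthening the token timeout itself.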

Regards
-steve


_______________________________________________
Openais mailing list
[email protected]
https://lists.linux-foundation.org/mailman/listinfo/openais



