On 06/02/2010 01:19 AM, Alain.Moulle wrote:
> Hi Steven,
>
> Do you have a formula to calculate the timeout with regard to the
> token, token_retransmits_before_loss_const, and consensus values?
>
I recommend:

token = the length of time needed to detect a failed node (in msec)
consensus = 2.2 * token
join = 150 msec
token_retransmits_before_loss_const = 25 for heavily loaded networks,
4-5 for lightly loaded networks

(A sketch of a totem section with these values plugged in for a ~45s
target is included at the end of this message.)

Regards
-steve

> and is there any risk on corosync behavior, stability, etc. if we
> increase this time to around 45s / 60s ?

No risk to stability, although failure detection time will be long.

> Does anybody have experience with this?
>
> Thanks
> Regards
> Alain

>> Hi Steven,
>> I've given it a try:
>> the values token=45000 and token_retransmits_before_loss_const=45
>> also require setting consensus=54000 (at least 1.2 * token),
>> otherwise corosync fails to start. With these values, when I do
>> ifdown eth0 on one node, it actually takes around 98s for that node
>> to appear OFFLINE in crm_mon on the healthy node, so I don't know
>> exactly what the formula is.
>>
>> Thanks
>> Regards
>> Alain
>>
>> token: 45000
>> token_retransmits_before_loss_const: 45
>>
>> On Wed, 2010-05-19 at 08:39 +0200, Alain.Moulle wrote:
>>
>> Hi Steven,
>> in fact, I first posted this question on the Pacemaker ML, but
>> there is no way in Pacemaker to increase this time, and I think
>> that is normal, as the "cluster manager" part is provided by
>> corosync, which manages the heartbeat. My concern is to increase
>> this time substantially; even values as high as 45s are not a
>> problem for the applications I have to manage, but 10s is a real
>> problem for me in case of a network problem that leads to silence
>> on the heartbeat for a while. So, based on your experience, which
>> parameters do you think I should increase to get this 45s timeout?
>>
>> Thanks a lot.
>> Regards
>> Alain
>>
>> On Mon, 2010-05-17 at 08:25 +0200, Alain.Moulle wrote:
>>
>> Hi again,
>>
>> I've checked the corosync.conf man page and seen many parameters
>> around token timers etc., but I can't see how to increase the
>> heartbeat timeout. When testing, it turns out that the timeout is
>> between 10s and 12s before a node decides to fence another one in
>> the cluster (when, for example, I force an ifdown eth0 on that node
>> to simulate a heartbeat failure). But I can't see which
>> parameter(s) to tune in corosync.conf to increase these 10 or 12s...
>>
>> Any tip would be appreciated...
>> Thanks
>> Alain
>>
>> Alain,
>>
>> I don't have a direct answer to your question. Corosync detects a
>> failure of any node in "token" msec. I have not measured how long
>> qpid/fencing/pacemaker/rgmanager/gfs/ocfs/etc take to operate on
>> this notification. This delta between failure detection and
>> recovery would be a good question to ask on the Pacemaker ML.
>>
>> In my test environments I run at token = 1000 msec. Totem can be
>> tuned to lower values, but under a heavy network load it may
>> falsely detect a node failure.
>>
>> Most products that use Corosync ship with a 10000 msec (10 sec) or
>> larger token value to offer the least chance of falsely detecting
>> a node failure.
>>
>> The token timer is just one consideration, however.
>> "token_retransmits_before_loss_const" defaults to 4. This may be
>> too low on lossy or heavily loaded networks. A higher value for
>> this parameter produces a bit more load but more resilient
>> behavior.
>>
>> Regards
>> -steve
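
For illustration, here is a minimal sketch of the timing-related part
of a corosync.conf totem section with the recommendation above plugged
in, assuming a failure-detection target of roughly 45 seconds. The
concrete numbers are just that assumed target worked through the
formulas above, not values taken from a running cluster, and the rest
of the totem section (version, interface, and so on) is left as in the
existing configuration:

totem {
        # token = desired failure-detection time, in msec
        # (assumed target here: ~45 seconds)
        token: 45000

        # consensus = 2.2 * token = 2.2 * 45000 = 99000 msec
        # (per the thread above, corosync refuses to start if this
        # is below 1.2 * token)
        consensus: 99000

        # join timeout in msec, per the recommendation above
        join: 150

        # 4-5 for lightly loaded networks, around 25 for heavily
        # loaded ones
        token_retransmits_before_loss_const: 5
}

With these values, corosync itself should detect a failed node in
roughly the token timeout (~45s); whatever pacemaker/crm_mon add on
top of that before the node shows as OFFLINE is the separate
detection-to-recovery delta discussed earlier in the thread.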
