Hi,

I’ve been researching the Corosync communication layer and would like to 
understand how to calculate the total failure timeout for a host. From what 
I’ve gathered, the relevant parameters include the base token (defined in 
corosync.conf), the runtime token timeout (runtime.config.totem.token), as well 
as token_retransmit, token_retransmits_before_loss_const, and consensus. Could 
you please clarify how these values contribute to the overall failure detection 
time?

Here is my current understanding:

runtime.config.totem.token = base token + (number of nodes - 2) * 
    token_coefficient

Total failure detection time = runtime.config.totem.token + 
    (token_retransmit * token_retransmits_before_loss_const)

consensus = 1.2 * runtime.config.totem.token

For example, with 3 servers:
base token (from corosync.conf) = 2000 ms
token_coefficient = 650 ms
runtime.config.totem.token = 2000 + (3 - 2) * 650 = 2650 ms
token_retransmit = 1000 ms
token_retransmits_before_loss_const = 4
consensus = 1.2 * 2650 = 3180 ms
Are those values correct?
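
To make the arithmetic explicit, here is a small Python sketch of the 
formulas exactly as I have written them above. The function and variable 
names are my own and are not part of Corosync, and the "proposed total" is 
only my hypothesis about how the values combine, not confirmed behavior.

```python
# Sketch of the formulas from this email. All helper names are hypothetical
# and chosen for illustration; none of them come from Corosync itself.

def runtime_token_ms(base_token_ms, nodes, token_coefficient_ms):
    """Effective token timeout: base token plus the coefficient per node beyond two."""
    # Assumption: the coefficient never contributes a negative amount,
    # so clusters with fewer than 3 nodes just use the base token.
    extra_nodes = max(nodes - 2, 0)
    return base_token_ms + extra_nodes * token_coefficient_ms


def consensus_ms(runtime_token):
    """Consensus as 1.2 * runtime token, in integer math to avoid float rounding."""
    return runtime_token * 12 // 10


def proposed_total_ms(runtime_token, token_retransmit_ms, retransmits_before_loss):
    """My proposed (unconfirmed) total: token timeout plus all retransmit intervals."""
    return runtime_token + token_retransmit_ms * retransmits_before_loss


# Values from my 3-server example.
token = runtime_token_ms(base_token_ms=2000, nodes=3, token_coefficient_ms=650)
consensus = consensus_ms(token)
total = proposed_total_ms(token, token_retransmit_ms=1000, retransmits_before_loss=4)

print(token, consensus, total)  # prints: 2650 3180 6650
```

Whether the retransmit intervals should be added on top of the token 
timeout at all is exactly what I am unsure about.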

For example, suppose Server 2 goes down and the real token timeout 
(runtime.config.totem.token) is 2650 ms. The token is then retransmitted 4 
times at 1000 ms intervals, for a total of 4000 ms. Added together, that gives 
a total failure timeout of 6650 ms before the node is declared failed. Is that 
correct?

If so, how does the consensus timeout fit in? After the 6650 ms timeout, the 
node is declared down. Does the system then need to remove the node within the 
3180 ms consensus timeout? Is there no grace period in Corosync? Is my 
analysis correct? Thank you!

Best regards,

Vicki Chen

