Hi, I’ve been researching the Corosync communication layer and would like to understand how to calculate the total failure timeout for a host. From what I’ve gathered, the relevant parameters are the base token (defined in corosync.conf), the runtime token timeout (runtime.config.totem.token), token_retransmit, token_retransmits_before_loss_const, and consensus. Could you please clarify how these values contribute to the overall failure detection time?
My current understanding is:

  runtime.config.totem.token = base token + (number of nodes - 2) * token_coefficient
  Total failure detection time = runtime.config.totem.token + (token_retransmit * token_retransmits_before_loss_const)
  consensus = 1.2 * runtime.config.totem.token

For example, with 3 servers:

  base token (from corosync.conf)      = 2000 ms
  token_coefficient                    = 650 ms
  runtime.config.totem.token           = 2650 ms
  token_retransmit                     = 1000 ms
  token_retransmits_before_loss_const  = 4
  consensus                            = 3180 ms

Are those values correct?

For example, if Server 2 goes down and the real token timeout (runtime.config.totem.token) is 2650 ms, the token is retransmitted 4 times at 1000 ms intervals, for a total of 4000 ms. Added together, the total failure timeout is 6650 ms before the node is declared failed. Is that correct?

Then how does the consensus timeout work? After the 6650 ms timeout, the node is declared down. Does the system need to remove the node within the 3180 ms consensus timeout? Is there no grace period in Corosync?

Is my analysis correct?

Thank you!

Best regards,
Vicki Chen
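
P.S. To make the arithmetic above easy to check, here is a minimal Python sketch of the calculation. The function names are hypothetical and only for illustration. The "assumed total" formula in particular is just my own assumption about how the retransmit values add up, which is exactly what I am asking about; it is not taken from the Corosync documentation.

```python
# Sketch of the timeout arithmetic from my question.
# All values are in milliseconds. Function names are hypothetical.

def runtime_token(base_token_ms, nodes, token_coefficient_ms):
    """Runtime token timeout: base token plus a per-node increment
    for every node beyond the second."""
    extra_nodes = max(nodes - 2, 0)
    return base_token_ms + extra_nodes * token_coefficient_ms


def consensus(runtime_token_ms):
    """Consensus timeout as 1.2 * runtime token.
    Integer arithmetic avoids floating-point rounding surprises."""
    return runtime_token_ms * 12 // 10


def assumed_total_timeout(runtime_token_ms, token_retransmit_ms,
                          retransmits_before_loss_const):
    """ASSUMPTION (the formula I am asking about, possibly wrong):
    token timeout plus all retransmit intervals added on top."""
    return runtime_token_ms + token_retransmit_ms * retransmits_before_loss_const


if __name__ == "__main__":
    token = runtime_token(base_token_ms=2000, nodes=3, token_coefficient_ms=650)
    print("runtime token:", token)
    print("consensus:", consensus(token))
    print("assumed total:", assumed_total_timeout(token, 1000, 4))
```

With the example values (3 nodes, base token 2000 ms, token_coefficient 650 ms) this gives the 2650 ms, 3180 ms, and 6650 ms figures quoted above.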