On 06/02/2010 01:19 AM, Alain.Moulle wrote:
> Hi Steven,
>
> Do you have a formula to calculate the timeout with regard to the
> token, token_retransmits_before_loss_const, and consensus values?
>
I recommend:

token = the length of time needed to detect a failed node (in msec)
consensus = 2.2 * token
join = 150 msec
token_retransmits_before_loss_const = 25 for heavily loaded networks,
4-5 for lightly loaded networks

(A sketch of a totem section with these values plugged in for a ~45s
target is included at the end of this message.)

Regards
-steve

> and is there any risk on corosync behavior, stability, etc. if we
> increase this time to around 45s / 60s ?

No risk to stability, although failure detection time will be long.

> Does anybody have experience with this?
>
> Thanks
> Regards
> Alain

>> Hi Steven,
>> I've given it a try:
>> the values token=45000 and token_retransmits_before_loss_const=45
>> also require setting consensus=54000 (at least 1.2 * token),
>> otherwise corosync fails to start. With these values, when I do
>> ifdown eth0 on one node, it actually takes around 98s for that node
>> to appear OFFLINE in crm_mon on the healthy node, so I don't know
>> exactly what the formula is.
>>
>> Thanks
>> Regards
>> Alain
>>
>> token: 45000
>> token_retransmits_before_loss_const: 45
>>
>> On Wed, 2010-05-19 at 08:39 +0200, Alain.Moulle wrote:
>>
>> Hi Steven,
>> in fact, I first posted this question on the Pacemaker ML, but
>> there is no way in Pacemaker to increase this time, and I think
>> that is normal, as the "cluster manager" part is provided by
>> corosync, which manages the heartbeat. My concern is to increase
>> this time substantially; even values as high as 45s are not a
>> problem for the applications I have to manage, but 10s is a real
>> problem for me in case of a network problem that leads to silence
>> on the heartbeat for a while. So, based on your experience, which
>> parameters do you think I should increase to get this 45s timeout?
>>
>> Thanks a lot.
>> Regards
>> Alain
>>
>> On Mon, 2010-05-17 at 08:25 +0200, Alain.Moulle wrote:
>>
>> Hi again,
>>
>> I've checked the corosync.conf man page and seen many parameters
>> around token timers etc., but I can't see how to increase the
>> heartbeat timeout. When testing, it turns out that the timeout is
>> between 10s and 12s before a node decides to fence another one in
>> the cluster (when, for example, I force an ifdown eth0 on that node
>> to simulate a heartbeat failure). But I can't see which
>> parameter(s) to tune in corosync.conf to increase these 10 or 12s...
>>
>> Any tip would be appreciated...
>> Thanks
>> Alain
>>
>> Alain,
>>
>> I don't have a direct answer to your question. Corosync detects a
>> failure of any node in "token" msec. I have not measured how long
>> qpid/fencing/pacemaker/rgmanager/gfs/ocfs/etc take to operate on
>> this notification. This delta between failure detection and
>> recovery would be a good question to ask on the Pacemaker ML.
>>
>> In my test environments I run at token = 1000 msec. Totem can be
>> tuned to lower values, but under a heavy network load it may
>> falsely detect a node failure.
>>
>> Most products that use Corosync ship with a 10000 msec (10 sec) or
>> larger token value to offer the least chance of falsely detecting
>> a node failure.
>>
>> The token timer is just one consideration, however.
>> "token_retransmits_before_loss_const" defaults to 4. This may be
>> too low on lossy or heavily loaded networks. A higher value for
>> this parameter produces a bit more load but more resilient
>> behavior.
>>
>> Regards
>> -steve
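
For illustration, here is a minimal sketch of the timing-related part
of a corosync.conf totem section with the recommendation above plugged
in, assuming a failure-detection target of roughly 45 seconds. The
concrete numbers are just that assumed target worked through the
formulas above, not values taken from a running cluster, and the rest
of the totem section (version, interface, and so on) is left as in the
existing configuration:

totem {
        # token = desired failure-detection time, in msec
        # (assumed target here: ~45 seconds)
        token: 45000

        # consensus = 2.2 * token = 2.2 * 45000 = 99000 msec
        # (per the thread above, corosync refuses to start if this
        # is below 1.2 * token)
        consensus: 99000

        # join timeout in msec, per the recommendation above
        join: 150

        # 4-5 for lightly loaded networks, around 25 for heavily
        # loaded ones
        token_retransmits_before_loss_const: 5
}

With these values, corosync itself should detect a failed node in
roughly the token timeout (~45s); whatever pacemaker/crm_mon add on
top of that before the node shows as OFFLINE is the separate
detection-to-recovery delta discussed earlier in the thread.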
