Hello Lars,

On Wednesday 09 April 2008 18:34:39 Lars Marowsky-Bree wrote:
> On 2008-04-08T19:32:58, Bernd Schubert <[EMAIL PROTECTED]> wrote:
> > Hello,
> >
> > I need to set a rather huge dead time of 1200s, but the initial dead time
> > is supposed to be of 120s or less. However, heartbeat tries to be
> > schoolmasterly and doesn't want to accept my settings:
> >
> > deadtime 1200 # time to declare a node dead
> > initdead 120 # time to declare a node dead on heartbeat startup
> > keepalive 120 # how often to send keepalive packets
>
> Algorithmic reasons require that initdead be larger than deadtime.

Which algorithm is this, and where can I find it in the sources?

> keepalive every two minutes and deadtime at 20 minutes is exceptional.
>
> Not even Lustre should create a load so high that a realtime priority
> thread which is entirely locked into memory is not reliably scheduled
> for 20 minutes at a stretch!

Actually, I don't have the slightest idea what the correct value is.
However, 120s is not sufficient, and a hard shutdown of a server presently
triggers terrible hardware bugs. We simply cannot afford any false resets.

> (I'm not quite sure I'd consider that "HA" ... ;-)

High failover times are not nice, of course, but this is not life-critical
HA.

> This needs to be fixed within Lustre.

Yes, sure.

> > Well, heartbeat is not started up automatically here and even the nodes
> > are not powered on automatically after a hard reset. So when I start
> > heartbeat I'm actively monitoring everything and there is absolutely no
> > need to let me wait at least 20min on start up. I'm not even convinced a
> > deadtime of 20min is sufficient, since this is for a Lustre cluster and
> > Lustre sometimes manages to create such a high load that nothing else
> > than the Lustre and related kernel threads do work on the system...
>
> A deadtime of 20m is not sufficient, but you worry about 20m on startup?

Yes, because at startup I sit in front of my system and just wait for
heartbeat to finish to start the services. I still think there is another
bug in heartbeat, though. There is simply no reason for heartbeat to wait
$deadtime on initial startup of the heartbeat services when it knows all
heartbeat nodes are up. If I could at least manually force it to bring the
nodes online, I would have no problem with an initial-deadtime == deadtime.

> You're quite aware that deadtime is the time you should expect to be w/o
> service in case one node crashes, right?

Yes, and I'm also quite aware that a false shutdown may cause a service
downtime of several days.
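For reference, the constraint discussed above can be checked against the
quoted settings with a small script. This is only a sketch under stated
assumptions: it is not heartbeat's own validation code, the names
parse_timings and check_timings are hypothetical, it treats
initdead == deadtime as acceptable, it only understands plain integer
values in seconds, and the keepalive check is an added sanity test that
does not come from this thread.

```python
import re

def parse_timings(text):
    """Collect 'name value' pairs (plain seconds) from ha.cf-style lines."""
    timings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        m = re.fullmatch(r"(deadtime|initdead|keepalive)\s+(\d+)", line)
        if m:
            timings[m.group(1)] = int(m.group(2))
    return timings

def check_timings(t):
    """Return a list of complaints; an empty list means none were found."""
    problems = []
    # Rule as stated in the thread; equality is assumed to be allowed.
    if t["initdead"] < t["deadtime"]:
        problems.append("initdead is smaller than deadtime")
    # Assumption, not from the thread: a node needs at least one keepalive
    # interval inside deadtime, or it would always look dead.
    if t["keepalive"] >= t["deadtime"]:
        problems.append("keepalive is not smaller than deadtime")
    return problems

conf = """
deadtime 1200   # time to declare a node dead
initdead 120    # time to declare a node dead on heartbeat startup
keepalive 120   # how often to send keepalive packets
"""

print(check_timings(parse_timings(conf)))
# prints ['initdead is smaller than deadtime']
```

Raising initdead to 1200 in the same fragment makes the sketch return an
empty list, which matches the initial-deadtime == deadtime case mentioned
above.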
Thanks,
Bernd

-- 
Bernd Schubert
Q-Leap Networks GmbH
