Hi,

On Wed, Apr 09, 2008 at 08:26:02PM +0200, Bernd Schubert wrote:
> Hello Lars,
>
> On Wednesday 09 April 2008 18:34:39 Lars Marowsky-Bree wrote:
> > On 2008-04-08T19:32:58, Bernd Schubert <[EMAIL PROTECTED]> wrote:
> > > Hello,
> > >
> > > I need to set a rather huge dead time of 1200s, but the initial dead time
> > > is supposed to be of 120s or less. However, heartbeat tries to be
> > > schoolmasterly and doesn't want to accept my settings:
> > >
> > > deadtime 1200 # time to declare a node dead
> > > initdead 120 # time to declare a node dead on heartbeat startup
> > > keepalive 120 # how often to send keepalive packets
> >
> > Algorithmic reasons require that initdead be larger than deadtime.
>
> which algorithm are these and where can I find it in the sources?

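To make the rule concrete against the three settings quoted above, here is a
minimal sketch in Python. It is not heartbeat's source code: the function
names parse_timings and startup_margin are hypothetical, values are assumed to
be plain integer seconds, and treating initdead == deadtime as acceptable is
also an assumption.

```python
# Sketch only: NOT heartbeat's source code. All names here are hypothetical.

def parse_timings(text):
    """Collect 'name value # comment' lines into a dict of integer seconds."""
    timings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop the trailing comment
        if not line:
            continue
        fields = line.split()
        # Assumption: only plain integer values (seconds) are handled.
        if len(fields) == 2 and fields[1].isdigit():
            timings[fields[0]] = int(fields[1])
    return timings


def startup_margin(timings):
    """Return initdead - deadtime; raise if initdead is the smaller value."""
    deadtime = timings["deadtime"]
    initdead = timings["initdead"]
    if initdead < deadtime:
        raise ValueError(
            f"initdead ({initdead}s) is smaller than deadtime ({deadtime}s)"
        )
    return initdead - deadtime


# The settings exactly as quoted in this thread.
QUOTED = """
deadtime 1200   # time to declare a node dead
initdead 120    # time to declare a node dead on heartbeat startup
keepalive 120   # how often to send keepalive packets
"""

if __name__ == "__main__":
    try:
        startup_margin(parse_timings(QUOTED))
    except ValueError as err:
        print("rejected:", err)
```

With the quoted values the check fails, because 120 is smaller than 1200.
Raising initdead to 1320 would pass and leave a margin of 120s on top of
deadtime.
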
Try to imagine what happens if one node is rebooted: it would
initially wait for initdead time before considering the other node
down. That would obviously be a violation of what you have given in
deadtime.

Lars insisted, and I now fully agree, that this is a configuration
error, so I'll revert the patch which I committed yesterday.

There would probably have been a bit less confusion if, instead of
initdead, there were a directive bootdelay or startdelay, i.e. a
parameter with which one could specify how long it takes for the
network to become functional after booting (initdead - deadtime =
bootdelay).

Thanks,

Dejan

> > keepalive every two minutes and deadtime at 20 minutes is exceptional.
> >
> > Not even Lustre should create a load so high that a realtime priority
> > thread which is entirely locked into memory is not reliably scheduled
> > for 20 minutes at a stretch!
>
> Actually, I don't have the slightest idea which is the correct value.
> However, 120s is not sufficient and hard shutdown of a server presently
> triggers terrible hardware bugs. We can simply not afford any false resets.
>
> > (I'm not quite sure I'd consider that "HA" ... ;-)
>
> High failover times are not nice of course, but this is not life critical HA.
>
> > This needs to be fixed within Lustre.
>
> Yes, sure.
>
> > > Well, heartbeat is not started up automatically here and even the nodes
> > > are not powered on automatically after a hard reset. So when I start
> > > heartbeat I'm actively monitoring everything and there is absolutely no
> > > need to let me wait at least 20min on start up. I'm even not convinced a
> > > deadtime of 20min is sufficient, since this is for a Lustre cluster and
> > > Lustre sometimes manages to create such a high load that nothing else
> > > than the Lustre and related kernel threads do work on the system...
> >
> > A deadtime of 20m is not sufficient, but you worry about 20m on startup?
>
> Yes, because I sit at startup in front of my system and just wait for
> heartbeat to finish to start the services.
> I still think there is another bug in heartbeat, though. There is simply no
> reason for heartbeat to wait $deadtime on initial startup of the heartbeat
> services, when it knows all heartbeat nodes are up.
> If I at least could manually force it to online the nodes, I would have no
> problem with an initial-deadtime == deadtime.
>
> > You're quite aware that deadtime is the time you should expect to be w/o
> > service in case one node crashes, right?
>
> Yes, and I'm also quite aware that a false shutdown may cause a service down
> time of several days.
>
> Thanks,
> Bernd
>
> --
> Bernd Schubert
> Q-Leap Networks GmbH
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
