Re: [Linux-HA] Initial dead time is smaller than deadtime

Dejan Muhamedagic Thu, 10 Apr 2008 09:44:00 -0700

Hi,

On Wed, Apr 09, 2008 at 08:26:02PM +0200, Bernd Schubert wrote:
> Hello Lars,
> 
> On Wednesday 09 April 2008 18:34:39 Lars Marowsky-Bree wrote:
> > On 2008-04-08T19:32:58, Bernd Schubert <[EMAIL PROTECTED]> wrote:
> > > Hello,
> > >
> > > I need to set a rather huge dead time of 1200s, but the initial dead time
> > > is supposed to be of 120s or less. However, heartbeat tries to be
> > > schoolmasterly and doesn't want to accept my settings:
> > >
> > > deadtime 1200 # time to declare a node dead
> > > initdead 120  # time to declare a node dead on heartbeat startup
> > > keepalive 120 # how often to send keepalive packets
> >
> > Algorithmic reasons require that initdead be larger than deadtime.
> 
> Which algorithm is this, and where can I find it in the sources?


Try to imagine what happens if one node is rebooted: with initdead
smaller than deadtime, it would wait only for the initdead time
before considering the other node down. That would obviously be a
violation of what you have given in deadtime.

Lars insisted, and I now fully agree, that this is a
configuration error so I'll revert the patch which I committed
yesterday.

There would probably have been a bit less confusion if instead of
initdead there were a directive bootdelay or startdelay, i.e. a
parameter with which one could specify how long it takes for the
network to become functional after booting (initdead - deadtime =
bootdelay).
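
To make the arithmetic concrete, here is a minimal sketch in Python. It
is not heartbeat code: the function name check_timing is hypothetical,
and the rule it enforces (initdead must not be smaller than deadtime) is
only the one stated in this thread.

```python
def check_timing(deadtime: int, initdead: int) -> int:
    """Return the implied boot delay (initdead - deadtime) in seconds.

    Hypothetical helper, not part of heartbeat. It rejects the
    configuration discussed in this thread, where initdead is
    smaller than deadtime.
    """
    if initdead < deadtime:
        raise ValueError(
            f"initdead ({initdead}s) is smaller than deadtime ({deadtime}s)"
        )
    return initdead - deadtime
```

With deadtime 1200 and a network that needs about two minutes to come
up after boot, initdead would be 1320 and check_timing(1200, 1320)
returns 120. The values from the original report (deadtime 1200,
initdead 120) are rejected.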

Thanks,

Dejan

> > keepalive every two minutes and deadtime at 20 minutes is exceptional.
> >
> > Not even Lustre should create a load so high that a realtime priority
> > thread which is entirely locked into memory is not reliably scheduled
> > for 20 minutes at a stretch!
> 
> Actually, I don't have the slightest idea which is the correct value.
> However, 120s is not sufficient and hard shutdown of a server presently
> triggers terrible hardware bugs. We can simply not afford any false resets.
> 
> >
> > (I'm not quite sure I'd consider that "HA" ... ;-)
> 
> High failover times are not nice of course, but this is not life critical HA.
> 
> >
> > This needs to be fixed within Lustre.
> 
> Yes, sure. 
> 
> >
> > > Well, heartbeat is not started up automatically here and even the nodes
> > > are not powered on automatically after a hard reset. So when I start
> > > heartbeat I'm actively monitoring everything and there is absolutely no
> > > need to let me wait at least 20min on start up. I'm even not convinced a
> > > deadtime of 20min is sufficient, since this is for a Lustre cluster and
> > > Lustre sometimes manages to create such a high load that nothing else
> > > than the Lustre and related kernel threads do work on the system...
> >
> > A deadtime of 20m is not sufficient, but you worry about 20m on startup?
> 
> Yes, because I sit at startup in front of my system and just wait for 
> heartbeat to finish to start the services. 
> I still think there is another bug in heartbeat, though. There is simply no
> reason for heartbeat to wait $deadtime on initial startup of the heartbeat
> services, when it knows all heartbeat nodes are up.
> If I at least could manually force it to online the nodes, I would have no 
> problem with an initial-deadtime == deadtime.
> 
> >
> > You're quite aware that deadtime is the time you should expect to be w/o
> > service in case one node crashes, right?
> 
> Yes, and I'm also quite aware that a false shutdown may cause a service down 
> time of several days.
> 
> Thanks,
> Bernd
> 
> 
> -- 
> Bernd Schubert
> Q-Leap Networks GmbH
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems