On Thu, Apr 10, 2008 at 07:30:31PM +0200, Bernd Schubert wrote:
> On Thursday 10 April 2008 12:48:27 Lars Ellenberg wrote:
> > On Wed, Apr 09, 2008 at 06:34:39PM +0200, Lars Marowsky-Bree wrote:
> > > On 2008-04-08T19:32:58, Bernd Schubert <[EMAIL PROTECTED]> wrote:
> > > > Hello,
> > > >
> > > > I need to set a rather huge dead time of 1200s, but the initial dead
> > > > time is supposed to be of 120s or less. However, heartbeat tries to be
> > > > schoolmasterly and doesn't want to accept my settings:
> > > >
> > > > deadtime 1200   # time to declare a node dead
> > > > initdead 120    # time to declare a node dead on heartbeat startup
> > > > keepalive 120   # how often to send keepalive packets
> > >
> > > Algorithmic reasons require that initdead be larger than deadtime.
> > >
> > > keepalive every two minutes and deadtime at 20 minutes is exceptional.
> > >
> > > Not even Lustre should create a load so high that a realtime priority
> > > thread which is entirely locked into memory is not reliably scheduled
> > > for 20 minutes at a stretch!
> >
> > Bernd, are you sure that heartbeat is not scheduled,
> > or is it possible that the heartbeat UDP packets just fall on the floor
> > because of memory pressure and network congestion, and maybe even
> > only heartbeating on the client data network?
>
> I can exclude network congestion, since Lustre goes over Infiniband, while
> heartbeat goes over two independent IP connections, one of which is a
> direct server-to-server connection.
>
> > what I would find out first: is heartbeat not scheduled,
> > or do the heartbeats get lost (as you know, udp is unreliable).
>
> It is rather probable heartbeat is just not scheduled, since even simple
> shell commands then hang. I already analyzed the kernel traces when Lustre
> and Linux-md are at high load - almost everything is in
> wait_for_completion(), schedule_timeout() and get_active_stripe() then.
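One way to tell "not scheduled" apart from "packets lost" is a tiny probe
that asks for realtime priority, locks its memory, and then logs how late it
wakes up from short sleeps while the load is on. What follows is only a rough
sketch, assuming Linux and Python 3. The function names are made up for
illustration, the MCL_* values are the ones used on x86 and ARM Linux (other
architectures may differ), and none of this is taken from heartbeat itself.

```python
import ctypes
import ctypes.util
import os
import time

# Assumption: mlockall() flag values as defined on x86/ARM Linux.
MCL_CURRENT = 1
MCL_FUTURE = 2


def try_realtime(priority=1):
    """Try to switch this process to SCHED_FIFO; return True on success."""
    try:
        os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(priority))
        return True
    except (OSError, AttributeError):
        # No privileges, or the platform lacks these calls.
        return False


def try_mlockall():
    """Try to lock current and future pages into RAM; return True on success."""
    try:
        libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
        return libc.mlockall(MCL_CURRENT | MCL_FUTURE) == 0
    except (OSError, AttributeError):
        return False


def measure_wakeup_delays(interval=0.1, rounds=5):
    """Sleep `interval` seconds `rounds` times; return each oversleep in seconds."""
    delays = []
    for _ in range(rounds):
        start = time.monotonic()
        time.sleep(interval)
        elapsed = time.monotonic() - start
        delays.append(max(0.0, elapsed - interval))
    return delays


def run_probe(interval=1.0, rounds=60):
    """Hypothetical driver: report privileges obtained and the worst delay."""
    print("realtime:", try_realtime(), "mlocked:", try_mlockall())
    delays = measure_wakeup_delays(interval, rounds)
    print("worst wakeup delay: %.3f s" % max(delays))
```

Run `run_probe()` as root (or with the equivalent capabilities) while Lustre
and md are under load. If the worst delay stays small while heartbeats are
still missed, the packets are the more likely suspect; if even this probe
stalls for many seconds, scheduling is.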
Hm. Does heartbeat actually get the realtime priority?

"Even simple shell commands hang": well, yes, sure.
But did you try a realtime-priority, mlocked busybox?

I'm more than just curious here, I really want to know.
We are DRBD, and Lustre teamed up with DRBD is a very appealing
storage architecture. Unfortunately I don't have the infrastructure
at hand (yet) to play with this in our lab, so please keep me posted
on any real-life experience with Lustre.

-- 
: Lars Ellenberg                            Tel +43-1-8178292-0  :
: LINBIT Information Technologies GmbH      Fax +43-1-8178292-82 :
: Vivenotgasse 48, A-1120 Vienna/Europe    http://www.linbit.com :
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
