On Thu, Apr 10, 2008 at 07:30:31PM +0200, Bernd Schubert wrote:
> On Thursday 10 April 2008 12:48:27 Lars Ellenberg wrote:
> > On Wed, Apr 09, 2008 at 06:34:39PM +0200, Lars Marowsky-Bree wrote:
> > > On 2008-04-08T19:32:58, Bernd Schubert <[EMAIL PROTECTED]> wrote:
> > > > Hello,
> > > >
> > > > I need to set a rather huge dead time of 1200s, but the initial dead
> > > > time is supposed to be of 120s or less. However, heartbeat tries to be
> > > > schoolmasterly and doesn't want to accept my settings:
> > > >
> > > > deadtime 1200 # time to declare a node dead
> > > > initdead 120  # time to declare a node dead on heartbeat startup
> > > > keepalive 120 # how often to send keepalive packets
> > >
> > > Algorithmic reasons require that initdead be larger than deadtime.
> > >
> > > keepalive every two minutes and deadtime at 20 minutes is exceptional.
> > >
> > > Not even Lustre should create a load so high that a realtime priority
> > > thread which is entirely locked into memory is not reliably scheduled
> > > for 20 minutes at a stretch!
> >
> > Bernd, are you sure that heartbeat is not scheduled,
> > or is it possible that the heartbeat UDP packets just fall on the floor
> > because of memory pressure and network congestion, and maybe even
> > only heartbeating on the client data network?
> 
> I can exclude network congestion, since Lustre goes over Infiniband, while
> heartbeat goes over two independent IP connections, one of which is a direct
> server-to-server connection.
> 
> >
> > what I would find out first: is heartbeat not scheduled,
> > or do the heartbeats get lost (as you know, udp is unreliable).
> 
> It is rather probable that heartbeat is just not scheduled, since even simple
> shell commands then hang. I already analyzed the kernel traces when Lustre and
> Linux-md are at high load - almost everything is in wait_for_completion(),
> schedule_timeout() and get_active_stripe() then.
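
One rough way to see scheduling stalls directly, independent of any
network traffic, is a tiny probe that sleeps for a fixed interval and
records how late it wakes up. This is only a sketch: the function name,
the defaults and the threshold are made up for illustration and have
nothing to do with heartbeat's own code.

```python
import time


def measure_stalls(interval=0.1, rounds=50, threshold=0.5):
    """Sleep `interval` seconds `rounds` times.

    Returns the list of wake-up delays (in seconds) that exceeded
    `threshold`, i.e. the times the process was not scheduled promptly.
    """
    stalls = []
    for _ in range(rounds):
        start = time.monotonic()
        time.sleep(interval)
        # How much longer than requested did the sleep actually take?
        overshoot = time.monotonic() - start - interval
        if overshoot > threshold:
            stalls.append(overshoot)
    return stalls


if __name__ == "__main__":
    for delay in measure_stalls():
        print(f"woke up {delay:.3f}s late")
```

Run under load, once at normal priority and once with a realtime
policy (for example `chrt -f 50 python3 stallprobe.py`, where the file
name is hypothetical), it gives a first hint whether realtime priority
helps at all. The Python interpreter is not mlocked, so page faults can
still delay it; treat the numbers as indicative only.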

hm. does heartbeat get the realtime priority?
"even simple shell commands hang", well, yes, sure.
but, did you try a realtime prio mlocked busybox?
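
To check the first question on a running system, something along these
lines might do. It is a minimal sketch, assuming Linux with /proc
mounted; the helper names `parse_vmlck` and `describe` are my own, and
you have to supply the pid of the heartbeat process yourself.

```python
import os

# Readable names for the Linux scheduling policies Python exposes.
POLICY_NAMES = {
    os.SCHED_OTHER: "SCHED_OTHER",
    os.SCHED_FIFO: "SCHED_FIFO",
    os.SCHED_RR: "SCHED_RR",
    os.SCHED_BATCH: "SCHED_BATCH",
    os.SCHED_IDLE: "SCHED_IDLE",
}


def parse_vmlck(status_text):
    """Return the VmLck value (kB) from /proc/<pid>/status text, or None."""
    for line in status_text.splitlines():
        if line.startswith("VmLck:"):
            return int(line.split()[1])
    return None


def describe(pid):
    """Report scheduling policy, priority and locked memory of `pid`."""
    # The kernel may OR SCHED_RESET_ON_FORK into the policy; mask it out.
    reset_flag = getattr(os, "SCHED_RESET_ON_FORK", 0)
    policy = os.sched_getscheduler(pid) & ~reset_flag
    priority = os.sched_getparam(pid).sched_priority
    with open(f"/proc/{pid}/status") as f:
        locked_kb = parse_vmlck(f.read())
    return {
        "policy": POLICY_NAMES.get(policy, str(policy)),
        "realtime": policy in (os.SCHED_FIFO, os.SCHED_RR),
        "priority": priority,
        "locked_kb": locked_kb,
    }


if __name__ == "__main__":
    print(describe(os.getpid()))
```

If `realtime` comes back False, or `locked_kb` is 0, for the heartbeat
process, then it is not running the way we all assume it does.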

I'm more than just curious here, I really want to know.
We are the DRBD people, and Lustre teamed up with DRBD
is a very appealing storage architecture.
Unfortunately I don't have the infrastructure at hand (yet)
to play with this in our Lab, so please keep me posted on any
real-life experience with Lustre.


-- 
: Lars Ellenberg                            Tel +43-1-8178292-0  :
: LINBIT Information Technologies GmbH      Fax +43-1-8178292-82 :
: Vivenotgasse 48, A-1120 Vienna/Europe    http://www.linbit.com :
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
