Re: [Lustre-discuss] Recovery without end

Brian J. Murrell Wed, 25 Feb 2009 08:03:38 -0800

On Wed, 2009-02-25 at 16:09 +0100, Thomas Roth wrote:
> 
> Our /proc/sys/lustre/timeout is 1000


That's way too high.  Long recoveries are exactly the reason you don't
want this number to be huge.

>  - there has been some debate on
> this large value here, but most other installation will not run in a
> network environment with a setup as crazy as ours.

What's so crazy about your setup?  Unless your network is very flaky
and/or you have not tuned your OSSes properly, there should be no need
for such a high timeout, and if there is, you need to address the
problems requiring it.

> Putting the timeout
> to 100 immediately results in "Transport endpoint" errors, impossible to
> run Lustre like this.

300 is the max that we recommend and we have very large production
clusters that use such values successfully.
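
As a rough illustration of checking a node against that ceiling, here is
a minimal sketch.  It assumes the timeout is exposed as a plain integer
at /proc/sys/lustre/timeout (the path quoted above); the function name
check_lustre_timeout and the constant RECOMMENDED_MAX are made up for
this example and are not part of Lustre.

```python
from pathlib import Path

# Ceiling recommended in this thread; hypothetical constant name.
RECOMMENDED_MAX = 300


def check_lustre_timeout(path="/proc/sys/lustre/timeout",
                         recommended_max=RECOMMENDED_MAX):
    """Read the timeout from `path` and return (value, within_limit).

    Assumes the file holds a single integer, as /proc tunables usually do.
    """
    value = int(Path(path).read_text().strip())
    return value, value <= recommended_max


if __name__ == "__main__":
    try:
        value, ok = check_lustre_timeout()
    except (OSError, ValueError) as exc:
        # No Lustre on this node, or the file did not contain an integer.
        print(f"could not read Lustre timeout: {exc}")
    else:
        if ok:
            print(f"timeout={value} is within the recommended maximum")
        else:
            print(f"timeout={value} exceeds the recommended maximum "
                  f"of {RECOMMENDED_MAX}")
```

Run on the system described above, this would flag the value of 1000.
It only reports the setting; it does not change it or diagnose why a
large value seemed necessary.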

> Since this is a 1.6.5.1 system, I activated the adaptive timeouts  - and
> put them to equally large values,
> /sys/module/ptlrpc/parameters/at_max = 6000
> /sys/module/ptlrpc/parameters/at_history = 6000
> /sys/module/ptlrpc/parameters/at_early_margin = 50
> /sys/module/ptlrpc/parameters/at_extra = 30

This is likely not good either.  I will let somebody more knowledgeable
about AT comment in detail though.  It's a new feature and not getting
wide use at all yet, so real-world experience with it is still limited.

b.


_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
