On Wed, 2009-02-25 at 16:09 +0100, Thomas Roth wrote:
>
> Our /proc/sys/lustre/timeout is 1000

That's way too high. Long recoveries are exactly the reason you don't want this number to be huge.

> - there has been some debate on
> this large value here, but most other installation will not run in a
> network environment with a setup as crazy as ours.

What's so crazy about your setup? Unless your network is very flaky and/or you have not tuned your OSSes properly, there should be no need for such a high timeout, and if there is, you need to address the problems requiring it.

> Putting the timeout
> to 100 immediately results in "Transport endpoint" errors, impossible to
> run Lustre like this.

300 is the max that we recommend, and we have very large production clusters that use such values successfully.

> Since this is a 1.6.5.1 system, I activated the adaptive timeouts - and
> put them to equally large values,
> /sys/module/ptlrpc/parameters/at_max = 6000
> /sys/module/ptlrpc/parameters/at_history = 6000
> /sys/module/ptlrpc/parameters/at_early_margin = 50
> /sys/module/ptlrpc/parameters/at_extra = 30

This is likely not good either. I will let somebody more knowledgeable about AT comment in detail, though. It's a new feature and not getting wide use at all yet, so real-world experience is still low.

b.
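
For anyone who wants to compare their own settings against the numbers discussed above, here is a rough Python sketch. It only reads the files named in this thread and flags a timeout above the 300 mentioned as the recommended max. The helper names (`read_int`, `report`) and the `root` argument are made up for illustration, and the paths are assumed to be exactly as quoted; they may differ on other Lustre versions.

```python
from pathlib import Path

# Paths as quoted in this thread (relative, so a test root can be used).
TIMEOUT_PATH = "proc/sys/lustre/timeout"
AT_DIR = "sys/module/ptlrpc/parameters"
AT_PARAMS = ["at_max", "at_history", "at_early_margin", "at_extra"]

# Ceiling taken from the recommendation stated in this thread.
RECOMMENDED_MAX_TIMEOUT = 300


def read_int(path):
    """Return the integer stored in a single-value file, or None if unavailable."""
    try:
        return int(Path(path).read_text().split()[0])
    except (OSError, ValueError, IndexError):
        return None


def report(root="/"):
    """Collect timeout settings under `root` (hypothetical helper, illustration only)."""
    root = Path(root)
    values = {"timeout": read_int(root / TIMEOUT_PATH)}
    for name in AT_PARAMS:
        values[name] = read_int(root / AT_DIR / name)

    warnings = []
    timeout = values["timeout"]
    if timeout is not None and timeout > RECOMMENDED_MAX_TIMEOUT:
        warnings.append(
            f"timeout={timeout} exceeds the recommended max of {RECOMMENDED_MAX_TIMEOUT}"
        )
    return values, warnings
```

Calling `report()` on a node returns a dict of the values found (`None` where a file is missing) plus a list of warnings. It makes no judgement on the AT parameters, since no recommended values for those are given here.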
_______________________________________________ Lustre-discuss mailing list [email protected] http://lists.lustre.org/mailman/listinfo/lustre-discuss
