> FWIW, 1000 is waaaaay high. Our biggest production systems (thousands
> if not 10s of thousands) nodes don't use values higher than 300 seconds.
Since I'm here at LLNL and we happen to have a few of the large systems, maybe I should chime in.

While it is true that our large systems (many thousands of nodes) use a timeout value of 300s, it is not true that this value prevents all of our timeouts. Through actual usage, 300s has simply shown itself to prevent 99% of our timeouts while still allowing reasonable recovery times. To eliminate the remaining timeouts, I feel the only viable solution is to validate the new adaptive timeout feature for our production use.

--
Thanks,
Brian
_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
