> FWIW, 1000 is waaaaay high.  Our biggest production systems (thousands
> if not 10s of thousands) nodes don't use values higher than 300 seconds.

Since I'm here at LLNL and we happen to have a few of these large systems,
maybe I should chime in.  While it is true that our large systems (many
thousands of nodes) use a timeout value of 300s, it is not true that this
value prevents all of our timeouts.  Actual usage has simply shown that 300s
prevents 99% of our timeouts while still allowing reasonable-length recovery
times.  To eliminate the rest, I feel the only viable solution is to validate
the new adaptive timeout feature for our production use.

-- 
Thanks,
Brian


_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
