On Wed, 2009-02-25 at 11:22 -0500, Charles Taylor wrote:
> I know you will be tempted to tell us that our network must be flakey
> but it simply is not. We'd love to understand why we need such a
> large timeout value and why, if we don't use a large value, we see
> these transport end-point failures. However, after spending several
> days trying to understand and resolve the issue, we finally just
> accepted the long timeout as a suitable workaround.

I'd encourage you to upgrade to the latest version of Lustre (just so we are not chasing possibly old and fixed bugs) and re-evaluate your timeout and report how it works out for you. If you still see unreliability, then file a bug.

I'd also suggest (if you have not already done it) that you use the iokit to be sure your OSSes are properly tuned for the storage bandwidth they have available to them and not tying up OST processes for overly long periods of time waiting for storage access.

> I wonder if there are others who have silently done the same. We'll
> be upgrading to 1.6.6 or 1.6.7 in the not-too-distant future. Maybe
> then we'll be able to do away with the long timeout value but until
> then, we need it. :(

Sounds like a good idea.

b.
_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
