Re: [Lustre-discuss] Recovery without end

Brian J. Murrell Wed, 25 Feb 2009 08:37:31 -0800

On Wed, 2009-02-25 at 11:22 -0500, Charles Taylor wrote:
> I know you will be tempted to tell us that our network must be flakey  
> but it simply is not.   We'd love to understand why we need such a  
> large timeout value and why, if we don't use a large value, we see  
> these transport end-point failures.    However, after spending several  
> days trying to understand and resolve the issue, we finally just  
> accepted the long timeout as a suitable workaround.


I'd encourage you to upgrade to the latest version of Lustre (just so we
are not chasing old bugs that may already be fixed), then re-evaluate
your timeout and report how it works out for you.  If you still see
unreliability, then file a bug.

I'd also suggest (if you have not already done so) that you use the
Lustre I/O kit (lustre-iokit) to be sure your OSSes are properly tuned
for the storage bandwidth they have available to them and are not tying
up OST processes for overly long periods of time waiting for storage
access.

> I wonder if there are others who have silently done the same.   We'll  
> be upgrading to 1.6.6 or 1.6.7 in the not-too-distant future.    Maybe  
> then we'll be able to do away with the long timeout value but until  
> then, we need it.  :(

Sounds like a good idea.

b.


