Re: [Lustre-discuss] Nodes claim error with files, then say everything is fine.

Chris Worley Wed, 06 Aug 2008 10:08:12 -0700

On Wed, Aug 6, 2008 at 10:45 AM, Brian J. Murrell <[EMAIL PROTECTED]> wrote:
> On Wed, 2008-08-06 at 10:41 -0600, Chris Worley wrote:
>>
>> Is there anything in /proc or /sys I can look at to see whatever
>> "keepalive" parameters are setup?
>
> All timeouts are based on the obd_timeout in /proc/sys/lustre/timeout
> which MUST be the same on all nodes.
>


Would you suggest I increase or decrease this value?
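
Since the value has to match on every node, a first step could be to compare what each node reports. Below is a minimal Python sketch, assuming the contents of /proc/sys/lustre/timeout have already been collected from each node by some other means (ssh, pdsh, or similar). The function name, host names, and sample values are hypothetical, for illustration only.

```python
from collections import Counter


def find_timeout_mismatches(raw_values):
    """Return (most_common_timeout, outliers).

    raw_values maps a host name to the text read from
    /proc/sys/lustre/timeout on that host. outliers maps each host
    whose value differs from the most common one to its value.
    """
    # Parse the raw file contents ("100\n") into integers.
    timeouts = {host: int(text.strip()) for host, text in raw_values.items()}
    if not timeouts:
        return None, {}
    # Treat the most frequently seen value as the reference.
    most_common, _ = Counter(timeouts.values()).most_common(1)[0]
    outliers = {h: t for h, t in timeouts.items() if t != most_common}
    return most_common, outliers


if __name__ == "__main__":
    # Hypothetical sample data; real values come from each node.
    sample = {"mds1": "100\n", "client-a": "100\n", "client-b": "300\n"}
    reference, outliers = find_timeout_mismatches(sample)
    print("most common timeout:", reference)
    for host, value in sorted(outliers.items()):
        print(f"{host} differs: {value}")
```

Any host listed as differing would be a candidate for having its timeout brought in line with the rest before looking further.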

Is there a way to inhibit the eviction, or is it necessary to keep
truly dead clients from locking out files?

>> The systems aren't dying.
>
> They are failing to communicate with the MDS for some reason.  Network
> problems perhaps?  You could try enabling +rpctrace debug and inspecting
> the debug file for RPCs to see if the client is indeed sending something
> (even if it's a ping) at regular intervals.

All the systems (RHEL4 and 5 clients, Lustre servers) are on the same
Ethernet and IB switches.  There were no issues before the 1.6.5.1
upgrade with the RHEL5 nodes.

Would a normal ping do it?  I can jury-rig all the RHEL5 nodes to ping the MDS.

Chris
_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
