Hi,

On 21 Aug 2007, at 17:02, Martin Pokorny wrote:

Kai Germaschewski wrote:
We've been experimenting with using Lustre as the root fs for our x86_64-based cluster. We've run into quite a few stability problems, with arbitrary processes on the nodes disappearing: sshd, gmond, the Myrinet mapper, and so on.

My cluster is seeing similar problems. It's a heterogeneous cluster with both x86_64 and i386 nodes, and although I'm not using Lustre as the root fs, I've noticed problems much like those you describe.
We have a similar problem on our x86_64 cluster. We are using Lustre as the scratch filesystem for running jobs.
We're running 2.6.18-vanilla + Lustre 1.6.1, with the filesystem mounted read-only. MGS/MDS/OST are all on one server node. I have trouble understanding most of what Lustre writes to the logs; any pointers to additional docs would be appreciated.

Here I'm running the 2.6.9-55.EL_lustre-1.6.1smp kernel: one MGS/MDS, a few OSS nodes, and a few clients. Mostly I'm using the node hosting the MGS/MDS as a Lustre client. The network is TCP.
We are running the same kernel, 2.6.9-55.EL_lustre-1.6.1smp: one MGS/MDS, one OSS, and hundreds of client nodes.

One consistently recurring problem is
LustreError: 11169:0:(mdc_locks.c:420:mdc_enqueue()) ldlm_cli_enqueue: -2
on the client.

I'm seeing exactly the same messages.
Exactly the same behavior here. These messages appear in the logs when jobs start.
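
To check whether these errors really do line up with job starts, something like the following quick Python sketch can tally them per hour from a client's syslog. The /var/log/messages path and the standard syslog timestamp prefix are just assumptions; adjust for your setup.

#!/usr/bin/env python
# Quick sketch: count "ldlm_cli_enqueue: -2" lines per hour in a client's
# syslog, to see whether they cluster around job start times.
# Assumption: messages land in /var/log/messages with the usual
# "Mon DD HH:MM:SS host ..." syslog prefix.

LOGFILE = "/var/log/messages"   # adjust for your syslog setup
counts = {}

for line in open(LOGFILE):
    if "ldlm_cli_enqueue: -2" in line:
        # "Aug 21 17:02:33 node42 kernel: ..." -> bucket as "Aug 21 17"
        stamp = " ".join(line.split()[:3])
        hour = stamp[:-6]                       # drop ":MM:SS", keep the hour
        counts[hour] = counts.get(hour, 0) + 1

# Rough tally; sorting is lexical, which is fine within a single month's log.
for hour in sorted(counts):
    print("%s:00  %d occurrence(s)" % (hour, counts[hour]))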

Last night, in addition, clients seemed to be getting evicted (and reconnecting) regularly even though they were up, which may be where the random processes died. Currently we're running with only one client, which seems to be stable apart from the error above repeating itself.

Occasionally I see messages similar to the following:

LustreError: 3707:0:(client.c:962:ptlrpc_expire_one_request()) @@@ timeout (sent at 1187711017, 50s ago) [EMAIL PROTECTED] x25961999/t0 o400->[EMAIL PROTECTED]@tcp:28 lens 128/128 ref 1 fl Rpc:N/0/0 rc 0/-22

which is concurrent with a long pause in fs access. As far as I can tell, recovery is then successful, and the jobs keep running. The main effect seems to be that file operations on the Lustre filesystem are greatly slowed.
I see exactly the same behavior, but sometimes the clients don't recover.
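
If it helps to see when the stalls happen and how long the requests had been outstanding, the timeout lines carry a "sent at <epoch>, <N>s ago" detail that a small script can pull out. Again, the /var/log/messages path is just an assumption.

#!/usr/bin/env python
# Quick sketch: list when ptlrpc request timeouts occurred and how long the
# requests had been outstanding, by parsing the "sent at <epoch>, <N>s ago"
# part of the ptlrpc_expire_one_request messages.
# Assumption: the client syslog is in /var/log/messages.
import re
import time

LOGFILE = "/var/log/messages"   # adjust for your syslog setup
TIMEOUT = re.compile(r"ptlrpc_expire_one_request.*sent at (\d+), (\d+)s ago")

for line in open(LOGFILE):
    m = TIMEOUT.search(line)
    if m:
        sent = int(m.group(1))                  # epoch seconds the RPC was sent
        ago = int(m.group(2))                   # how long ago it timed out
        when = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(sent))
        print("request sent %s timed out after %ss" % (when, ago))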

--
Martin


Wojciech

Mr Wojciech Turek
Assistant System Manager
University of Cambridge
High Performance Computing service
email: [EMAIL PROTECTED]
tel. +441223763517



_______________________________________________
Lustre-discuss mailing list
[email protected]
https://mail.clusterfs.com/mailman/listinfo/lustre-discuss
