Kai Germaschewski wrote:
> We've been playing with using Lustre as the root fs for our x86_64-based
> cluster. We've run into quite a few stability problems, with arbitrary
> processes on the nodes disappearing: sshd, gmond, the Myrinet mapper,
> and so on.
My cluster is seeing similar problems. I've got a heterogeneous cluster
with both x86_64 and i386 nodes. I'm not using Lustre as the root fs,
but I've noticed problems similar to those you've described.
> We're running 2.6.18-vanilla + Lustre 1.6.1, with the filesystem mounted
> read-only. MGS/MDS/OST are all on one server node. I'm having trouble
> understanding most of what Lustre is writing to the logs; any
> pointers to additional docs would be appreciated.
Here I'm running the 2.6.9-55.EL_lustre-1.6.1smp kernel. One MGS/MDS, a
few OSS nodes, and a few clients. Mostly I'm using the node hosting the
MGS/MDS as a Lustre client. Network is TCP.
> One consistently recurring problem is
> LustreError: 11169:0:(mdc_locks.c:420:mdc_enqueue()) ldlm_cli_enqueue: -2
> on the client.
I'm seeing exactly the same messages.
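For what it's worth, the error codes in these log lines appear to be negated POSIX errno values, so the -2 above would be ENOENT ("No such file or directory") — often just a lookup of a path that doesn't exist. A quick way to decode such codes with nothing but the Python standard library (the helper name `decode_rc` is my own, not anything from Lustre):

```python
import errno

def decode_rc(rc):
    """Map a negative return code from a Lustre log line to its errno name."""
    return errno.errorcode.get(-rc, "unknown")

print(decode_rc(-2))  # ENOENT
```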
> Last night, in addition, clients seemed to be evicted regularly (and then
> reconnect) even though they were up, which may be where the random
> processes died. Currently we're running with only one client, which seems
> to be stable except for the error above repeating itself.
Occasionally I see messages similar to the following:
LustreError: 3707:0:(client.c:962:ptlrpc_expire_one_request()) @@@
timeout (sent at 1187711017, 50s ago) [EMAIL PROTECTED] x25961999/t0
o400->[EMAIL PROTECTED]@tcp:28 lens 128/128 ref 1 fl
Rpc:N/0/0 rc 0/-22
which coincides with a long pause in fs access. As far as I can
tell, recovery is then successful, and the jobs keep running. The main
effect seems to be that file operations on the Lustre filesystem are
greatly slowed.
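The rc 0/-22 at the end of that timeout line decodes the same way, assuming the negative number is a negated errno: -22 would be EINVAL ("Invalid argument"). A small sketch covering both values from that line:

```python
import errno

# The two return codes from the ptlrpc timeout line above: "rc 0/-22".
for rc in (0, -22):
    # 0 means success; negative values map to errno names.
    name = "OK" if rc == 0 else errno.errorcode.get(-rc, "unknown")
    print(rc, name)
```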
--
Martin
_______________________________________________
Lustre-discuss mailing list
[email protected]
https://mail.clusterfs.com/mailman/listinfo/lustre-discuss