Hi,
On 21 Aug 2007, at 17:02, Martin Pokorny wrote:
Kai Germaschewski wrote:
We've been playing with using Lustre as the root fs for our x86_64
based cluster. We've run into quite a few stability problems, with
arbitrary processes on the nodes disappearing: sshd, gmond,
the Myrinet mapper, or whatever.
My cluster is seeing similar problems. I've got a heterogeneous
cluster with both x86_64 and i386 nodes, and while I'm not using Lustre
as the root fs, I've noticed problems similar to those you've described.
We have a similar problem on our x86_64 cluster. We are using Lustre as
the scratch file system for running jobs.
We're running 2.6.18-vanilla + Lustre 1.6.1, with the filesystem
mounted read-only. MGS/MDS/OST are all on one server node. I have
trouble understanding most of what Lustre is writing to
the logs; any pointers to additional docs would be appreciated.
Here I'm running the 2.6.9-55.EL_lustre-1.6.1smp kernel. One MGS/
MDS, a few OSS nodes, and a few clients. Mostly I'm using the node
hosting the MGS/MDS as a Lustre client. Network is TCP.
We are running the same kernel, 2.6.9-55.EL_lustre-1.6.1smp: one MGS/MDS,
one OSS, and hundreds of nodes.
One consistently recurring problem is
LustreError: 11169:0:(mdc_locks.c:420:mdc_enqueue())
ldlm_cli_enqueue: -2
on the client.
I'm seeing exactly the same messages.
Exactly the same behavior here. These messages appear in the logs when
jobs start.
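For what it's worth, the trailing -2 in that ldlm_cli_enqueue message is a
negated errno, i.e. ENOENT ("No such file or directory"). A minimal C sketch
(not Lustre code, just the standard strerror() lookup) to decode such values:

    /* Decode a negative return code as printed in Lustre log lines.
     * -2 maps to ENOENT ("No such file or directory"). */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        int rc = -2;  /* return code from the ldlm_cli_enqueue message */
        printf("rc %d -> %s\n", rc, strerror(-rc));
        return 0;
    }

If that reading is right, the client asked the MDS about a name that doesn't
exist, which may be harmless, but I can't say whether that's what is
happening in your case.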
In addition, last night clients seemed to be evicted (and to reconnect)
regularly even though they were up, which may be when
the random processes died. Currently we're running with only one
client, which seems to be stable except for the error above
repeating itself.
Occasionally I see messages similar to the following:
LustreError: 3707:0:(client.c:962:ptlrpc_expire_one_request()) @@@
timeout (sent at 1187711017, 50s ago) [EMAIL PROTECTED]
x25961999/t0 o400->[EMAIL PROTECTED]@tcp:28 lens
128/128 ref 1 fl Rpc:N/0/0 rc 0/-22
which coincides with a long pause in filesystem access. As far as I can
tell, recovery is then successful, and the jobs keep running. The
main effect seems to be that file operations on the Lustre
filesystem are greatly slowed.
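As an aside, the "sent at 1187711017" field is a Unix epoch timestamp; a
short C sketch (again not Lustre code) turns it and the "50s ago" elapsed
value into something readable:

    /* Interpret the "sent at <epoch>, <N>s ago" fields of a
     * ptlrpc_expire_one_request() log line.  1187711017 falls on
     * 21 Aug 2007 (UTC); the request was declared expired 50 s later. */
    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        time_t sent = 1187711017;   /* "sent at" value from the log line */
        int elapsed = 50;           /* "50s ago" when the message was printed */
        char buf[64];

        strftime(buf, sizeof buf, "%Y-%m-%d %H:%M:%S UTC", gmtime(&sent));
        printf("request sent:  %s\n", buf);
        printf("gave up after: %d s\n", elapsed);
        return 0;
    }

I believe the deadline applied here is governed by the Lustre obd_timeout
setting, and raising it is the usual workaround when servers are slow to
respond under load, though I can't say that's the right fix for your setup.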
I see exactly the same behavior, but sometimes the clients don't
recover.
--
Martin
Wojciech
Mr Wojciech Turek
Assistant System Manager
University of Cambridge
High Performance Computing service
email: [EMAIL PROTECTED]
tel. +441223763517
_______________________________________________
Lustre-discuss mailing list
[email protected]
https://mail.clusterfs.com/mailman/listinfo/lustre-discuss