Hi,

On 21 Aug 2007, at 17:02, Martin Pokorny wrote:

Kai Germaschewski wrote:
We've been experimenting with using Lustre as the root fs for our x86_64-based cluster. We've run into quite a few stability problems, with arbitrary processes on the nodes disappearing: sshd, gmond, the Myrinet mapper, and so on.

My cluster is seeing similar problems. It's a heterogeneous cluster with both x86_64 and i386 nodes, and although I'm not using Lustre as the root fs, I've noticed problems much like those you describe.
We have a similar problem on our x86_64 cluster. We are using Lustre as the scratch filesystem for running jobs.
We're running 2.6.18-vanilla + Lustre 1.6.1, with the filesystem mounted read-only. MGS/MDS/OST are all on one server node. I have trouble understanding most of what Lustre writes to the logs; any pointers to additional docs would be appreciated.

Here I'm running the 2.6.9-55.EL_lustre-1.6.1smp kernel: one MGS/MDS, a few OSS nodes, and a few clients. Mostly I'm using the node hosting the MGS/MDS as a Lustre client. The network is TCP.
We are running the same kernel, 2.6.9-55.EL_lustre-1.6.1smp: one MGS/MDS, one OSS, and hundreds of client nodes.

One consistently recurring problem is
LustreError: 11169:0:(mdc_locks.c:420:mdc_enqueue()) ldlm_cli_enqueue: -2
on the client.

I'm seeing exactly the same messages.
Exactly the same behavior here. These messages appear in the logs when jobs start.
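
To check whether these errors really do line up with job starts, something like the following quick Python sketch can tally them per hour from a client's syslog. The /var/log/messages path and the standard syslog timestamp prefix are just assumptions; adjust for your setup.

#!/usr/bin/env python
# Quick sketch: count "ldlm_cli_enqueue: -2" lines per hour in a client's
# syslog, to see whether they cluster around job start times.
# Assumption: messages land in /var/log/messages with the usual
# "Mon DD HH:MM:SS host ..." syslog prefix.

LOGFILE = "/var/log/messages"   # adjust for your syslog setup
counts = {}

for line in open(LOGFILE):
    if "ldlm_cli_enqueue: -2" in line:
        # "Aug 21 17:02:33 node42 kernel: ..." -> bucket as "Aug 21 17"
        stamp = " ".join(line.split()[:3])
        hour = stamp[:-6]                       # drop ":MM:SS", keep the hour
        counts[hour] = counts.get(hour, 0) + 1

# Rough tally; sorting is lexical, which is fine within a single month's log.
for hour in sorted(counts):
    print("%s:00  %d occurrence(s)" % (hour, counts[hour]))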

Last night, in addition, clients seemed to be getting evicted (and reconnecting) regularly even though they were up, which may be where the random processes died. Currently we're running with only one client, which seems to be stable apart from the error above repeating itself.

Occasionally I see messages similar to the following:

LustreError: 3707:0:(client.c:962:ptlrpc_expire_one_request()) @@@ timeout (sent at 1187711017, 50s ago) [EMAIL PROTECTED] x25961999/t0 o400->[EMAIL PROTECTED]@tcp:28 lens 128/128 ref 1 fl Rpc:N/0/0 rc 0/-22

which is concurrent with a long pause in fs access. As far as I can tell, recovery is then successful, and the jobs keep running. The main effect seems to be that file operations on the Lustre filesystem are greatly slowed.
I see exactly the same behavior, but sometimes the clients don't recover.
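
If it helps to see when the stalls happen and how long the requests had been outstanding, the timeout lines carry a "sent at <epoch>, <N>s ago" detail that a small script can pull out. Again, the /var/log/messages path is just an assumption.

#!/usr/bin/env python
# Quick sketch: list when ptlrpc request timeouts occurred and how long the
# requests had been outstanding, by parsing the "sent at <epoch>, <N>s ago"
# part of the ptlrpc_expire_one_request messages.
# Assumption: the client syslog is in /var/log/messages.
import re
import time

LOGFILE = "/var/log/messages"   # adjust for your syslog setup
TIMEOUT = re.compile(r"ptlrpc_expire_one_request.*sent at (\d+), (\d+)s ago")

for line in open(LOGFILE):
    m = TIMEOUT.search(line)
    if m:
        sent = int(m.group(1))                  # epoch seconds the RPC was sent
        ago = int(m.group(2))                   # how long ago it timed out
        when = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(sent))
        print("request sent %s timed out after %ss" % (when, ago))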

--
Martin


Wojciech

Mr Wojciech Turek
Assistant System Manager
University of Cambridge
High Performance Computing service
email: [EMAIL PROTECTED]
tel. +441223763517



_______________________________________________
Lustre-discuss mailing list
[email protected]
https://mail.clusterfs.com/mailman/listinfo/lustre-discuss
