Kai Germaschewski wrote:
> We've been playing with using Lustre as the root fs for our x86_64-based
> cluster. We've run into quite a few stability problems, with arbitrary
> processes on the nodes disappearing: sshd, gmond, the Myrinet mapper,
> and so on.
My cluster is seeing similar problems. I've got a heterogeneous cluster
with both x86_64 and i386 nodes. I'm not using Lustre as the root fs,
but I've noticed problems similar to those you've described.
> We're running 2.6.18-vanilla + Lustre 1.6.1, with the filesystem mounted
> read-only. MGS/MDS/OST are all on one server node. I'm having trouble
> understanding most of what Lustre is writing to the logs; any
> pointers to additional docs would be appreciated.
Here I'm running the 2.6.9-55.EL_lustre-1.6.1smp kernel. One MGS/MDS, a
few OSS nodes, and a few clients. Mostly I'm using the node hosting the
MGS/MDS as a Lustre client. Network is TCP.
> One consistently recurring problem is
> LustreError: 11169:0:(mdc_locks.c:420:mdc_enqueue()) ldlm_cli_enqueue: -2
> on the client.
I'm seeing exactly the same messages.
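For what it's worth, the error codes in these log lines appear to be negated POSIX errno values, so the -2 above would be ENOENT ("No such file or directory") — often just a lookup of a path that doesn't exist. A quick way to decode such codes with nothing but the Python standard library (the helper name `decode_rc` is my own, not anything from Lustre):

```python
import errno

def decode_rc(rc):
    """Map a negative return code from a Lustre log line to its errno name."""
    return errno.errorcode.get(-rc, "unknown")

print(decode_rc(-2))  # ENOENT
```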
> Last night, in addition, clients seemed to be evicted regularly (and then
> reconnect) even though they were up, which may be where the random
> processes died. Currently we're running with only one client, which seems
> to be stable except for the error above repeating itself.
Occasionally I see messages similar to the following:
LustreError: 3707:0:(client.c:962:ptlrpc_expire_one_request()) @@@
timeout (sent at 1187711017, 50s ago) [EMAIL PROTECTED] x25961999/t0
o400->[EMAIL PROTECTED]@tcp:28 lens 128/128 ref 1 fl
Rpc:N/0/0 rc 0/-22
which coincides with a long pause in fs access. As far as I can
tell, recovery is then successful, and the jobs keep running. The main
effect seems to be that file operations on the Lustre filesystem are
greatly slowed.
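The rc 0/-22 at the end of that timeout line decodes the same way, assuming the negative number is a negated errno: -22 would be EINVAL ("Invalid argument"). A small sketch covering both values from that line:

```python
import errno

# The two return codes from the ptlrpc timeout line above: "rc 0/-22".
for rc in (0, -22):
    # 0 means success; negative values map to errno names.
    name = "OK" if rc == 0 else errno.errorcode.get(-rc, "unknown")
    print(rc, name)
```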
--
Martin
_______________________________________________
Lustre-discuss mailing list
[email protected]
https://mail.clusterfs.com/mailman/listinfo/lustre-discuss