On Aug 23, 2007 11:25 +0100, Wojciech Turek wrote:
> We have a large cluster installation with Lustre 1.6.1.
> We have 550 clients and one MDS/MGS and one OSS with a single OST.
> The back-end storage is a Dell MD1000.
>
> We are currently experiencing a lot of difficulties caused by clients
> losing their connection to the Lustre file system. Sometimes a client
> regains the connection, but the jobs that were running on it are usually
> dead, and sometimes the client just hangs waiting to reconnect to the
> Lustre file system.
I suspect having 550 clients doing I/O to a single OST is overloading it.
While there are setups with many thousands of clients, they also have
many more OSTs to spread the load over.
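For reference, a hedged sketch of how additional OSTs are added under 1.6
(the fsname "lustre1", the MGS NID, and the device paths below are
placeholders, not taken from your setup):

    # on an OSS node: format an additional OST for filesystem "lustre1"
    mkfs.lustre --fsname=lustre1 --ost --mgsnode=mgs@tcp0 /dev/sdc
    # mounting the target brings it online; new files will stripe over it
    mount -t lustre /dev/sdc /mnt/lustre-ost1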
> ptlrpc_expire_one_request()) @@@ timeout (sent at 1187860762, 100s ago)
You at least need to increase the timeout on the clients and servers;
I'd suggest 300s for such a heavily loaded system.
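For example (assuming fsname "lustre1"; verify both commands against your
release before relying on them):

    # on every client and server: raise the timeout on the running system
    echo 300 > /proc/sys/lustre/timeout
    # on the MGS: make the new timeout persistent for the whole filesystem
    lctl conf_param lustre1.sys.timeout=300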
> ------------------------------- OSS -------------------------------
> ()) Watchdog triggered for pid 15618: it was inactive for 100s
> Aug 23 10:02:02 storage02 kernel: Lustre: 0:0:(linux-debug.c:168:libcfs_debug_dumpstack()) showing stack for process 15618
> Aug 23 10:02:02 storage02 kernel: ll_ost_251 S {:ptlrpc:ptlrpc_expired_set+0} <ffffffffa03fa380> {:ptlrpc:ptlrpc_interrupted_set+0}
> Aug 23 10:02:02 storage02 kernel: <ffffffffa03d154b> {:ptlrpc:ldlm_send_and_maybe_create_set+27}
This looks to be related to parallel lock cancellations. Can you try
reducing PARALLEL_AST_LIMIT to, say, 64 and see if that improves the
situation?
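PARALLEL_AST_LIMIT is a compile-time constant, so this needs a source edit
and a module rebuild. A sketch, assuming a 1.6-style tree where the #define
lives in the ldlm code (check the exact file and default in your release):

    # find where the constant is defined (the path may vary)
    grep -rn 'define PARALLEL_AST_LIMIT' lustre/
    # drop the limit to 64, whatever the current default is
    sed -i 's/\(#define PARALLEL_AST_LIMIT\) .*/\1 64/' lustre/ldlm/ldlm_lockd.c
    # rebuild and reinstall the lustre modules on the servers
    make && make install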
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.