On Aug 23, 2007  11:25 +0100, Wojciech Turek wrote:
> We have a large cluster installation with Lustre 1.6.1.
> We have 550 clients and one MDS/MGS and one OSS with one OST. The
> back-end storage is a Dell MD1000.
>
> We are currently experiencing lots of difficulties caused by clients
> losing their connection to the Lustre file system. Sometimes clients
> regain the connection, but jobs that were running on those clients are
> usually dead, and sometimes clients just hang waiting to reconnect to
> the Lustre file system.

I suspect having 550 clients doing IO to a single OST is overloading it.
While there are setups with many thousands of clients, they also have
more OSTs to spread the load over.
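
If you can add OSS/OST capacity, bringing extra OSTs online is straightforward.
A rough sketch, assuming a spare block device /dev/sdb, the fsname "testfs",
and an MGS reachable at mgs@tcp0 (all placeholders for your actual setup):

    # format the new OST and register it with the MGS
    mkfs.lustre --ost --fsname=testfs --mgsnode=mgs@tcp0 /dev/sdb

    # mount it to bring the OST online; new files will then be allocated
    # across the OSTs (or striped, if you raise the stripe count)
    mkdir -p /mnt/ost1
    mount -t lustre /dev/sdb /mnt/ost1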

> ptlrpc_expire_one_request()) @@@ timeout (sent at 1187860762, 100s ago)

You at least need to increase the timeout on the clients and servers;
I'd suggest 300s for such a heavily loaded system. For example:
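
A minimal sketch, again assuming the fsname "testfs" (substitute yours):

    # on the MGS: make the larger timeout permanent for the whole filesystem
    lctl conf_param testfs.sys.timeout=300

    # on each client and server: apply it to the running system right away
    echo 300 > /proc/sys/lustre/timeout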

> ----------------------------------------------- 
> OSS--------------------------------------------------------------------- 
> --------------------------------------------------------
> ()) Watchdog triggered for pid 15618: it was inactive for 100s
> Aug 23 10:02:02 storage02 kernel: Lustre: 0:0:(linux-debug.c:168:libcfs_debug_dumpstack()) showing stack for process 15618
> Aug 23 10:02:02 storage02 kernel: ll_ost_251    S {:ptlrpc:ptlrpc_expired_set+0} <ffffffffa03fa380> {:ptlrpc:ptlrpc_interrupted_set+0}
> Aug 23 10:02:02 storage02 kernel:        <ffffffffa03d154b> {:ptlrpc:ldlm_send_and_maybe_create_set+27}

This looks to be related to parallel lock cancellations.  Can you try
reducing PARALLEL_AST_LIMIT to, say, 64 and see if that improves the
situation?
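
If (as I believe) PARALLEL_AST_LIMIT is still a compile-time #define in the
1.6 source rather than a runtime tunable, lowering it means patching and
rebuilding the server modules. A rough sketch (the exact path is an assumption):

    # locate the current definition in the Lustre source tree
    grep -rn "define PARALLEL_AST_LIMIT" lustre/

    # change the value in that header, e.g.
    #   #define PARALLEL_AST_LIMIT 64
    # then rebuild and reinstall the lustre modules on the servers and remount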

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
