Re: [Lustre-discuss] Luster recovery when clients go away

Andreas Dilger Thu, 31 Jul 2008 10:55:24 -0700

On Jul 31, 2008  10:30 -0400, Brock Palen wrote:
> One of our OSSes died with a panic last night.  Between the time it
> was found (no failover) and the time it was restarted, two clients
> also died (nodes crashed by user OOM).
> 
> Because of this, the OSTs are now looking for 626 clients to recover
> when only 624 are up.  So the 624 recover in about 15 minutes, but
> the OSTs on that OSS hang waiting for the last two, which are dead and
> not coming back.  Note the MDS reports only 624 clients.
> 
> Is there a way to tell the OSTs to go ahead and evict those two
> clients and finish recovering?  Also, "time remaining" has been 0
> since it was booted.  How long will the OSTs wait before they let
> operations continue?
> 
> Is there any way to speed up recovery?  The OSS that crashed sees
> very little CPU/disk/network traffic while recovery is going on, so
> any way to speed it up, even if it results in a higher load, would be
> great to know.
> 
> status: RECOVERING
> recovery_start: 1217509142
> time remaining: 0
> connected_clients: 624/626
> completed_clients: 624/626
> replayed_requests: 0/??
> queued_requests: 0
> next_transno: 175342162
> status: RECOVERING
> recovery_start: 1217509144
> time remaining: 0
> connected_clients: 624/626
> completed_clients: 624/626
> replayed_requests: 0/??
> queued_requests: 0
> next_transno: 193097794


The recovery should time out after about 5 minutes (with the default
100s timeouts).  Recovery goes as fast as clients connect and submit
RPCs for replay.  If all clients connect, recovery finishes as soon as
all clients report completion.

Are you saying the system is still stuck in recovery after more than 5
or 10 minutes?
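
One way to check this from the status text quoted above is to compare
recovery_start with the current time and count the clients that never
reconnected.  The following is only a rough sketch, assuming the status
is in the "key: value" form shown in the quoted output; the function
names and the 300-second threshold (taken from the "about 5 minutes"
estimate) are illustrative and not part of Lustre.

```python
import time

def parse_recovery_status(text):
    """Parse 'key: value' lines (optionally '>'-quoted) into a dict."""
    status = {}
    for line in text.splitlines():
        line = line.lstrip("> ").strip()
        if ":" not in line:
            continue
        key, _, value = line.partition(":")
        status[key.strip()] = value.strip()
    return status

def recovery_summary(status, now=None, expected_timeout=300):
    """Summarize one status block.

    expected_timeout is an assumed threshold in seconds, not a value
    read from the server.
    """
    now = time.time() if now is None else now
    connected, total = (int(n) for n in status["connected_clients"].split("/"))
    elapsed = int(now) - int(status["recovery_start"])
    recovering = status["status"] == "RECOVERING"
    return {
        "recovering": recovering,
        "missing_clients": total - connected,
        "elapsed_seconds": elapsed,
        "overdue": recovering and elapsed > expected_timeout,
    }

# First status block from the quoted message.
SAMPLE = """\
status: RECOVERING
recovery_start: 1217509142
time remaining: 0
connected_clients: 624/626
completed_clients: 624/626
replayed_requests: 0/??
queued_requests: 0
next_transno: 175342162
"""

# Pretend the check runs 15 minutes after recovery started.
summary = recovery_summary(parse_recovery_status(SAMPLE), now=1217509142 + 900)
print(summary)
```

If "overdue" is true while "missing_clients" is still nonzero, recovery
has run past the assumed timeout without evicting the absent clients.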

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
