On Thu, 2009-07-16 at 18:59 -0400, Josephine Palencia wrote:
>
> What determines the speed at which a lustre fs will recover? (ex. after a
> crash)

How fast all of the clients can reconnect and replay their pending
transactions.

> Can (should) one hasten the recovery by tweaking some parameters?

There's not much to tweak. Recovery waits for a) all clients to
reconnect and replay or b) the recovery timer to run out, whichever
comes first. The recovery timer is a factor of obd_timeout. As you
probably know, obd_timeout has a value below which you will start to see
timeouts and evictions -- which you don't want, of course. So you don't
really want to set it below that value.

The first question people tend to ask when they discover they need to
tune their obd_timeout upwards to avoid lock callback timeouts and so
forth is "why don't I just set that really high then?". The answer is
always "because the higher you set it, the longer your recovery process
will take in the event that not all clients are available to replay".
Of course, the bigger your client count, the higher the odds that at
least one client is missing, so that the recovery timer, rather than all
clients reconnecting, is your deciding factor.

Of interest to all of this is that in 1.8, adaptive timeouts (AT) are
enabled by default, so obd_timeout should generally be high enough
without being too high -- i.e. optimal. So if your OSSes and MDS are
tuned such that they are not overwhelming their disk backend,
obd_timeout should be reasonable and therefore recovery should be
reasonable.

> For 4 OSTS each with 7TB, ~40 connected clients, recovery time
> is 48min. Is that reasonable or is that too long?

Wow. That seems long. That is recovery of what? A single OST or
single OSS, or something other?

b.
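
To make the "all clients replay, or the timer runs out" rule concrete, here is a rough Python sketch of the trade-off. It is only a toy model, not Lustre code: the function names, the `factor` multiplier, and all of the numbers are made up for illustration, and the real relationship between obd_timeout and the recovery timer depends on your Lustre version.

```python
from typing import List, Optional, Tuple


def recovery_window(obd_timeout: float, factor: float) -> float:
    """Recovery timer modelled as a multiple of obd_timeout.

    `factor` is a hypothetical multiplier, not a real Lustre tunable.
    """
    return factor * obd_timeout


def recovery_duration(
    replay_done: List[Optional[float]], window: float
) -> Tuple[float, int]:
    """Return (seconds spent in recovery, number of clients that missed it).

    replay_done holds one entry per expected client: the time in seconds at
    which that client finished replay, or None if it never reconnected.
    """
    finished = [t for t in replay_done if t is not None and t <= window]
    if len(finished) == len(replay_done):
        # Case a): every client replayed, so recovery ends with the slowest.
        return max(finished, default=0.0), 0
    # Case b): at least one client is missing, so the full timer is spent.
    return window, len(replay_done) - len(finished)


if __name__ == "__main__":
    # Illustrative values only: obd_timeout of 100 s and a factor of 3.
    window = recovery_window(obd_timeout=100.0, factor=3.0)
    print(recovery_duration([5.0, 12.0, 30.0], window))  # all clients return
    print(recovery_duration([5.0, None, 30.0], window))  # one client is gone
```

In this model a single absent client turns a 30 second recovery into a full timer expiry, which is why raising obd_timeout "just to be safe" costs you on every recovery where someone fails to reconnect.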
_______________________________________________ Lustre-discuss mailing list [email protected] http://lists.lustre.org/mailman/listinfo/lustre-discuss
