On Tue, Apr 07, 2009 at 08:14:55AM -0400, Brian J. Murrell wrote:
[snip]
> If the lost client has a transaction that needs to be replayed, all of
> the transactions up to that missing transaction are replayed but all
> subsequent transactions are discarded and when the recovery timer
> expires, recovery is aborted.
[snip]
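To make sure the behavior in the quoted text is read the same way by everyone, here is a rough sketch of it. This is only an illustration of the description above, not Lustre code: the function name `replay_until_gap`, the dictionary fields, and the client names are all hypothetical, and it assumes transaction numbers are consecutive integers.

```python
def replay_until_gap(received, first_transno):
    """Replay transactions in order and stop at the first missing number.

    Hypothetical model of the quoted description: everything before the
    gap is replayed, everything after it is discarded.
    """
    by_transno = {t["transno"]: t for t in received}
    replayed = []
    expected = first_transno
    # Replay strictly in order for as long as the next number is present.
    while expected in by_transno:
        replayed.append(by_transno.pop(expected))
        expected += 1
    # Whatever is left sits after the gap and is thrown away.
    discarded = sorted(by_transno.values(), key=lambda t: t["transno"])
    return replayed, discarded


# Hypothetical example: client "c2" never reconnects, so transaction 3 is missing.
received = [
    {"transno": 1, "client": "c1"},
    {"transno": 2, "client": "c3"},
    {"transno": 4, "client": "c1"},
    {"transno": 5, "client": "c3"},
]
replayed, discarded = replay_until_gap(received, first_transno=1)
print([t["transno"] for t in replayed])   # [1, 2]
print([t["transno"] for t in discarded])  # [4, 5]
```

In this model, transactions 4 and 5 belong to clients that did reconnect, yet they are lost along with the missing one.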
Discarding all subsequent transactions causes a lot of collateral damage in a multi-cluster, mixed parallel job environment where "file-per-process" style I/O predominates. Could somebody remind me of the use cases protected by this behavior?

In the case of I/O to a shared file, aren't Lustre's error handling obligations met by evicting the single offending client?

Perhaps I am thinking too provincially, because in our environment I/O to shared files generally (always?) takes place in the context of a parallel job, and the single client eviction and EIO (or reboot of the client) should be sufficient to terminate the whole job with an error.

Thanks,
Jim
_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
