On Tue, 2009-04-07 at 08:34 -0700, Jim Garlick wrote:

> Discarding all transactions

Only transactions subsequent to a missing transaction.

> causes a lot of collateral damage in a
> multi-cluster, mixed parallel job environment where "file-per-process"
> style I/O predominates.

Indeed, depending on where the AWOL client's transaction sits in the replay
stream. If it was the last transaction, the loss is absolutely minimal, but
if it was the first transaction, the loss is absolutely maximal.

> Could somebody remind me of the use cases protected by this behavior?

Simply transactional dependency. If you don't know what the AWOL client did
to a given file, you cannot reliably process any further updates to that
file, and if you don't have the AWOL client to ask which files it has
transactions for, everything subsequent to that client's transaction has to
be treated as suspect. While I don't have any examples off-hand, I am sure
one of the devs who constantly have their fingers in replay can cite many
actual scenarios where this is a problem.

> In the case of I/O to a shared file, aren't lustre's error handling
> obligations met by evicting the single offending client?

No. All clients subsequently have to be evicted, per the above.

> Perhaps I am
> thinking too provincially because in our environment, I/O to shared
> files generally (always?) takes place in the context of a parallel job,
> and the single client eviction and EIO (or reboot of client) should
> be sufficient to terminate the whole job with an error.

Yours is probably a scenario where VBR will do really well then, given that
VBR serializes replay only on truly dependent transactions, rather than the
single serial stream (of assumed-dependent transactions) that replay
currently operates with.

b.
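For anyone following along, here is a toy sketch (not Lustre code; all names
are made up for illustration) of the distinction being discussed: classic
replay must discard every transaction after a gap, while a VBR-style scheme
records a per-object pre-version with each transaction and discards only the
updates that truly depend on the missing one.

```python
def serial_replay(transactions, missing_transno):
    """Classic replay: everything after the gap is suspect and discarded."""
    return [t for t in transactions if t["transno"] < missing_transno]

def vbr_replay(transactions, object_versions, missing_transno):
    """VBR-style replay: apply a transaction only if the object's current
    version still matches the pre-version the client recorded; otherwise
    the transaction depends on the missing update and is discarded."""
    applied = []
    for t in sorted(transactions, key=lambda t: t["transno"]):
        if t["transno"] == missing_transno:
            continue  # the AWOL client's transaction is gone for good
        obj = t["object"]
        if object_versions.get(obj) == t["pre_version"]:
            object_versions[obj] = t["transno"]  # new version := transno
            applied.append(t)
        # else: truly dependent on the missing transaction -> discarded
    return applied

# File-per-process style I/O: clients A and B each write their own file.
txns = [
    {"transno": 1, "object": "fileA", "pre_version": 0},  # AWOL client
    {"transno": 2, "object": "fileB", "pre_version": 0},
    {"transno": 3, "object": "fileA", "pre_version": 1},  # depends on 1
    {"transno": 4, "object": "fileB", "pre_version": 2},
]
# Serial replay loses everything; VBR keeps fileB's independent updates.
serial = serial_replay(txns, missing_transno=1)
vbr = vbr_replay(txns, {"fileA": 0, "fileB": 0}, missing_transno=1)
```

In this sketch only fileA's dependent update (transno 3) is lost under VBR,
which is the "minimal collateral damage" case for file-per-process I/O.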
_______________________________________________ Lustre-discuss mailing list [email protected] http://lists.lustre.org/mailman/listinfo/lustre-discuss
