Looking at the error-handling logic that I'll need to write: The characteristics of the transport are such that once I send short messages there's no way to call them back, so the best I can do there is wait for them all to complete (possibly with a failure code) and then clean them up. For the DMA ops, there isn't a good (i.e., free of race conditions) way to tell what the instantaneous state of a transfer is, so the best I can do is invalidate the state of the various buffer descriptors and wait for them to trickle out, presumably with errors. That's true for both tx and rx ops.
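To make the "invalidate, then wait for them to trickle out" idea concrete, here is a minimal sketch in C. Every name in it (desc_state, peer_ops, desc_post, peer_invalidate, desc_complete) is hypothetical and made up for illustration; none of it is LNet or driver API, and it ignores locking entirely. The only point is that invalidation frees nothing, and the completion path is the single place where a descriptor leaves the pending set.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical descriptor states; not taken from any real driver. */
enum desc_state { DESC_FREE, DESC_POSTED, DESC_INVALIDATED, DESC_DONE };

struct desc {
    enum desc_state state;
    int status;             /* 0 on success, negative code on failure */
};

struct peer_ops {
    struct desc *descs;     /* descriptors associated with one peer */
    size_t ndescs;
    size_t pending;         /* posted or invalidated, not yet completed */
};

/* Hand a descriptor to the (imaginary) engine. */
static void desc_post(struct peer_ops *p, size_t i)
{
    assert(i < p->ndescs);
    p->descs[i].state = DESC_POSTED;
    p->descs[i].status = 0;
    p->pending++;
}

/*
 * Peer-down notification: mark every in-flight descriptor invalid.
 * The instantaneous state of a transfer can't be read without racing,
 * so nothing is freed here. Returns how many still have to drain.
 */
static size_t peer_invalidate(struct peer_ops *p)
{
    for (size_t i = 0; i < p->ndescs; i++) {
        if (p->descs[i].state == DESC_POSTED)
            p->descs[i].state = DESC_INVALIDATED;
    }
    return p->pending;
}

/*
 * Completion path. Records whatever status came back (success means
 * the transfer won the race against invalidation) and retires the
 * descriptor. Returns 1 when the last pending descriptor has drained,
 * 0 if more are outstanding, -1 for a completion on an idle descriptor.
 */
static int desc_complete(struct peer_ops *p, size_t i, int hw_status)
{
    struct desc *d = &p->descs[i];

    if (d->state != DESC_POSTED && d->state != DESC_INVALIDATED)
        return -1;
    d->status = hw_status;
    d->state = DESC_DONE;
    p->pending--;
    return p->pending == 0;
}
```

In this sketch the caller would only tear down per-peer state once desc_complete() returns 1, which is the "wait for them all to complete and then clean them up" step.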
The one bit of good news is that we're pretty sure that, in the case of a node failure or something similar, we can arrange to be sure within a small bounded time (seconds) that all pending traffic is done. In the case of a link failure (which stops clocking the DMA engine for that link) we can ensure that nothing is happening there, which means it's safe to reset the engine, which in turn means it's safe to yank out the pending buffer descriptors without the engine scribbling all over memory later.

Given all that, I think my error-recovery plan is approximately as follows:

For short messages, ignore them on the send side; once they're gone, they're gone. On the receive side, ignore those which arrive from peers that aren't in "good" state, except of course for hellos.

For RDMA transmits, if I get a notification that the peer is down, invalidate the descriptors which correspond to the buffers, and wait for them all to return. If they come back "ok", presumably that means I lost the race, and they get finalized as normal. All the ones that come back with errors get finalized with some kind (what kind?) of error. Is it safe to assume that in that case LNet and friends will deal with whatever kind of recovery is necessary? For instance, if an alternate OSS comes on line, do they do the right thing about replaying any pending messages to the new peer?

For RDMA receives, if I get a notification that the peer is down, I do the same sort of thing: invalidate the descriptors. In addition, if it's a link failure, I must do the under-the-hood stuff to guarantee that the engine is stopped, then signal completion on the relevant buffers by hand. After that I think the rest of the logic about finalizing applies.

Does that sound roughly right? Anything else I should be taking into account?
