Looking at the error-handling logic I'll need: the characteristics of the
transport are such that once I send short messages there's no way to call
them back, so the best I can do is wait for them all to complete (possibly
with a failure code) and then clean them up.  For the DMA ops, there isn't a
good (i.e., race-free) way to tell what the instantaneous state of a transfer
is, so the best I can do is invalidate the state of the relevant buffer
descriptors and wait for them to trickle out, presumably with errors.  That's
true for both tx and rx ops.

The one bit of good news is that we're pretty sure that, in the case of a
node failure or the like, we can arrange to know within a small bounded time
(seconds) that all pending traffic is done.  In the case of a link failure
(which stops clocking the DMA engine for that link) we can ensure that
nothing is happening there, which means it's safe to reset the engine, which
in turn means it's safe to yank out the pending buffer descriptors without
the engine scribbling all over the memory later.

Given all that, I think my error-recovery plan is roughly as follows:

For short messages, do nothing: once they're gone, they're gone.  Ignore
any received from peers which aren't in "good" state, except of course for
hellos.
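
The receive-side rule for short messages could be sketched as below.  This
is only an illustration: the enum names and short_msg_accept() are
hypothetical, not existing LNet or driver symbols, and the set of peer
states is an assumption.

```c
#include <stdbool.h>

/* Hypothetical peer states and message types, for illustration only. */
enum peer_state { PEER_GOOD, PEER_SUSPECT, PEER_DOWN };
enum msg_type   { MSG_HELLO, MSG_DATA };

/* Returns true if a received short message should be processed. */
static bool short_msg_accept(enum peer_state state, enum msg_type type)
{
    if (type == MSG_HELLO)
        return true;            /* hellos are let through in any state */
    return state == PEER_GOOD;  /* everything else needs a "good" peer */
}
```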

For RDMA transmits, if I get a notification that the peer is down, invalidate
the descriptors which correspond to the buffers, and wait for them all to
return.  If they come back "ok", presumably that means I lost the race, and
they get finalized as normal.  All the ones that come back with errors get
finalized with some kind (what kind?) of error.  Is it safe to assume that in
that case LNet and friends will deal with whatever recovery is necessary?
For instance, if an alternate OSS comes online, do they DTRT about replaying
any pending messages to the new peer?
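
A minimal sketch of the transmit path described above, under stated
assumptions: struct tx_desc, tx_peer_down() and tx_complete() are
hypothetical names, and -EIO is only a placeholder for the still-open
question of which error to finalize with.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* All names here are hypothetical; they are not LNet or driver APIs. */
enum desc_state { DESC_FREE, DESC_PENDING, DESC_INVALIDATED, DESC_DONE };

struct tx_desc {
    int             peer;    /* id of the peer this transfer targets */
    enum desc_state state;
    int             result;  /* 0 or negative errno, valid once DESC_DONE */
};

/* Peer-down notification: invalidate every pending descriptor for that
 * peer.  Nothing is finalized here; we still wait for the engine to
 * return each descriptor.  Returns the number invalidated. */
static size_t tx_peer_down(struct tx_desc *descs, size_t n, int peer)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        if (descs[i].peer == peer && descs[i].state == DESC_PENDING) {
            descs[i].state = DESC_INVALIDATED;
            count++;
        }
    }
    return count;
}

/* Completion callback: hw_ok is the engine's verdict for this descriptor. */
static void tx_complete(struct tx_desc *d, int hw_ok)
{
    assert(d->state == DESC_PENDING || d->state == DESC_INVALIDATED);

    /* An "ok" on an invalidated descriptor means the invalidation lost
     * the race with the transfer, so it is finalized as normal.
     * -EIO is a placeholder for the undecided error code. */
    d->result = hw_ok ? 0 : -EIO;
    d->state = DESC_DONE;
}
```

The point of keeping invalidation and finalization separate is that the
completion path stays the only place a descriptor is finalized, whether or
not the peer went down first.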

For RDMA receives, if I get a notification that the peer is down, I do the
same sort of thing: invalidate the descriptors.  In addition, if it's a link
failure, I must do the under-the-hood stuff to guarantee that the engine is
stopped, then signal completion on the relevant buffers by hand.  After that
I think the rest of the logic about finalizing applies.
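
The ordering constraint on the receive side (invalidate, stop the engine,
only then complete by hand) might look like the sketch below.  Everything
in it is an assumption made for illustration: struct dma_link,
link_stop_engine() and rx_peer_down() are hypothetical, the engine stop is
reduced to setting a flag, and -EIO is a placeholder error code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical types; a real driver would map these onto its own. */
enum rx_state { RX_PENDING, RX_INVALIDATED, RX_DONE };

struct rx_desc {
    enum rx_state state;
    int           result;   /* 0 or negative errno once RX_DONE */
};

struct dma_link {
    bool            engine_stopped;
    struct rx_desc *rx;
    size_t          nrx;
};

/* Stand-in for the under-the-hood work that guarantees the engine is
 * stopped; here it only records the fact. */
static void link_stop_engine(struct dma_link *l)
{
    l->engine_stopped = true;
}

/* Peer-down handling for receives.  Returns how many descriptors were
 * completed by hand (always 0 unless the link itself failed). */
static size_t rx_peer_down(struct dma_link *l, bool link_failed)
{
    size_t forced = 0;

    for (size_t i = 0; i < l->nrx; i++)
        if (l->rx[i].state == RX_PENDING)
            l->rx[i].state = RX_INVALIDATED;

    if (!link_failed)
        return 0;   /* engine still clocked: wait for normal completions */

    link_stop_engine(l);
    assert(l->engine_stopped);  /* never reclaim buffers from a live engine */

    for (size_t i = 0; i < l->nrx; i++) {
        if (l->rx[i].state == RX_INVALIDATED) {
            l->rx[i].result = -EIO;  /* placeholder error code */
            l->rx[i].state = RX_DONE;
            forced++;
        }
    }
    return forced;
}
```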

Does that sound roughly right?  Anything else I should be taking into account?

_______________________________________________
Lustre-devel mailing list
[email protected]
https://mail.clusterfs.com/mailman/listinfo/lustre-devel
