John,

> Given all that, I think my error-recovery plan is approx as follows:
> 
> For short-messages, ignore them, once they're gone, they're gone.
> Ignore those which are received from peers which aren't in "good"
> state, except of course for hellos.

Fine

> For rdma transmits, if I get a notification that the peer is down,
> invalidate the descriptors which correspond to the buffers, and wait
> for them all to return.  If they come back "ok", presumably that
> means I lost the race, and they get finalized as normal.  All the
> ones that come back with errors get finalized with some kind (what
> kind?) of error.  

Pass any non-zero completion status to flag an error -
e.g. lnet_finalize(ni, msg, -EIO)
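
A minimal sketch of that status mapping, for illustration only. The
types `fake_ni`, `fake_msg` and `tx_desc`, the `hw_ok` flag, and the
stub `stub_lnet_finalize()` are all hypothetical stand-ins; only the
(ni, msg, status) argument order is taken from the call above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the real LNET types; illustration only. */
struct fake_ni  { int unused; };
struct fake_msg { int status; int finalized; };

/* Stub using the same argument order as the lnet_finalize() call above.
 * It only records the status; the real function does far more. */
static void stub_lnet_finalize(struct fake_ni *ni, struct fake_msg *msg,
                               int status)
{
    (void)ni;
    assert(!msg->finalized);    /* each message is finalized exactly once */
    msg->status = status;
    msg->finalized = 1;
}

/* Hypothetical record of one RDMA transmit descriptor that has returned. */
struct tx_desc {
    struct fake_msg *msg;
    int hw_ok;                  /* non-zero if the transfer completed "ok" */
};

/* Finalize every returned descriptor: 0 if it came back "ok" (we lost
 * the race with the peer-down notification), -EIO otherwise.
 * Returns the number of descriptors finalized with an error. */
static size_t finalize_returned(struct fake_ni *ni, struct tx_desc *descs,
                                size_t n)
{
    size_t failed = 0;

    for (size_t i = 0; i < n; i++) {
        int status = descs[i].hw_ok ? 0 : -EIO;

        if (status != 0)
            failed++;
        stub_lnet_finalize(ni, descs[i].msg, status);
    }
    return failed;
}
```

-EIO is just the example value from above; any non-zero status would
flag the error equally well.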

> Is it safe to assume that in that case lnet and friends will deal
> with whatever kind of recovery is necessary, for instance if an
> alternate OSS comes on line, do they DTRT about replaying any
> pending messages to the new peer?

Yes, that's not LNET's or the LND's concern at all.

> For rdma receives, if I get a notification that the peer is down, I
> do the same sort of thing; invalidate the descriptors.  In addition,
> if it's a link failure, I must do the under-the-hood stuff to
> guarantee that the engine is stopped, then by hand signal completion
> on the relevant buffers.  After that I think the rest of the logic
> about finalizing applies.

Sounds fine.

> Does that sound roughly right?  Anything else I should be taking
> into account?

The guiding principles for completion are...

1. If you return success from lnd_send or lnd_recv, you must call
   lnet_finalize() within finite time.

2. You may only call lnet_finalize() when there is no longer any
   chance that the underlying network can touch (read or write) the
   payload buffer.

3. The completion status on sends isn't critical.  Lustre only really
   needs to know that sending is over; knowing whether the send was
   good or not is really just icing on the cake (e.g. so that it
   doesn't have to wait for a full timeout for an RPC reply if sending
   the request failed).

4. The completion status on receives is completely critical.  You may
   only return success if the sink buffer has been filled correctly.
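
Principles 2 and 4 can be sketched roughly as follows. This is not
real LND code: `struct rx_desc` and all of its fields are hypothetical
bookkeeping, and the point where a real LND would call
lnet_finalize() is only marked with a comment.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical receive descriptor; field names are illustrative only. */
struct rx_desc {
    bool   hw_owns_buffer;  /* network may still read or write the payload */
    bool   hw_error;        /* transfer reported an error or was aborted */
    size_t expected;        /* bytes the sink buffer should receive */
    size_t received;        /* bytes actually written to the sink buffer */
    bool   finalized;       /* completion has already been signalled */
    int    status;          /* status that would go to lnet_finalize() */
};

/* Principle 4: report success only if the sink buffer was filled
 * correctly; anything else is an error. */
static int rx_completion_status(const struct rx_desc *rx)
{
    if (rx->hw_error || rx->received != rx->expected)
        return -EIO;
    return 0;
}

/* Principle 2: refuse to finalize while the network can still touch
 * the buffer.  Returns true if this call finalized the descriptor. */
static bool rx_try_finalize(struct rx_desc *rx)
{
    assert(rx != NULL);

    if (rx->hw_owns_buffer || rx->finalized)
        return false;

    rx->status = rx_completion_status(rx);
    rx->finalized = true;   /* a real LND would call lnet_finalize() here */
    return true;
}
```

On a link failure, the "stop the engine" step is what would clear
`hw_owns_buffer` in this sketch; until that has happened the
descriptor stays pending, and principle 1 still requires that it is
finalized eventually.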

                Cheers,
                        Eric

