One thing I can think of is having the provider attempt some link-level resilience, where it tries to fail over to other paths, if possible, when a failure is detected. That’s somewhat low-hanging fruit and probably not the responsibility of OFI itself.
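As a rough illustration of the fail-over idea, here is a minimal sketch in C. Every name in it (struct path, path_send, send_with_failover, the status codes) is hypothetical and made up for this example; none of it is libfabric API. It assumes a provider that knows about several paths to a target, tries them in order, and only reports a failure upward, tagged with the target, once every path has failed.

```c
#include <stddef.h>

/* Hypothetical types for illustration only; this is NOT libfabric API. */
enum send_status { SEND_OK = 0, SEND_LINK_DOWN = 1 };

struct path {
    const char *name;   /* label for the path, e.g. a rail or port */
    int link_up;        /* 1 if the link is currently usable, 0 otherwise */
};

struct send_result {
    enum send_status status; /* SEND_OK, or the failure that could not be resolved */
    int target;              /* peer the send was addressed to */
    int path_used;           /* index of the path that succeeded, or -1 */
    int attempts;            /* number of paths tried */
};

/* Stand-in for a real transmit: succeeds only when the link is up. */
static enum send_status path_send(const struct path *p, int target,
                                  const void *buf, size_t len)
{
    (void)target; (void)buf; (void)len;
    return p->link_up ? SEND_OK : SEND_LINK_DOWN;
}

/*
 * Try each path in order. The first success hides earlier link failures
 * from the caller; if every path fails, the error is returned together
 * with the target so the caller can tell which peer is unreachable.
 */
static struct send_result send_with_failover(const struct path *paths,
                                             size_t npaths, int target,
                                             const void *buf, size_t len)
{
    struct send_result r = { SEND_LINK_DOWN, target, -1, 0 };

    for (size_t i = 0; i < npaths; i++) {
        r.attempts++;
        if (path_send(&paths[i], target, buf, len) == SEND_OK) {
            r.status = SEND_OK;
            r.path_used = (int)i;
            return r;
        }
    }
    return r; /* unresolved failure: status and target identify what failed */
}
```

A caller would invoke something like send_with_failover(paths, npaths, peer, buf, len) and check status; because the result carries the target, a point-to-point failure maps directly to one peer.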
What would be more helpful is for OFI to provide a well-specified mechanism for reporting communication failures that it can’t resolve automatically. Some sort of error reporting from OFI calls, saying that a specific send failed, would be nice. From that error code, we could infer which target failed, since OFI doesn’t have any collectives, which would make this more difficult.

Thanks,
Wesley

On 9/8/15, 11:57 AM, "[email protected] on behalf of Hefty, Sean" <[email protected] on behalf of [email protected]> wrote:

>> What's the state of fault-tolerance in OFI? Would it be prudent for
>> someone to write OFI code that aspired to survive process failures? Are
>> any implementations known to support this robustly right now?
>
>This would be provider specific. I'm not aware of anything that's coded to
>handle failures.
>
>Having an example of this over libfabric would be great, though I'm not sure
>who's going to volunteer to write this.
>
>It's not clear to me how fault tolerance relates to a networking API. For
>example, what specific lower-level features does an app need to make this
>happen? Are there restrictions that providers need to report to apps
>regarding their level of support? Is this something that even belongs to this
>level of API?
>_______________________________________________
>ofiwg mailing list
>[email protected]
>http://lists.openfabrics.org/mailman/listinfo/ofiwg
