One fault tolerance issue that we know we need some clarity on is being tracked here: https://github.com/ofiwg/libfabric/issues/826
-Dave

On Sep 8, 2015, at 12:40 PM, Jeff Hammond <[email protected]> wrote:

> Some of the requirements for FT include:
> - precise error code reporting on failures. deadlock never occurs due to
>   remote process failure.
> - containment of side effects of endpoint failures, especially no byzantine
>   behavior.
> - easy to deregister failed endpoints.
> - easy to register new endpoints on the fly. (think MPI_Comm_spawn_multiple
>   here)
>
> Thanks,
>
> Jeff
>
> On Tue, Sep 8, 2015 at 10:28 AM, Sur, Sayantan <[email protected]> wrote:
>
> >What would be more helpful would be to have OFI provide a well-specified
> >mechanism for reporting communication failures that it can’t
> >automatically resolve. Some sort of error reporting from OFI calls to say
> >that a specific send failed would be nice. From that error code, we can
> >infer which target failed since OFI doesn’t have any collectives which
> >would make this more difficult.
>
> Errors should be reported to the CQ readerr. That’s what you want, right?
>
> Thanks,
> Sayantan.
>
> >Thanks,
> >Wesley
> >
> >On 9/8/15, 11:57 AM, "[email protected] on behalf of
> >Hefty, Sean" <[email protected] on behalf of
> >[email protected]> wrote:
> >
> >>> What's the state of fault-tolerance in OFI? Would it be prudent for
> >>> someone to write OFI code that aspired to survive process failures?
> >>> Are any implementations known to support this robustly right now?
> >>
> >>This would be provider specific. I'm not aware of anything that's coded
> >>to handle failures.
> >>
> >>Having an example of this over libfabric would be great, though I'm not
> >>sure who's going to volunteer to write this.
> >>
> >>It's not clear to me how fault tolerance relates to a networking API.
> >>For example, what specific lower-level features does an app need to make
> >>this happen? Are there restrictions that providers need to report to
> >>apps regarding their level of support? Is this something that even
> >>belongs to this level of API?
> >>_______________________________________________
> >>ofiwg mailing list
> >>[email protected]
> >>http://lists.openfabrics.org/mailman/listinfo/ofiwg
>
> --
> Jeff Hammond
> [email protected]
> http://jeffhammond.github.io/
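
For anyone following the "errors go to the CQ readerr" point above, here is a rough sketch of the pattern being discussed: poll the completion queue, and when it signals that an error entry is pending, pull that entry, use the per-operation context to work out which peer the failed send targeted, and mark that peer as failed. This is a toy model only. Every toy_* name below is hypothetical and is not part of libfabric. In libfabric itself the corresponding steps would be fi_cq_read() returning -FI_EAVAIL and fi_cq_readerr() filling a struct fi_cq_err_entry whose op_context identifies the operation; whether a given provider reports remote process failure this way is exactly the open question in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical toy model of a completion queue. NOT the libfabric API. */
enum { TOY_OK = 1, TOY_EAGAIN = 0, TOY_EAVAIL = -1 };

#define TOY_MAX_PEERS 4
#define TOY_CQ_DEPTH  8

/* Per-operation context: remembers which peer the send targeted. */
struct toy_op { int peer; };

struct toy_completion {
    struct toy_op *op;  /* context supplied when the operation was posted */
    int err;            /* 0 on success, nonzero error code on failure */
};

struct toy_cq {
    struct toy_completion q[TOY_CQ_DEPTH];
    size_t head, tail;  /* simple linear queue, no wraparound */
};

struct toy_peer_table { int alive[TOY_MAX_PEERS]; };

/* Stand-in for the provider posting a completion (success or failure). */
void toy_cq_push(struct toy_cq *cq, struct toy_op *op, int err)
{
    assert(cq->tail < TOY_CQ_DEPTH);
    cq->q[cq->tail].op = op;
    cq->q[cq->tail].err = err;
    cq->tail++;
}

/* Normal read: refuses to consume an error entry and reports TOY_EAVAIL. */
int toy_cq_read(struct toy_cq *cq, struct toy_op **op)
{
    if (cq->head == cq->tail)
        return TOY_EAGAIN;
    if (cq->q[cq->head].err != 0)
        return TOY_EAVAIL;
    *op = cq->q[cq->head++].op;
    return TOY_OK;
}

/* Error read: consumes the pending error entry, if there is one. */
int toy_cq_readerr(struct toy_cq *cq, struct toy_completion *out)
{
    if (cq->head == cq->tail || cq->q[cq->head].err == 0)
        return TOY_EAGAIN;
    *out = cq->q[cq->head++];
    return TOY_OK;
}

/*
 * Drain the queue. Successful completions are counted; for each failed
 * operation the target peer is inferred from the op context and marked
 * as not alive. Returns the number of successful completions.
 */
int toy_progress(struct toy_cq *cq, struct toy_peer_table *peers)
{
    int completed = 0;

    for (;;) {
        struct toy_op *op = NULL;
        struct toy_completion bad;
        int rc = toy_cq_read(cq, &op);

        if (rc == TOY_EAGAIN)
            break;                  /* queue is empty */
        if (rc == TOY_OK) {
            completed++;
            continue;
        }
        /* rc == TOY_EAVAIL: an error entry is at the head of the queue */
        if (toy_cq_readerr(cq, &bad) != TOY_OK)
            break;                  /* defensive: nothing to read after all */
        assert(bad.op->peer >= 0 && bad.op->peer < TOY_MAX_PEERS);
        peers->alive[bad.op->peer] = 0;
        fprintf(stderr, "send to peer %d failed (err %d)\n",
                bad.op->peer, bad.err);
    }
    return completed;
}
```

The design point is that the failure surfaces as an ordinary queue entry rather than a hang, and the op context is what lets the caller map "a specific send failed" back to a target. Deregistering the failed endpoint and admitting new ones would sit on top of the peer table, which is left out here.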
