>What would be more helpful would be to have OFI provide a well-specified
>mechanism for reporting communication failures that it can’t
>automatically resolve. Some sort of error reporting from OFI calls to say
>that a specific send failed would be nice. From that error code, we can
>infer which target failed since OFI doesn’t have any collectives which
>would make this more difficult.

Errors should be reported through the CQ and read back with fi_cq_readerr. That’s what you want, right?

Thanks,
Sayantan.

>
>Thanks,
>Wesley
>
>On 9/8/15, 11:57 AM, "[email protected] on behalf of
>Hefty, Sean" <[email protected] on behalf of
>[email protected]> wrote:
>
>>> What's the state of fault-tolerance in OFI? Would it be prudent for
>>> someone to write OFI code that aspired to survive process failures?
>>> Are any implementations known to support this robustly right now?
>>
>>This would be provider specific. I'm not aware of anything that's coded
>>to handle failures.
>>
>>Having an example of this over libfabric would be great, though I'm not
>>sure who's going to volunteer to write this.
>>
>>It's not clear to me how fault tolerance relates to a networking API.
>>For example, what specific lower-level features does an app need to make
>>this happen? Are there restrictions that providers need to report to
>>apps regarding their level of support? Is this something that even
>>belongs to this level of API?
>>_______________________________________________
>>ofiwg mailing list
>>[email protected]
>>http://lists.openfabrics.org/mailman/listinfo/ofiwg
