One thing I can think of is having the provider attempt some link-level 
resilience, where it fails over to other paths, if any exist, when it detects 
a failure. That’s somewhat low-hanging fruit and probably not the 
responsibility of OFI itself.

What would be more helpful is for OFI to provide a well-specified 
mechanism for reporting communication failures that it can’t automatically 
resolve. Some sort of error reporting from OFI calls saying that a specific 
send failed would be nice. From that error code, we can infer which target 
failed, since OFI has no collectives to make that inference more 
difficult.
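
To make that concrete, here is a rough sketch of the kind of lookup I have in 
mind. It uses no real OFI calls; send_ctx, err_completion, and failed_target 
are all hypothetical names, and it assumes the error report hands back 
whatever context pointer the application attached to the send.

```c
#include <stddef.h>

/* Hypothetical per-operation context the application attaches to each send. */
struct send_ctx {
    int target_rank;   /* peer this send was addressed to */
};

/* Hypothetical error report: the context given at post time plus an error code. */
struct err_completion {
    void *op_context;  /* the send_ctx pointer supplied when the send was posted */
    int   err;         /* nonzero when the operation failed */
};

/* Return the rank of the failed target, or -1 if this is not an error report. */
static int failed_target(const struct err_completion *ec)
{
    if (ec == NULL || ec->err == 0 || ec->op_context == NULL)
        return -1;
    return ((const struct send_ctx *)ec->op_context)->target_rank;
}
```

Since every send names exactly one target, the application can go straight 
from a failed send to the peer it should treat as suspect.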

Thanks,
Wesley



On 9/8/15, 11:57 AM, "[email protected] on behalf of Hefty, 
Sean" <[email protected] on behalf of [email protected]> 
wrote:

>> What's the state of fault-tolerance in OFI?  Would it be prudent for
>> someone to write OFI code that aspired to survive process failures?  Are
>> any implementations known to support this robustly right now?
>
>This would be provider specific.  I'm not aware of anything that's coded to 
>handle failures.
>
>Having an example of this over libfabric would be great, though I'm not sure 
>who's going to volunteer to write this.
>
>It's not clear to me how fault tolerance relates to a networking API.  For 
>example, what specific lower-level features does an app need to make this 
>happen?  Are there restrictions that providers need to report to apps 
>regarding their level of support?  Is this something that even belongs to this 
>level of API?
>_______________________________________________
>ofiwg mailing list
>[email protected]
>http://lists.openfabrics.org/mailman/listinfo/ofiwg