A thread reading the result of an orphaned task can indeed hang. Good catch. Exceptions, as well as task results, can be passed across threads using std::shared_future. If a task thread exits with an exception, the caller of std::future::get will get that exception, assuming the exiting thread stored it in the corresponding std::promise. No exception should escape the boundary of the thread that threw it. The top-level thread can then translate the exception into an error string and report back gracefully.
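To make this concrete, here is a minimal sketch of the pattern in standard C++. It is not MXNet code: run_task and run_and_report are hypothetical names I made up for illustration, and the failing operator is simulated with a plain std::runtime_error.

```cpp
#include <exception>
#include <future>
#include <stdexcept>
#include <string>
#include <thread>

// Hypothetical task body (not an MXNet API): throws to simulate a failed op.
int run_task(bool fail) {
  if (fail) throw std::runtime_error("op A failed");
  return 42;
}

// Hypothetical top-level helper: runs the task on a worker thread and
// returns either its result or an error string.
std::string run_and_report(bool fail) {
  std::promise<int> promise;
  // A shared_future can be copied, so several reader threads could wait on it.
  std::shared_future<int> result = promise.get_future().share();

  std::thread worker([&promise, fail] {
    try {
      promise.set_value(run_task(fail));
    } catch (...) {
      // Nothing escapes the worker: the exception is stored in the promise.
      promise.set_exception(std::current_exception());
    }
  });

  std::string report;
  try {
    // get() returns the value or rethrows the stored exception here.
    report = "ok: " + std::to_string(result.get());
  } catch (const std::exception& e) {
    // The top-level thread translates the exception into an error string.
    report = std::string("error: ") + e.what();
  }
  worker.join();
  return report;
}
```

Under these assumptions, run_and_report(false) returns "ok: 42" and run_and_report(true) returns "error: op A failed". Because the worker always sets either a value or an exception, a reader blocked in get() is always released, which is what avoids the hang.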
Eftiquar

On 1/19/18, 1:47 PM, "Li, Mu" <[email protected]> wrote:

Very good document, thanks!

One issue with approach 1 is that resuming the operators after the failed one may cause errors and even a system hang. Say op A writes var V while op B reads V. Then B will not be executed if A fails, unless we clear their dependencies, but that will lead to wrong results as well.

Best
Mu

> On Jan 19, 2018, at 10:07 AM, Anirudh <[email protected]> wrote:
>
> Hi,
>
> I have outlined the approach and proof of concept for Better Exception
> Handling in MXNet. Please provide feedback/comments/suggestions in the
> comments section of the wiki.
>
> https://cwiki.apache.org/confluence/display/MXNET/Improved+exception+handling+in+MXNet
>
>
> Note: Responses will be delayed till 01/22/2018.
>
> Anirudh
