Thank you for your feedback, Mu and Eftiquar! Let's say op A writes var V and op B reads V. If an exception is thrown during the execution of A, the callback for A will still be called explicitly. This ensures that all dependencies are cleared and that the write dependencies (here, V) have their exception_ptr member set if needed, which prevents the system hang. I have modified the wiki to make this point clear.
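
To make the point above concrete, here is a minimal C++ sketch of the idea, not the actual engine code. The names Var, Op, Execute, OnComplete, ErrorOf and RunDemo are made up for illustration, and the sketch assumes a single-threaded toy scheduler rather than the real threaded engine. It only shows that the completion callback runs even when op A throws, so V is released and carries the exception, and op B is not left waiting.

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for an engine variable (not MXNet's real type).
struct Var {
  std::exception_ptr exception;  // set when an op writing this var failed
  bool ready = false;            // true once the writer's callback has fired
};

// Hypothetical stand-in for an engine operator.
struct Op {
  std::function<void()> fn;
  std::vector<Var*> reads;
  std::vector<Var*> writes;
};

// Completion callback. It is invoked whether or not the op failed, so the
// write dependencies are always released and readers never wait forever.
void OnComplete(const Op& op, std::exception_ptr err) {
  for (Var* v : op.writes) {
    if (err) v->exception = err;  // record the failure on the written var
    v->ready = true;              // clear the dependency in either case
  }
}

// Runs one op. The exception is caught here so the callback still runs.
void Execute(const Op& op) {
  assert(op.fn);  // an op must have a body
  std::exception_ptr err;
  try {
    op.fn();
  } catch (...) {
    err = std::current_exception();
  }
  OnComplete(op, err);
}

// Returns the message of the exception stored on v, or "" if there is none.
std::string ErrorOf(const Var& v) {
  if (!v.exception) return "";
  try {
    std::rethrow_exception(v.exception);
  } catch (const std::exception& e) {
    return e.what();
  } catch (...) {
    return "unknown error";
  }
}

struct DemoResult {
  bool b_ran;
  bool v_ready;
  std::string v_error;
};

// Op A writes V and throws; op B reads V and is scheduled afterwards.
DemoResult RunDemo() {
  Var v;
  bool b_ran = false;
  Op a{[] { throw std::runtime_error("op A failed"); }, {}, {&v}};
  Op b{[&b_ran] { b_ran = true; }, {&v}, {}};

  Execute(a);

  // B becomes runnable only once every var it reads has been released.
  bool runnable = true;
  for (Var* r : b.reads) runnable = runnable && r->ready;
  if (runnable) Execute(b);

  return {b_ran, v.ready, ErrorOf(v)};
}
```

In this sketch B still runs after A fails, and the failure stays recorded on V so that it can be reported later.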
As Mu mentioned, one drawback of this approach is that operators that depend on the failed operator will still execute and will produce wrong results. I thought this behavior would be acceptable if we clearly document that all operators depending on a failed operator produce unreliable results. If this is not acceptable to users, and the failure of an operator must stop execution of every operator that depends on it, then we need to use approach 2, which uses an on-start callback to decide, before an operator executes, whether it should run.

Please advise.

Anirudh

On Sat, Jan 20, 2018 at 3:14 PM, Shaikh, Eftiquar <[email protected]> wrote:

> A thread reading the result corresponding to an orphaned task can indeed
> cause hang. Good catch.
> The exceptions as well as task results can be passed across threads using
> std::shared_future. If a task thread exited with an exception, the caller
> of std::future::get will get an exception. Assuming the exiting thread
> stored the exception in the corresponding std::promise.
> No exception should escape the boundary of the thread that threw it.
> And the top-level thread can then translate the exception into error
> string and report back gracefully.
>
> Eftiquar
>
> On 1/19/18, 1:47 PM, "Li, Mu" <[email protected]> wrote:
>
> Very good document, thanks!
>
> One issue with approach 1 is that resuming the operator after the
> failed one may cause error and even system hang. Say if op A writes var V
> while op B reads V. Then B will not be executed if A is failed, unless we
> clear their dependencies, but it will lead to wrong results as well.
>
> Best
> Mu
>
> > On Jan 19, 2018, at 10:07 AM, Anirudh <[email protected]> wrote:
> >
> > Hi,
> >
> > I have outlined the approach and proof of concept for Better Exception
> > Handling in MXNet. Please provide feedback/comments/suggestions in the
> > comments section of the wiki.
> >
> > https://cwiki.apache.org/confluence/display/MXNET/Improved+exception+handling+in+MXNet
> >
> > Note: Responses will be delayed till 01/22/2018.
> >
> > Anirudh
