Very good document, thanks! One issue with approach 1 is that resuming the operators after the failed one may cause errors or even a system hang. Say op A writes var V while op B reads V. Then B will never be executed if A fails, unless we clear their dependencies. But clearing them will lead to wrong results as well, since B would then read a V that A never wrote.
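
To make the hazard concrete, here is a toy sketch in Python. It is not MXNet's engine: the `run` function, the op tuples, and the `clear_deps_on_failure` flag are all hypothetical names made up for illustration, and the sketch assumes V holds an initial value of 0 before A runs.

```python
# Toy model of a dependency engine. Hypothetical names throughout; this is
# not how MXNet's engine is implemented.

def op_a(values):
    # A is supposed to write V, but fails before producing a result.
    raise RuntimeError("A failed")

def op_b(values):
    # B reads V and derives W from it.
    return values["V"] + 1

# Each op is (name, variables read, variables written, function).
OPS = [
    ("A", [], ["V"], op_a),
    ("B", ["V"], ["W"], op_b),
]

def run(ops, initial, clear_deps_on_failure):
    values = dict(initial)
    # Variables that still have an unfinished writer; readers must wait on them.
    pending = {w for _, _, writes, _ in ops for w in writes}
    executed, blocked = [], []
    for name, reads, writes, fn in ops:
        if any(r in pending for r in reads):
            # A real engine would leave this op waiting forever (the hang).
            blocked.append(name)
            continue
        try:
            result = fn(values)
        except RuntimeError:
            if clear_deps_on_failure:
                # Release the readers even though nothing was written.
                pending.difference_update(writes)
            continue
        for w in writes:
            values[w] = result
        pending.difference_update(writes)
        executed.append(name)
    return executed, blocked, values

# Dependencies kept: B never runs.
print(run(OPS, {"V": 0}, clear_deps_on_failure=False))
# Dependencies cleared: B runs, but on the stale V that A never updated.
print(run(OPS, {"V": 0}, clear_deps_on_failure=True))
```

In the first call B stays blocked, and in the second B runs but computes W from the old V, so neither option gives a correct result.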
Best,
Mu

> On Jan 19, 2018, at 10:07 AM, Anirudh <[email protected]> wrote:
>
> Hi,
>
> I have outlined the approach and proof of concept for Better Exception
> Handling in MXNet. Please provide feedback/comments/suggestions in the
> comments section of the wiki.
>
> https://cwiki.apache.org/confluence/display/MXNET/Improved+exception+handling+in+MXNet
>
> Note: Responses will be delayed till 01/22/2018.
>
> Anirudh
