[ 
https://issues.apache.org/jira/browse/TEZ-3914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16437400#comment-16437400
 ] 

Jason Lowe commented on TEZ-3914:
---------------------------------

Thanks for the report and patch!  Many of the unit test failures are related.  
Could you elaborate a bit more on the approach taken for the fix?  It's a 
rather sizeable patch, and a high-level overview would help with the review.

> Recovering a large DAG hang job
> -------------------------------
>
>                 Key: TEZ-3914
>                 URL: https://issues.apache.org/jira/browse/TEZ-3914
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Jonathan Eagles
>            Assignee: Jonathan Eagles
>            Priority: Major
>         Attachments: TEZ-3914.001.patch, TEZ-3914.002.patch
>
>
> Any failure to parse a recovery event is ignored and treated as EOF. The job
> can hang because some task completions may be missed, causing shuffle to hang.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
