Jason Lowe commented on TEZ-3914:

Thanks for the report and patch!  Many of the unit test failures are related.  
Could you elaborate a bit more on the approach taken for the fix?  It's a 
rather sizeable patch, and a high-level overview would help for the review.  

> Recovering a large DAG hang job
> -------------------------------
>                 Key: TEZ-3914
>                 URL: https://issues.apache.org/jira/browse/TEZ-3914
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Jonathan Eagles
>            Assignee: Jonathan Eagles
>            Priority: Major
>         Attachments: TEZ-3914.001.patch, TEZ-3914.002.patch
>
> Any failure to parse a recovery event is ignored and treated as EOF. The job
> can hang because some task completions may be missed, which causes the
> shuffle to hang.
